Operational Data to Support and Enable Computer Science Research
Open computer science research urgently needs access to computer operational data. Data on failures, availability, usage, environment, performance, and workload characterization are among the most needed by computer science researchers. The following data sets are provided under universal release to any computer science researcher for use in computer science work.
All we ask is that, if you use these data in your research, you acknowledge Los Alamos National Laboratory for providing them.
The first set of data was made available in 2005 for times spanning 1995-2005; an update is now available that adds failure data from 2005 through 09/2011.
Computer systems from which operational data was drawn
CMU paper number | system number | system type | number nodes | total cpus or cores | cpus or cores per node | install date | production date | decommission date | FRU type | memory per node in Gbytes | cpu type num | mem type num | number interconnects | use type | notes |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
20 | 2 | cluster | 49 | 6152 | 128 | 11/1/1996 | 1/1/1997 | 2/1/2007 | part | 128 | 1 | 1 | 12 | graphics.compute | Includes front end | |
9 | 3 | cluster | 128 | 512 | 4 | 8/1/2003 | 9/1/2003 | 1/6/2009 | part | 4 | 2 | 2 | 1 | compute | ||
10 | 4 | cluster | 128 | 512 | 4 | 8/1/2003 | 9/1/2003 | 1/6/2009 | part | 4 | 2 | 2 | 1 | compute | ||
11 | 5 | cluster | 128 | 512 | 4 | 8/1/2003 | 9/1/2003 | 1/6/2009 | part | 4 | 2 | 2 | 1 | compute | ||
12 | 6 | cluster | 32 | 128 | 4 | 8/1/2003 | 9/1/2003 | 1/6/2009 | part | 16 | 2 | 2 | 1 | compute | ||
1 | 7 | smp | 1 | 8 | 8 | before tracking | before tracking | 12/1/1999 | part | 16 | 3 | 3 | 0 | compute | Early failure data not available | |
4 | 8 | cluster | 164 | 328 | 2 | 3/1/2001 | 4/1/2001 | 10/1/2005 | part | 1 | 4 | 4 | 1 | compute | Data assumes full 384 procs but cluster was not always that big | |
14 | 9 | cluster | 256 | 512 | 2 | 8/1/2003 | 9/1/2003 | 10/5/2009 | node | 4 | 2 | 8 | 1 | compute | ||
15 | 10 | cluster | 256 | 512 | 2 | 8/1/2003 | 9/1/2003 | 10/5/2009 | node | 4 | 2 | 8 | 1 | compute | ||
16 | 11 | cluster | 256 | 512 | 2 | 8/1/2003 | 9/1/2003 | 10/5/2009 | node | 4 | 2 | 8 | 1 | compute | ||
18 | 12 | cluster | 512 | 1024 | 2 | 8/1/2003 | 9/1/2003 | 10/5/2009 | node | 4 | 2 | 8 | 1 | compute | When two units were joined as one 512 node system | |
17 | 13 | cluster | 256 | 512 | 2 | 8/1/2003 | 9/1/2003 | 10/5/2009 | node | 4 | 2 | 8 | 1 | compute | ||
13 | 14 | cluster | 128 | 256 | 2 | 8/1/2003 | 9/1/2003 | 10/5/2009 | node | 4 | 2 | 8 | 1 | compute | ||
22 | 15 | numa | 1 | 265 | 256 | 11/1/2004 | 11/1/2004 | 10/5/2009 | part | 1024 | 5 | 5 | 1 | compute | ||
19 | 16 | cluster | 16 | 2048 | 128 | 10/1/1996 | 12/1/1996 | 9/1/2002 | part | 32 | 1 | 1 | 4 | compute | Part of time with fewer nodes | |
7 | 18 | cluster | 1024 | 4096 | 4 | 3/1/2002 | 5/1/2002 | 3/1/2008 | part | 16 | 2 | 2 | 2 | compute | ||
8 | 19 | cluster | 1024 | 4096 | 4 | 8/1/2002 | 10/1/2002 | 3/1/2008 | part | 16 | 2 | 2 | 2 | compute | ||
5 | 20 | cluster | 512 | 2048 | 4 | 10/1/2001 | 12/1/2001 | 3/1/2008 | part | 16 | 2 | 2 | 2 | compute | ||
6 | 21 | cluster | 128 | 512 | 4 | 8/1/2001 | 9/1/2001 | 1/1/2002 | part | 16 | 2 | 2 | 2 | compute | ||
3 | 22 | smp | 1 | 4 | 4 | before tracking | before tracking | 4/1/2003 | part | 1 | 6 | 6 | 0 | compute | ||
21 | 23 | cluster | 5 | 544 | 128 | 10/1/1998 | 10/1/1998 | 12/1/2004 | part | 128 | 1 | 1 | 4 | compute | Includes front end | |
2 | 24 | smp | 1 | 32 | 32 | before tracking | before tracking | 12/1/2003 | part | 8 | 7 | 7 | 0 | compute | ||
na | 25 | cluster | 128 | 256 | 2 | 12/2/2011 | N/A (testbed) | current | node | 4 | 8 | 8 | 1 | testbed | ||
na | 26 | cluster | 128 | 1024 | 8 | 10/1/2006 | N/A (testbed) | current | node | 16 | 9 | 9 | 1 | testbed | ||
na | 27 | cluster | 139 | 1112 | 8 | 10/1/2006 | 12/1/2006 | current | node | 16 | 9 | 9 | 1 | compute | ||
na | 28 | cluster | 1834 | 14672 | 8 | 10/1/2006 | 5/1/2007 | current | node | 32 | 9 | 9 | 1 | compute | ||
na | 29 | cluster | 12 | 48 | 4 | 1/1/2009 | 4/15/2009 | current | node | 32 | 10 | 9 | 1 | compute | ||
na | 30 | cluster | 360 | 1440 | 4 | 10/15/2008 | 2/1/2009 | current | node | 32 | 10 | 9 | 1 | compute | ||
na | 31 | cluster | 12 | 48 | 4 | 10/15/2008 | 2/10/2009 | current | node | 32 | 10 | 9 | 1 | compute | ||
na | 32 | cluster | 12 | 48 | 4 | 7/1/2009 | 9/1/2009 | current | node | 32 | 10 | 9 | 1 | compute | ||
na | 33 | cluster | 3060 | 12240 | 4 | 10/15/2008 | 2/1/2009 | current | node | 32 | 10 | 9 | 1 | compute | machine was in one network for about 9 months before being moved to a different network | |
na | 34 | cluster | 272 | 4352 | 16 | 6/1/2008 | 12/15/2008 | current | node | 32 | 11 | 10 | 1 | compute | machine had SDC issues, took long time to certify | |
na | 35 | cluster | 64 | 1024 | 16 | 7/1/2008 | 6/1/2009 | current | node | 32 | 11 | 10 | 1 | compute | machine had SDC issues, took long time to certify | |
na | 36 | cluster | 408 | 5760 | 16 | 7/1/2008 | 6/1/2009 | current | node | 32 | 11 | 10 | 1 | compute | machine had SDC issues, took long time to certify | |
na | 37 | cluster | 416 | 13312 | 32 | 9/1/2010 | 2/15/2011 | current | node | 64 | 11 | 11 | 1 | compute | ||
na | 38 | cluster | 592 | 4736 | 8 | 11/1/2010 | 3/1/2011 | current | node | 24 | 12 | 11 | 1 | compute | ||
na | 39 | cluster | 620 | 4960 | 8 | 11/1/2010 | 3/1/2011 | current | node | 24 | 12 | 11 | 1 | compute | ||
na | 40 | cluster | 68 | 1088 | 16 | 10/1/2010 | 12/1/2010 | current | node | 32 | 13 | 11 | 1 | compute | ||
na | 41 | cluster | 6654 | 107264 | 16 | 10/1/2010 | 2/1/2011 | current | node | 32 | 13 | 11 | 1 | compute | ||
na | 46 | cluster | 256 | 512 | 2 | 10/1/2005 | 11/15/2005 | 10/5/2009 | node | 16 | 2 | 8 | 1 | compute | ||
na | 47 | cluster | 256 | 512 | 2 | 10/1/2005 | 11/15/2005 | 10/5/2009 | node | 8 | 2 | 8 | 1 | compute | ||
na | 48 | cluster | 256 | 512 | 2 | 10/1/2005 | 11/15/2005 | 10/5/2009 | node | 8 | 2 | 8 | 1 | compute | ||
na | 49 | cluster | 256 | 512 | 2 | 10/1/2005 | 11/15/2005 | 10/5/2009 | node | 8 | 2 | 8 | 1 | compute | ||
na | 50 | cluster | 256 | 512 | 2 | 10/1/2005 | 11/15/2005 | 10/5/2009 | node | 8 | 2 | 8 | 1 | compute | ||
na | 51 | cluster | 256 | 512 | 2 | 10/1/2005 | 11/15/2005 | 10/5/2009 | node | 8 | 2 | 8 | 1 | compute | ||
na | 52 | cluster | 256 | 1024 | 4 | 10/1/2005 | 11/15/2005 | 10/5/2009 | node | 8 | 9 | 8 | 1 | compute | ||
na | 53 | cluster | 300 | 600 | 2 | 8/1/2003 | 9/1/2003 | 10/5/2009 | node | 8 | 2 | 8 | 1 | compute | ||
na | 54 | cluster | 256 | 512 | 2 | 8/1/2005 | 10/1/2005 | 10/5/2009 | node | 8 | 2 | 8 | 1 | compute | ||
na | 55 | cluster | 256 | 512 | 2 | 8/1/2005 | 10/1/2005 | 10/5/2009 | node | 8 | 2 | 8 | 1 | compute | ||
na | 56 | cluster | 128 | 256 | 2 | 8/1/2005 | 11/1/2005 | 10/5/2009 | node | 16 | 2 | 8 | 1 | compute | ||
na | 57 | cluster | 258 | 516 | 2 | 12/1/2005 | 6/1/2006 | 3/15/2011 | node | 8 | 2 | 8 | 1 | compute | SDC issue = heat and voltage delayed prod | |
na | 58 | cluster | 258 | 516 | 2 | 12/1/2005 | 6/1/2006 | 3/15/2011 | node | 8 | 2 | 8 | 1 | compute | SDC issue = heat and voltage delayed prod | |
na | 59 | cluster | 258 | 516 | 2 | 12/1/2005 | 6/1/2006 | 3/15/2011 | node | 8 | 2 | 8 | 1 | compute | SDC issue = heat and voltage delayed prod | |
na | 60 | cluster | 258 | 516 | 2 | 12/1/2005 | 6/1/2006 | 3/15/2011 | node | 8 | 2 | 8 | 1 | compute | SDC issue = heat and voltage delayed prod | |
na | 61 | cluster | 258 | 516 | 2 | 12/1/2005 | 6/1/2006 | 3/15/2011 | node | 8 | 2 | 8 | 1 | compute | SDC issue = heat and voltage delayed prod | |
na | 62 | cluster | 32 | 64 | 2 | 8/1/2007 | 10/1/2007 | 3/1/2011 | node | 4 | 2 | 8 | 1 | compute | ||
na | 63 | cluster | 128 | 256 | 2 | 8/1/2006 | 10/1/2006 | 8/1/2010 | node | 8 | 2 | 8 | 1 | compute | ||
na | 64 | cluster | 180 | 720 | 4 | 10/15/2008 | 2/1/2009 | current | node | 32 | 10 | 9 | 1 | compute | ||
na | 65 | cluster | 136 | 2176 | 16 | 6/1/2008 | 12/15/2008 | current | node | 32 | 11 | 10 | 1 | compute | machine had SDC issues, took long time to certify | |
na | 67 | cluster | 12960 | 38400 | 2 and 4 | 2005 | 2 | 15 | 1 | compute | ||||||
na | 68 | cluster | 1536 | 12288 | 8 | Q3 2005 | 8/2005 | 8/2010 | 16 and 32 | 16 | 1 | compute | ||||
na | 69 | cluster | 36864 | 147456 | 4 | 12/2009 | current | 4 | 17 | 3 | compute |
Download system layout files here
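As a quick sanity check on rows of the table above, a small parser can verify that the node count times cores per node matches the total core count. This is an illustrative sketch, not part of the released data: the field order follows the table header above, and only the first six columns are parsed.

```python
# Illustrative sketch: parse a pipe-delimited system row (field order
# assumed from the table header above) and cross-check core counts.
def parse_system_row(row: str) -> dict:
    fields = [f.strip() for f in row.split("|")]
    return {
        "paper_num": fields[0],       # CMU paper number ("na" if none)
        "system_num": int(fields[1]),
        "type": fields[2],            # cluster / smp / numa
        "nodes": int(fields[3]),
        "total_cores": int(fields[4]),
        "cores_per_node": int(fields[5]),
    }

row = "9 | 3 | cluster | 128 | 512 | 4 | 8/1/2003 | 9/1/2003 | 1/6/2009 | part | 4 | 2 | 2 | 1 | compute"
sys3 = parse_system_row(row)
# For most rows, nodes * cores_per_node equals the total core count.
assert sys3["nodes"] * sys3["cores_per_node"] == sys3["total_cores"]
```

Note that a few rows do not satisfy this identity exactly (e.g. systems listed with front ends or varying sizes over time), so a check like this is a screen for transcription errors rather than a hard invariant.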
System 20 job exit status information
Status | Description |
---|---|
running | All processes are scheduled |
suspended | All processes are suspended |
finished | All processes have exited |
hung | One or more nodes is not responding |
aborted | User aborted job (Ctrl/C) |
failed | One or more of the nodes in use by this job crashed or was configured out |
killed | An application process was killed by a signal (this can be a user signal or other signal) |
expired | Time limit has expired |
syskill | Job was killed by an administrative user (can be because of a problem or because an admin needed to do maintenance, etc.) |
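For interrupt and availability analysis, these statuses are often grouped into coarser outcome classes (successful completion vs. user-initiated termination vs. system-caused failure). The grouping below is an illustrative assumption, not an official LANL taxonomy:

```python
# Hedged sketch: map System 20 job exit statuses to coarse outcome
# classes. The class assignments are an analysis choice, not part of
# the released data.
OUTCOME = {
    "finished": "success",
    "running": "active",
    "suspended": "active",
    "aborted": "user",      # user hit Ctrl/C
    "killed": "user",       # killed by a signal
    "expired": "user",      # hit its time limit
    "hung": "system",       # node(s) not responding
    "failed": "system",     # node crashed or configured out
    "syskill": "system",    # killed by an admin
}

def classify(status: str) -> str:
    """Return the coarse outcome class for a job status string."""
    return OUTCOME.get(status.strip().lower(), "unknown")
```

A tally of these classes over the usage files gives a quick first picture of how often jobs end for system reasons versus user reasons.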
We are happy to host these data to enable research.
If you have data that you can make available and need a place to host it, or if you know of this type of data available on the internet and want to share a link with us, feel free to email us: just click here.
Reliability/Interrupt/Failure/Usage Data sets available
These data were used in a paper at the Carnegie Mellon University Parallel Data Lab. That paper is an excellent example of how these data can be used in computer science research. The paper numbers the systems differently than the numbering in the data itself; the table below gives both the original system number from the data and the CMU paper number. This is a link to the CMU paper: CMU Failure Project Site
Additionally, other Los Alamos related supercomputer failure papers:
- Predicting the Number of Fatal Soft Errors in Los Alamos National Laboratory's ASC Q Supercomputer
- Estimating Reliability Trends for the World's Fastest Computer [Blue Mountain circa 2000]
- Excerpt from Cray report in 1976 concerning Cray 1 reliability/mean time to interrupt
This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. In most cases, these works may not be reposted without the explicit permission of the copyright holder.
Bianca Schroeder at CMU has been kind enough to provide a FAQ about the 1995-2005 failure data set. Click here to see that FAQ
Reliability/Interrupt/Failure/Usage Data Sets for the 1995-2005 timeframe
description | size | records | name | get readme | get file |
---|---|---|---|---|---|
all systems failure/interrupt data 1996-2005 | 2963538 | 23741 | LA-UR-05-7318-failure-data-1996-2005.csv | ||
system 20 usage with domain info | 51675641 | 489376 | LA-UR-06-0803-MX20_NODES_0_TO_255_DOM.TXT | ||
system 20 usage with node info nodes number from zero | 43926669 | 489376 | LA-UR-06-0803-MX20_NODES_0_TO_255_NODE-Z.TXT | ||
system 20 event info nodes number from zero | 33120015 | 433490 | LA-UR-06-0803-MX20_NODES_0_TO_255_EVENTS.csv | ||
system 20 node internal disk failure info nodes number from zero | 209 | 14 | LA-UR-06-6079-MX20_NODES_0_TO_255_NODEDISK-Z.TXT | ||
system 15 usage with node info nodes number from zero | 2416139 | 17823 | LA-UR-06-0999-MX15-NODE-Z.TXT | ||
system 16 usage with node info nodes number from one | 321293488 | 1630479 | LA-UR-06-1446-MX16-NODE-NOZ.TXT | ||
system 23 usage with node info nodes number from one | 60674531 | 654927 | LA-UR-06-1447-MX23-NODE-NOZ.TXT | ||
system 8 usage with node info nodes number from one | 67291020 | 763293 | LA-UR-06-3194-MX8-NODE-NOZ.TXT |
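When downloading, the record counts listed in the table above make a simple integrity check. The sketch below counts data records in a CSV file; it assumes the first row is a header, so consult the accompanying readme for the actual column layout before doing real analysis.

```python
# Sketch: count data records in one of the CSV files and compare
# against the record count listed in the table above. Assumes the
# first row is a header; see the readme for the real column layout.
import csv

def count_records(path: str) -> int:
    with open(path, newline="") as fh:
        reader = csv.reader(fh)
        next(reader, None)  # skip the first row, assumed to be a header
        return sum(1 for _ in reader)
```

For example, the all-systems failure file is listed above with 23741 records, so a mismatch after download suggests a truncated transfer.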
Reliability/Interrupt/Failure/Usage Data Sets for the 1995-2011 timeframe
Consolidated failure data covering 1995 through 09/2011 will be available here. Available soon.