Los Alamos National Laboratory
Computer Science Research :: Failure Data

Operational Data to Support and Enable Computer Science Research

To enable open computer science research, access to computer operational data is desperately needed. Data on failure, availability, usage, environment, performance, and workload characterization are among the most needed by computer science researchers. The following data sets are provided under universal release for any computer science researcher to use in their work.

All we ask is that, if you use these data in your research, you acknowledge Los Alamos National Laboratory for providing them.

The first set of data was made available in 2005 and covers 1995-2005; an update is being made available that adds failure data from 2005 through September 2011.

Computer systems from which operational data were drawn

Columns: CMU paper number | system number | system type | number of nodes | total CPUs or cores | CPUs or cores per node | install date | production date | decommission date | FRU type | memory per node (GB) | CPU type number | memory type number | number of interconnects | use type | notes
20 2 cluster  49 6152 128 11/1/1996 1/1/1997 2/1/2007 part  128 1 1 12 graphics.compute Includes front end
9 3 cluster  128 512 4 8/1/2003 9/1/2003 1/6/2009 part  4 2 2 1 compute 
10 4 cluster  128 512 4 8/1/2003 9/1/2003 1/6/2009 part  4 2 2 1 compute 
11 5 cluster  128 512 4 8/1/2003 9/1/2003 1/6/2009 part  4 2 2 1 compute 
12 6 cluster  32 128 4 8/1/2003 9/1/2003 1/6/2009 part  16 2 2 1 compute 
1 7 smp  1 8 8 before tracking  before tracking  12/1/1999 part  16 3 3 0 compute Early Failure Data not Available
4 8 cluster  164 328 2 3/1/2001 4/1/2001 10/1/2005 part  1 4 4 1 compute  Data assumes full 384 procs but cluster was not always that big
14 9 cluster  256 512 2 8/1/2003 9/1/2003 10/5/2009 node  4 2 8 1 compute 
15 10 cluster  256 512 2 8/1/2003 9/1/2003 10/5/2009 node  4 2 8 1 compute 
16 11 cluster  256 512 2 8/1/2003 9/1/2003 10/5/2009 node  4 2 8 1 compute 
18 12 cluster  512 1024 2 8/1/2003 9/1/2003 10/5/2009 node  4 2 8 1 compute  When two units were joined as one 512 node system
17 13 cluster  256 512 2 8/1/2003 9/1/2003 10/5/2009 node  4 2 8 1 compute 
13 14 cluster  128 256 2 8/1/2003 9/1/2003 10/5/2009 node  4 2 8 1 compute 
22 15 numa  1 265 256 11/1/2004 11/1/2004 10/5/2009 part  1024 5 5 1 compute
19 16 cluster  16 2048 128 10/1/1996 12/1/1996 9/1/2002 part  32 1 1 4 compute Part of time with fewer nodes
7 18 cluster  1024 4096 4 3/1/2002 5/1/2002 3/1/2008 part  16 2 2 2 compute 
8 19 cluster  1024 4096 4 8/1/2002 10/1/2002 3/1/2008 part  16 2 2 2 compute 
5 20 cluster  512 2048 4 10/1/2001 12/1/2001 3/1/2008 part  16 2 2 2 compute 
6 21 cluster  128 512 4 8/1/2001 9/1/2001 1/1/2002 part  16 2 2 2 compute
3 22 smp  1 4 4 before tracking  before tracking  4/1/2003 part  1 6 6 0 compute
21 23 cluster  5 544 128 10/1/1998 10/1/1998 12/1/2004 part  128 1 1 4 compute Includes front end
2 24 smp  1 32 32 before tracking  before tracking  12/1/2003 part  8 7 7 0 compute
na 25 cluster  128 256 2 12/2/2011 N/A (testbed)  current  node  4 8 8 1 testbed
na 26 cluster  128 1024 8 10/1/2006 N/A (testbed)  current  node  16 9 9 1 testbed
na 27 cluster  139 1112 8 10/1/2006 12/1/2006 current  node  16 9 9 1 compute
na  28 cluster  1834 14672 8 10/1/2006 5/1/2007 current  node  32 9 9 1 compute
na 29 cluster  12 48 4 1/1/2009 4/15/2009 current  node  32 10 9 1 compute
na 30 cluster  360 1440 4 10/15/2008 2/1/2009 current  node  32 10 9 1 compute
na 31 cluster  12 48 4 10/15/2008 2/10/2009 current  node  32 10 9 1 compute
na 32 cluster  12 48 4 7/1/2009 9/1/2009 current  node  32 10 9 1 compute
na 33 cluster  3060 12240 4 10/15/2008 2/1/2009 current  node  32 10 9 1 compute machine was on one network for about 9 months before being moved to a different network
na 34 cluster  272 4352 16 6/1/2008 12/15/2008 current  node  32 11 10 1 compute machine had SDC issues, took a long time to certify
na 35 cluster  64 1024 16 7/1/2008 6/1/2009 current  node  32 11 10 1 compute machine had SDC issues, took a long time to certify
na 36 cluster  408 5760 16 7/1/2008 6/1/2009 current  node  32 11 10 1 compute machine had SDC issues, took a long time to certify
na 37 cluster  416 13312 32 9/1/2010 2/15/2011 current  node  64 11 11 1 compute
na 38 cluster  592 4736 8 11/1/2010 3/1/2011 current  node  24 12 11 1 compute
na 39 cluster  620 4960 8 11/1/2010 3/1/2100 current  node  24 12 11 1 compute
na 40 cluster  68 1088 16 10/1/2010 12/1/2010 current  node  32 13 11 1 compute
na 41 cluster  6654 107264 16 10/1/2010 2/1/2011 current  node  32 13 11 1 compute
na 46 cluster  256 512 2 10/1/2005 11/15/2005 10/5/2009 node  16 2 8 1 compute 
na 47 cluster  256 512 2 10/1/2005 11/15/2005 10/5/2009 node  8 2 8 1 compute 
na 48 cluster  256 512 2 10/1/2005 11/15/2005 10/5/2009 node  8 2 8 1 compute 
na 49 cluster  256 512 2 10/1/2005 11/15/2005 10/5/2009 node  8 2 8 1 compute 
na 50 cluster  256 512 2 10/1/2005 11/15/2005 10/5/2009 node  8 2 8 1 compute 
na 51 cluster  256 512 2 10/1/2005 11/15/2005 10/5/2009 node  8 2 8 1 compute 
na 52 cluster  256 1024 4 10/1/2005 11/15/2005 10/5/2009 node  8 9 8 1 compute 
na 53 cluster  300 600 2 8/1/2003 9/1/2003 10/5/2009 node  8 2 8 1 compute 
na 54 cluster  256 512 2 8/1/2005 10/1/2005 10/5/2009 node  8 2 8 1 compute 
na 55 cluster  256 512 2 8/1/2005 10/1/2005 10/5/2009 node  8 2 8 1 compute 
na 56 cluster  128 256 2 8/1/2005 11/1/2005 10/5/2009 node  16 2 8 1 compute 
na 57 cluster  258 516 2 12/1/2005 6/1/2006 3/15/2011 node  8 2 8 1 compute SDC issue (heat and voltage) delayed production
na 58 cluster  258 516 2 12/1/2005 6/1/2006 3/15/2011 node  8 2 8 1 compute SDC issue (heat and voltage) delayed production
na 59 cluster  258 516 2 12/1/2005 6/1/2006 3/15/2011 node  8 2 8 1 compute SDC issue (heat and voltage) delayed production
na 60 cluster  258 516 2 12/1/2005 6/1/2006 3/15/2011 node  8 2 8 1 compute SDC issue (heat and voltage) delayed production
na 61 cluster  258 516 2 12/1/2005 6/1/2006 3/15/2011 node  8 2 8 1 compute SDC issue (heat and voltage) delayed production
na 62 cluster  32 64 2 8/1/2007 10/1/2007 3/1/2011 node  4 2 8 1 compute
na 63 cluster  128 256 2 8/1/2006 10/1/2006 8/1/2010 node  8 2 8 1 compute
na 64 cluster 180 720 4 10/15/2008 2/1/2009 current node  32 10 9 1 compute
na 65 cluster 136 2176 16 6/1/2008 12/15/2008 current node  32 11 10 1 compute machine had SDC issues, took a long time to certify
na 67 cluster 12960 38400 2 and 4 2005   2 15 1 compute
na 68 cluster 1536 12288 8 Q3 2005 8/2005 8/2010   16 and 32 16 1 compute
na 69 cluster 36864 147456 4 12/2009 current   4 17 3 compute

Download system layout files here

machine 3
machine 4
machine 5
machine 6
machine 8
machine 9
machine 10
machine 11
machine 12
machine 13
machine 14
machine 18
machine 19
machine 20
machine 26
machine 27
machine 29
machine 30
machine 31

System 20 job exit status information

Job Status Values 
Status                                   Description

running           All processes are scheduled
suspended         All processes are suspended
finished          All processes have exited
hung              One or more nodes are not responding
aborted           User aborted job (Ctrl/C)
failed            One or more of the nodes in use by this job crashed or was configured out
killed            An application process was killed by a signal  (this can be a user signal or other signal)
expired           Time limit has expired
syskill           Job was killed by an administrative user  (can be because of a problem or because an admin needed to do maintenance etc.)
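
As a rough illustration of how these status values can be used when working with the system 20 usage data, the Python sketch below tallies job records by exit status and counts abnormal terminations. It is only a sketch: the file name, the comma-separated layout, and the "status" column name are assumptions for illustration; consult the actual usage files and their READMEs for the real format.

    import csv
    from collections import Counter

    # Exit statuses above that indicate the job did not finish cleanly.
    ABNORMAL = {"hung", "aborted", "failed", "killed", "expired", "syskill"}

    def summarize_exit_statuses(path):
        """Tally job records by exit status in a comma-separated usage file.

        ASSUMPTION: the file has a header row with a "status" column holding
        one of the values listed above; check the real file layout first.
        """
        counts = Counter()
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                counts[row["status"].strip().lower()] += 1
        abnormal = sum(n for status, n in counts.items() if status in ABNORMAL)
        return counts, abnormal

    if __name__ == "__main__":
        counts, abnormal = summarize_exit_statuses("mx20_usage.csv")  # hypothetical file name
        for status, n in counts.most_common():
            print(f"{status:10s} {n}")
        print(f"abnormal terminations: {abnormal}")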

We are happy to host these data to enable research.

If you have data that you can make available and need a place to host them, or if you know of this type of data elsewhere on the internet and want to share a link with us, feel free to email us; just click here.

Reliability/Interrupt/Failure/Usage Data sets available

These data have been used in a paper from the Carnegie Mellon University Parallel Data Lab. That paper is an excellent example of how these data can be used in computer science research. The paper numbers the systems differently from the numbering in the data itself; the table above gives both the original system number from the data and the CMU paper number. This is a link to the CMU paper: CMU Failure Project Site
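
When combining results from the CMU paper with the raw data, it helps to translate between the two numbering schemes programmatically. The following minimal Python lookup sketch transcribes the mapping from the system table above (CMU paper number to system number used in the data); the variable names are only illustrative.

    # CMU paper number -> system number used in the LANL data,
    # transcribed from the system table above.
    CMU_TO_LANL = {
        20: 2, 9: 3, 10: 4, 11: 5, 12: 6, 1: 7, 4: 8, 14: 9,
        15: 10, 16: 11, 18: 12, 17: 13, 13: 14, 22: 15, 19: 16,
        7: 18, 8: 19, 5: 20, 6: 21, 3: 22, 21: 23, 2: 24,
    }
    # Reverse direction: system number in the data -> CMU paper number.
    LANL_TO_CMU = {lanl: cmu for cmu, lanl in CMU_TO_LANL.items()}

    print(CMU_TO_LANL[20])  # -> 2: paper system 20 is system 2 in the data
    print(LANL_TO_CMU[20])  # -> 5: system 20 in the data is paper system 5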

Additionally, other Los Alamos-related supercomputer failure papers:

This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. In most cases, these works may not be reposted without the explicit permission of the copyright holder.

Bianca Schroeder at CMU has been kind enough to provide a FAQ about the 1995-2005 failure data set. Click here to see that FAQ.

Reliability/Interrupt/Failure/Usage Data Sets for the 1995-2005 timeframe

description | size (bytes) | records | file name
all systems failure/interrupt data 1996-2005 | 2963538 | 23741 | LA-UR-05-7318-failure-data-1996-2005.csv
system 20 usage with domain info | 51675641 | 489376 | LA-UR-06-0803-MX20_NODES_0_TO_255_DOM.TXT
system 20 usage with node info (nodes numbered from zero) | 43926669 | 489376 | LA-UR-06-0803-MX20_NODES_0_TO_255_NODE-Z.TXT
system 20 event info (nodes numbered from zero) | 33120015 | 433490 | LA-UR-06-0803-MX20_NODES_0_TO_255_EVENTS.csv
system 20 node internal disk failure info (nodes numbered from zero) | 209 | 14 | LA-UR-06-6079-MX20_NODES_0_TO_255_NODEDISK-Z.TXT
system 15 usage with node info (nodes numbered from zero) | 2416139 | 17823 | LA-UR-06-0999-MX15-NODE-Z.TXT
system 16 usage with node info (nodes numbered from one) | 321293488 | 1630479 | LA-UR-06-1446-MX16-NODE-NOZ.TXT
system 23 usage with node info (nodes numbered from one) | 60674531 | 654927 | LA-UR-06-1447-MX23-NODE-NOZ.TXT
system 8 usage with node info (nodes numbered from one) | 67291020 | 763293 | LA-UR-06-3194-MX8-NODE-NOZ.TXT
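
As a simple example of how the failure/interrupt file above might be explored, the Python sketch below counts failure records per system. The "System" column name is an assumption for illustration only; the README distributed with LA-UR-05-7318 documents the actual fields.

    import csv
    from collections import Counter

    def failures_per_system(path, system_field="System"):
        """Count failure records per system in the 1996-2005 failure CSV.

        ASSUMPTION: the CSV has a header row and a column (called "System"
        here) identifying which system each failure record belongs to.
        """
        counts = Counter()
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                counts[row[system_field]] += 1
        return counts

    if __name__ == "__main__":
        counts = failures_per_system("LA-UR-05-7318-failure-data-1996-2005.csv")
        for system, n in counts.most_common():
            print(f"system {system}: {n} failure records")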

Reliability/Interrupt/Failure/Usage Data Sets for the 1995-2011 timeframe

Consolidated failure data covering 1995 through 09/2011 will be available here.

Available Soon


Operated by Los Alamos National Security, LLC for the U.S. Department of Energy
