Los Alamos National Laboratory
Computer Science Research :: Failure Data

Operational Data to Support and Enable Computer Science Research

To enable open computer science research, access to computer operational data is desperately needed. Data on failure, availability, usage, environment, performance, and workload characterization are among the data most needed by computer science researchers. The following data sets are provided under universal release to any computer science researcher for use in computer science work.

All we ask is that, if you use these data in your research, you recognize Los Alamos National Laboratory for providing them.

Computer systems from which operational data were drawn

CMU paper  data machine  system   number  number  cpus/node  install          production       decommission  fru   mem per  cpu   number of      use
number     number        type     nodes   cpus               date             date             date                node     type  interconnects  type

1          7             smp      1       8       8          before tracking  before tracking  Dec-99        part  16       3     0              compute
2          24            smp      1       32      32         before tracking  before tracking  Dec-03        part  8        7     1              compute
3          22            smp      1       4       4          before tracking  before tracking  Apr-03        part  1        6     0              compute
4          8             cluster  164     328     2          Mar-01           Apr-01           current       part  1        4     1              compute
5          20            cluster  512     2048    4          Oct-01           Dec-01           current       part  16       2     2              compute
6          21            cluster  128     512     4          Aug-01           Sep-01           Jan-02        part  16       2     2              compute
7          18            cluster  1024    4096    4          Mar-02           May-02           current       part  16       2     2              compute
8          19            cluster  1024    4096    4          Aug-02           Oct-02           current       part  16       2     2              compute
9          3             cluster  128     512     4          Aug-03           Sep-03           current       part  4        2     1              compute
10         4             cluster  128     512     4          Aug-03           Sep-03           current       part  4        2     1              compute
11         5             cluster  128     512     4          Aug-03           Sep-03           current       part  4        2     1              compute
12         6             cluster  32      128     4          Aug-03           Sep-03           current       part  16       2     1              compute
13         14            cluster  128     256     2          Aug-03           Sep-03           current       node  4        2     1              compute
14         9             cluster  256     512     2          Aug-03           Sep-03           current       node  4        2     1              compute
15         10            cluster  256     512     2          Aug-03           Sep-03           current       node  4        2     1              compute
16         11            cluster  256     512     2          Aug-03           Sep-03           current       node  4        2     1              compute
17         13            cluster  256     512     2          Aug-03           Sep-03           current       node  4        2     1              compute
18         12            cluster  512     1024    2          Aug-03           Sep-03           current       node  4        2     1              compute
19         16            cluster  16      2048    128        Oct-96           Dec-96           Sep-02        part  32       1     4              compute
20         2             cluster  49      6152    128        Nov-96           Jan-97           current       part  128      1     12             graphics.compute
21         23            cluster  5       544     128        Oct-98           Oct-98           Dec-04        part  128      1     4              compute
22         15            numa     1       265     256        Nov-04           Nov-04           current       part  1024     5     0              compute
--         25            cluster  128     256     2          Dec-02           N/A (testbed)    current       part  4G       8     1              compute
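
Because the CMU paper and the data files number the systems differently, a small lookup can help when moving between the two. The following is a minimal Python sketch whose number pairs are transcribed from the first two columns of the table above; the names PAPER_TO_DATA, DATA_TO_PAPER, and paper_number are illustrative and are not part of the data release.

```python
# (CMU paper system number -> data system number), transcribed from the
# first two columns of the systems table. System 25 (the testbed row) has
# no CMU paper number, so it is intentionally absent.
PAPER_TO_DATA = {
    1: 7, 2: 24, 3: 22, 4: 8, 5: 20, 6: 21, 7: 18, 8: 19,
    9: 3, 10: 4, 11: 5, 12: 6, 13: 14, 14: 9, 15: 10, 16: 11,
    17: 13, 18: 12, 19: 16, 20: 2, 21: 23, 22: 15,
}

# Reverse lookup: data system number -> CMU paper system number.
DATA_TO_PAPER = {data: paper for paper, data in PAPER_TO_DATA.items()}


def paper_number(data_system):
    """Return the CMU paper number for a data system number, or None."""
    return DATA_TO_PAPER.get(data_system)
```

For example, paper_number(20) returns 5, so the "system 20" named in the data files corresponds to system 5 in the paper's numbering.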

System 20 job exit status information

Job Status Values 
Status                                   Description

running           All processes are scheduled
suspended         All processes are suspended
finished          All processes have exited
hung              One or more nodes is not responding
aborted           User aborted job (Ctrl/C)
failed            One or more of the nodes in use by this job crashed or was configured out
killed            An application process was killed by a signal  (this can be a user signal or other signal)
expired           Time limit has expired
syskill           Job was killed by an administrative user  (can be because of a problem or because an admin needed to do maintenance etc.)
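
When working with the system 20 usage data, it can be convenient to keep the status values above in a lookup and tally how often each occurs. The sketch below is in Python and makes assumptions: how the status is spelled or cased inside the data files is not documented here, so the code normalizes whitespace and case, and the names JOB_STATUS, NODE_RELATED, and tally_statuses are illustrative only.

```python
from collections import Counter

# Status names and (shortened) descriptions transcribed from the table above.
JOB_STATUS = {
    "running": "All processes are scheduled",
    "suspended": "All processes are suspended",
    "finished": "All processes have exited",
    "hung": "One or more nodes is not responding",
    "aborted": "User aborted job (Ctrl/C)",
    "failed": "One or more of the nodes in use by this job crashed "
              "or was configured out",
    "killed": "An application process was killed by a signal",
    "expired": "Time limit has expired",
    "syskill": "Job was killed by an administrative user",
}

# Statuses whose descriptions refer to node problems.
NODE_RELATED = {"hung", "failed"}


def tally_statuses(statuses):
    """Count known statuses; collect unrecognized values separately."""
    known, unknown = Counter(), Counter()
    for raw in statuses:
        status = raw.strip().lower()  # assumption: case/spacing may vary
        (known if status in JOB_STATUS else unknown)[status] += 1
    return known, unknown
```

Keeping unrecognized values in a separate counter, rather than discarding them, makes it easier to notice if a file uses a status spelling that differs from the table.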

We are happy to host these data to enable research.

If you have data that you can make available and need a place to host them, or if you know of data of this type available on the internet and want to share a link with us, feel free to email us.

Reliability/Interrupt/Failure/Usage Data sets available

These data have been used in a paper from the Carnegie Mellon University Parallel Data Lab. This paper is an excellent example of how these data can be used in computer science research. The paper numbers the systems differently from the data itself. The systems table on this page gives both the original system number in the data and the CMU paper number. Link to the CMU paper: CMU Failure Project Site

Additionally, other Los Alamos related supercomputer failure papers:

This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. In most cases, these works may not be reposted without the explicit permission of the copyright holder.

Bianca Schroeder at CMU has been kind enough to provide a FAQ about the failure data set.

Reliability/Interrupt/Failure/Usage Data Sets

description                                                          size       records  name

all systems failure/interrupt data 1996-2005                         2963538    23741    LA-UR-05-7318-failure-data-1996-2005.csv
system 20 usage with domain info                                     51675641   489376   LA-UR-06-0803-MX20_NODES_0_TO_255_DOM.TXT
system 20 usage with node info, nodes numbered from zero             43926669   489376   LA-UR-06-0803-MX20_NODES_0_TO_255_NODE-Z.TXT
system 20 event info, nodes numbered from zero                       33120015   433490   LA-UR-06-0803-MX20_NODES_0_TO_255_EVENTS.csv
system 20 node internal disk failure info, nodes numbered from zero  209        14       LA-UR-06-6079-MX20_NODES_0_TO_255_NODEDISK-Z.TXT
system 15 usage with node info, nodes numbered from zero             2416139    17823    LA-UR-06-0999-MX15-NODE-Z.TXT
system 16 usage with node info, nodes numbered from one              321293488  1630479  LA-UR-06-1446-MX16-NODE-NOZ.TXT
system 23 usage with node info, nodes numbered from one              60674531   654927   LA-UR-06-1447-MX23-NODE-NOZ.TXT
system 8 usage with node info, nodes numbered from one               67291020   763293   LA-UR-06-3194-MX8-NODE-NOZ.TXT
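
Some of the files listed above number nodes from zero and others from one, so node identifiers should be brought to a common base before files are compared. The following Python sketch records the base stated for each file in the listing and converts to zero-based numbering. The names NODE_BASE and zero_based_node are illustrative, and how a node number is actually laid out inside each file is not described on this page, so parsing is left out.

```python
# Node numbering base for each file, transcribed from the descriptions in
# the data set listing ("from zero" -> 0, "from one" -> 1). Files whose
# description does not state a base are not included.
NODE_BASE = {
    "LA-UR-06-0803-MX20_NODES_0_TO_255_NODE-Z.TXT": 0,
    "LA-UR-06-0803-MX20_NODES_0_TO_255_EVENTS.csv": 0,
    "LA-UR-06-6079-MX20_NODES_0_TO_255_NODEDISK-Z.TXT": 0,
    "LA-UR-06-0999-MX15-NODE-Z.TXT": 0,
    "LA-UR-06-1446-MX16-NODE-NOZ.TXT": 1,
    "LA-UR-06-1447-MX23-NODE-NOZ.TXT": 1,
    "LA-UR-06-3194-MX8-NODE-NOZ.TXT": 1,
}


def zero_based_node(file_name, node):
    """Convert a node number read from file_name to zero-based numbering.

    Raises KeyError for files whose numbering base is not listed, rather
    than guessing.
    """
    return node - NODE_BASE[file_name]
```

Raising an error for unlisted files is a deliberate choice in this sketch: silently assuming a base would shift every node identifier by one without any visible symptom.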

Operated by Los Alamos National Security, LLC for the U.S. Department of Energy

Privacy Policy | Copyright © 1993-2006 UC | Web Contact