Operational Data to Support and Enable Computer Science Research
Open computer science research desperately needs access to operational data from real computer systems. Data on failure, availability, usage, environment, performance, and workload characterization are among the most needed. The following data sets are provided under universal release for any computer science researcher to use.
All we ask is that, if you use these data in your research, you acknowledge Los Alamos National Laboratory for providing them.
Computer systems from which operational data was drawn
system CMU paper number | system data machine number | system type | number nodes | number cpus | cpus/node | install date | production date | decommission date | fru | mem per node | cpu type | number of interconnects | use type | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 7 | smp | 1 | 8 | 8 | before tracking | before tracking | Dec-99 | part | 16 | 3 | 0 | compute | ||
2 | 24 | smp | 1 | 32 | 32 | before tracking | before tracking | Dec-03 | part | 8 | 7 | 1 | compute | ||
3 | 22 | smp | 1 | 4 | 4 | before tracking | before tracking | Apr-03 | part | 1 | 6 | 0 | compute | ||
4 | 8 | cluster | 164 | 328 | 2 | Mar-01 | Apr-01 | current | part | 1 | 4 | 1 | compute | ||
5 | 20 | cluster | 512 | 2048 | 4 | Oct-01 | Dec-01 | current | part | 16 | 2 | 2 | compute | ||
6 | 21 | cluster | 128 | 512 | 4 | Aug-01 | Sep-01 | Jan-02 | part | 16 | 2 | 2 | compute | ||
7 | 18 | cluster | 1024 | 4096 | 4 | Mar-02 | May-02 | current | part | 16 | 2 | 2 | compute | ||
8 | 19 | cluster | 1024 | 4096 | 4 | Aug-02 | Oct-02 | current | part | 16 | 2 | 2 | compute | ||
9 | 3 | cluster | 128 | 512 | 4 | Aug-03 | Sep-03 | current | part | 4 | 2 | 1 | compute | ||
10 | 4 | cluster | 128 | 512 | 4 | Aug-03 | Sep-03 | current | part | 4 | 2 | 1 | compute | ||
11 | 5 | cluster | 128 | 512 | 4 | Aug-03 | Sep-03 | current | part | 4 | 2 | 1 | compute | ||
12 | 6 | cluster | 32 | 128 | 4 | Aug-03 | Sep-03 | current | part | 16 | 2 | 1 | compute | ||
13 | 14 | cluster | 128 | 256 | 2 | Aug-03 | Sep-03 | current | node | 4 | 2 | 1 | compute | ||
14 | 9 | cluster | 256 | 512 | 2 | Aug-03 | Sep-03 | current | node | 4 | 2 | 1 | compute | ||
15 | 10 | cluster | 256 | 512 | 2 | Aug-03 | Sep-03 | current | node | 4 | 2 | 1 | compute | ||
16 | 11 | cluster | 256 | 512 | 2 | Aug-03 | Sep-03 | current | node | 4 | 2 | 1 | compute | ||
17 | 13 | cluster | 256 | 512 | 2 | Aug-03 | Sep-03 | current | node | 4 | 2 | 1 | compute | ||
18 | 12 | cluster | 512 | 1024 | 2 | Aug-03 | Sep-03 | current | node | 4 | 2 | 1 | compute | ||
19 | 16 | cluster | 16 | 2048 | 128 | Oct-96 | Dec-96 | Sep-02 | part | 32 | 1 | 4 | compute | ||
20 | 2 | cluster | 49 | 6152 | 128 | Nov-96 | Jan-97 | current | part | 128 | 1 | 12 | graphics.compute | ||
21 | 23 | cluster | 5 | 544 | 128 | Oct-98 | Oct-98 | Dec-04 | part | 128 | 1 | 4 | compute | ||
22 | 15 | numa | 1 | 265 | 256 | Nov-04 | Nov-04 | current | part | 1024 | 5 | 0 | compute | ||
-- | 25 | cluster | 128 | 256 | 2 | Dec-02 | N/A (testbed) | current | part | 4G | 8 | 1 | compute | ||
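Because the table above lists two numbering schemes for the same machines, a small lookup can help when moving between a paper's system numbers and the numbers used in the data. The following is a minimal sketch, not an official tool: the dictionary is transcribed from the first two columns of the table, and the function names are hypothetical.

```python
# CMU paper number -> data machine number, transcribed from the systems table.
# The testbed row (data machine 25) has no CMU paper number and is left out.
CMU_TO_MACHINE = {
    1: 7, 2: 24, 3: 22, 4: 8, 5: 20, 6: 21, 7: 18, 8: 19,
    9: 3, 10: 4, 11: 5, 12: 6, 13: 14, 14: 9, 15: 10, 16: 11,
    17: 13, 18: 12, 19: 16, 20: 2, 21: 23, 22: 15,
}

# Reverse lookup: data machine number -> CMU paper number.
MACHINE_TO_CMU = {machine: cmu for cmu, machine in CMU_TO_MACHINE.items()}


def machine_for_paper_system(cmu_number):
    """Return the data machine number for a CMU paper system number."""
    return CMU_TO_MACHINE[cmu_number]


def paper_system_for_machine(machine_number):
    """Return the CMU paper number, or None if the table lists none."""
    return MACHINE_TO_CMU.get(machine_number)


if __name__ == "__main__":
    print(machine_for_paper_system(20))   # data machine number for paper system 20
    print(paper_system_for_machine(25))   # None: the testbed has no paper number
```

Returning `None` for machines without a paper number keeps the testbed row from raising an error when a whole data file is processed.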
System 20 job exit status information
Job Status Values

Status | Description |
---|---|
running | All processes are scheduled |
suspended | All processes are suspended |
finished | All processes have exited |
hung | One or more nodes is not responding |
aborted | User aborted job (Ctrl/C) |
failed | One or more of the nodes in use by this job crashed or was configured out |
killed | An application process was killed by a signal (this can be a user signal or other signal) |
expired | Time limit has expired |
syskill | Job was killed by an administrative user (can be because of a problem or because an admin needed to do maintenance etc.) |
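When summarizing job records, it can be useful to tally the status values listed above and to flag anything that is not one of them. The sketch below assumes the statuses have already been extracted from a usage file as plain strings; the column that holds them is not described here, so the extraction step is left out. The `NODE_RELATED` grouping is one reading of the descriptions, not a classification supplied with the data.

```python
from collections import Counter

# Status values and descriptions as listed for system 20.
JOB_STATUS = {
    "running": "All processes are scheduled",
    "suspended": "All processes are suspended",
    "finished": "All processes have exited",
    "hung": "One or more nodes is not responding",
    "aborted": "User aborted job (Ctrl/C)",
    "failed": "One or more of the nodes in use by this job crashed "
              "or was configured out",
    "killed": "An application process was killed by a signal",
    "expired": "Time limit has expired",
    "syskill": "Job was killed by an administrative user",
}

# Assumption: statuses whose description mentions a node problem.
NODE_RELATED = {"hung", "failed"}


def tally_statuses(statuses):
    """Count known status values; collect unrecognized ones separately."""
    known, unknown = Counter(), Counter()
    for raw in statuses:
        key = raw.strip().lower()
        if key in JOB_STATUS:
            known[key] += 1
        else:
            unknown[key] += 1
    return known, unknown


if __name__ == "__main__":
    known, unknown = tally_statuses(["finished", "failed", "finished", "hung"])
    node_related = sum(known[s] for s in NODE_RELATED)
    print(dict(known), dict(unknown), node_related)
```

Keeping unrecognized values in a separate counter, rather than dropping them, makes it easier to notice if a file uses status strings beyond the nine listed.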
We are happy to host these data to enable research.
If you have data that you can make available and need a place to host them, or if you know of data of this type available on the internet and want to share a link with us, feel free to email us.
Reliability/Interrupt/Failure/Usage Data sets available
These data have been used in a paper from the Carnegie Mellon University Parallel Data Lab, which is an excellent example of how these data can be used in computer science research. The paper numbers the systems differently from the data themselves; the table of computer systems gives both the original system number in the data and the CMU paper number. Link to the CMU paper: CMU Failure Project Site
Other Los Alamos-related supercomputer failure papers:
- Predicting the Number of Fatal Soft Errors in Los Alamos National Laboratory's ASC Q Supercomputer
- Estimating Reliability Trends for the World's Fastest Computer [Blue Mountain circa 2000]
- Excerpt from Cray report in 1976 concerning Cray 1 reliability/mean time to interrupt
This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. In most cases, these works may not be reposted without the explicit permission of the copyright holder.
Bianca Schroeder at CMU has been kind enough to provide a FAQ about the failure data set.
Reliability/Interrupt/Failure/Usage Data Sets
description | size (bytes) | records | name | get readme | get file |
---|---|---|---|---|---|
all systems failure/interrupt data 1996-2005 | 2963538 | 23741 | LA-UR-05-7318-failure-data-1996-2005.csv | ||
system 20 usage with domain info | 51675641 | 489376 | LA-UR-06-0803-MX20_NODES_0_TO_255_DOM.TXT | ||
system 20 usage with node info, nodes numbered from zero | 43926669 | 489376 | LA-UR-06-0803-MX20_NODES_0_TO_255_NODE-Z.TXT | ||
system 20 event info, nodes numbered from zero | 33120015 | 433490 | LA-UR-06-0803-MX20_NODES_0_TO_255_EVENTS.csv | ||
system 20 node internal disk failure info, nodes numbered from zero | 209 | 14 | LA-UR-06-6079-MX20_NODES_0_TO_255_NODEDISK-Z.TXT | ||
system 15 usage with node info, nodes numbered from zero | 2416139 | 17823 | LA-UR-06-0999-MX15-NODE-Z.TXT | ||
system 16 usage with node info, nodes numbered from one | 321293488 | 1630479 | LA-UR-06-1446-MX16-NODE-NOZ.TXT | ||
system 23 usage with node info, nodes numbered from one | 60674531 | 654927 | LA-UR-06-1447-MX23-NODE-NOZ.TXT | ||
system 8 usage with node info, nodes numbered from one | 67291020 | 763293 | LA-UR-06-3194-MX8-NODE-NOZ.TXT | ||
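The size and record counts in the table above can serve as a quick sanity check on a downloaded copy. The sketch below is illustrative only and rests on two assumptions: that the size column is in bytes, and that the file on disk is byte-identical to the hosted one (a transfer that converts line endings would change the size). Whether the record counts include header lines is not stated, so the sketch does not compare line counts. The function names are hypothetical.

```python
import os

# File name -> (listed size, listed record count), copied from the table.
CATALOG = {
    "LA-UR-05-7318-failure-data-1996-2005.csv": (2963538, 23741),
    "LA-UR-06-0803-MX20_NODES_0_TO_255_DOM.TXT": (51675641, 489376),
    "LA-UR-06-0803-MX20_NODES_0_TO_255_NODE-Z.TXT": (43926669, 489376),
    "LA-UR-06-0803-MX20_NODES_0_TO_255_EVENTS.csv": (33120015, 433490),
    "LA-UR-06-6079-MX20_NODES_0_TO_255_NODEDISK-Z.TXT": (209, 14),
    "LA-UR-06-0999-MX15-NODE-Z.TXT": (2416139, 17823),
    "LA-UR-06-1446-MX16-NODE-NOZ.TXT": (321293488, 1630479),
    "LA-UR-06-1447-MX23-NODE-NOZ.TXT": (60674531, 654927),
    "LA-UR-06-3194-MX8-NODE-NOZ.TXT": (67291020, 763293),
}


def bytes_per_record(name):
    """Average listed size per record (assumes the size column is bytes)."""
    size, records = CATALOG[name]
    return size / records


def size_matches(path):
    """True if the file at `path` has the size listed for its file name."""
    listed_size, _ = CATALOG[os.path.basename(path)]
    return os.path.getsize(path) == listed_size


if __name__ == "__main__":
    for name in CATALOG:
        print(f"{name}: {bytes_per_record(name):.1f} bytes per record")
```

A call such as `size_matches("downloads/LA-UR-06-0999-MX15-NODE-Z.TXT")` (the directory is a placeholder) returns `False` for a truncated download without reading the file's contents.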