Operational Data to Support and Enable Computer Science Research
Open computer science research urgently needs access to computer operational data. Data on failures, availability, usage, environment, performance, and workload characterization are among the most needed by computer science researchers. The following data sets are provided under universal release to any computer science researcher for use in computer science work.
All we ask is that, if you use these data in your research, you acknowledge Los Alamos National Laboratory for providing them.
The first set of data was made available in 2005 for times spanning 1995-2005; an update is now available that adds failure data from 2005 through 09/2011.
Computer systems from which operational data was drawn
CMU paper number | system number | system type | number nodes | total cpus or cores | cpus or cores per node | install date | production date | decommission date | FRU type | memory per node in Gbytes | cpu type num | mem type num | number interconnects | use type | notes |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
20 | 2 | cluster | 49 | 6152 | 128 | 11/1/1996 | 1/1/1997 | 2/1/2007 | part | 128 | 1 | 1 | 12 | graphics.compute | Includes front end | |
9 | 3 | cluster | 128 | 512 | 4 | 8/1/2003 | 9/1/2003 | 1/6/2009 | part | 4 | 2 | 2 | 1 | compute | ||
10 | 4 | cluster | 128 | 512 | 4 | 8/1/2003 | 9/1/2003 | 1/6/2009 | part | 4 | 2 | 2 | 1 | compute | ||
11 | 5 | cluster | 128 | 512 | 4 | 8/1/2003 | 9/1/2003 | 1/6/2009 | part | 4 | 2 | 2 | 1 | compute | ||
12 | 6 | cluster | 32 | 128 | 4 | 8/1/2003 | 9/1/2003 | 1/6/2009 | part | 16 | 2 | 2 | 1 | compute | ||
1 | 7 | smp | 1 | 8 | 8 | before tracking | before tracking | 12/1/1999 | part | 16 | 3 | 3 | 0 | compute | Early failure data not available | |
4 | 8 | cluster | 164 | 328 | 2 | 3/1/2001 | 4/1/2001 | 10/1/2005 | part | 1 | 4 | 4 | 1 | compute | Data assumes full 384 procs but cluster was not always that big | |
14 | 9 | cluster | 256 | 512 | 2 | 8/1/2003 | 9/1/2003 | 10/5/2009 | node | 4 | 2 | 8 | 1 | compute | ||
15 | 10 | cluster | 256 | 512 | 2 | 8/1/2003 | 9/1/2003 | 10/5/2009 | node | 4 | 2 | 8 | 1 | compute | ||
16 | 11 | cluster | 256 | 512 | 2 | 8/1/2003 | 9/1/2003 | 10/5/2009 | node | 4 | 2 | 8 | 1 | compute | ||
18 | 12 | cluster | 512 | 1024 | 2 | 8/1/2003 | 9/1/2003 | 10/5/2009 | node | 4 | 2 | 8 | 1 | compute | When two units were joined as one 512 node system | |
17 | 13 | cluster | 256 | 512 | 2 | 8/1/2003 | 9/1/2003 | 10/5/2009 | node | 4 | 2 | 8 | 1 | compute | ||
13 | 14 | cluster | 128 | 256 | 2 | 8/1/2003 | 9/1/2003 | 10/5/2009 | node | 4 | 2 | 8 | 1 | compute | ||
22 | 15 | numa | 1 | 265 | 256 | 11/1/2004 | 11/1/2004 | 10/5/2009 | part | 1024 | 5 | 5 | 1 | compute | ||
19 | 16 | cluster | 16 | 2048 | 128 | 10/1/1996 | 12/1/1996 | 9/1/2002 | part | 32 | 1 | 1 | 4 | compute | Part of time with fewer nodes | |
7 | 18 | cluster | 1024 | 4096 | 4 | 3/1/2002 | 5/1/2002 | 3/1/2008 | part | 16 | 2 | 2 | 2 | compute | ||
8 | 19 | cluster | 1024 | 4096 | 4 | 8/1/2002 | 10/1/2002 | 3/1/2008 | part | 16 | 2 | 2 | 2 | compute | ||
5 | 20 | cluster | 512 | 2048 | 4 | 10/1/2001 | 12/1/2001 | 3/1/2008 | part | 16 | 2 | 2 | 2 | compute | ||
6 | 21 | cluster | 128 | 512 | 4 | 8/1/2001 | 9/1/2001 | 1/1/2002 | part | 16 | 2 | 2 | 2 | compute | ||
3 | 22 | smp | 1 | 4 | 4 | before tracking | before tracking | 4/1/2003 | part | 1 | 6 | 6 | 0 | compute | ||
21 | 23 | cluster | 5 | 544 | 128 | 10/1/1998 | 10/1/1998 | 12/1/2004 | part | 128 | 1 | 1 | 4 | compute | Includes front end | |
2 | 24 | smp | 1 | 32 | 32 | before tracking | before tracking | 12/1/2003 | part | 8 | 7 | 7 | 0 | compute | ||
na | 25 | cluster | 128 | 256 | 2 | 12/2/2011 | N/A (testbed) | current | node | 4 | 8 | 8 | 1 | testbed | ||
na | 26 | cluster | 128 | 1024 | 8 | 10/1/2006 | N/A (testbed) | current | node | 16 | 9 | 9 | 1 | testbed | ||
na | 27 | cluster | 139 | 1112 | 8 | 10/1/2006 | 12/1/2006 | current | node | 16 | 9 | 9 | 1 | compute | ||
na | 28 | cluster | 1834 | 14672 | 8 | 10/1/2006 | 5/1/2007 | current | node | 32 | 9 | 9 | 1 | compute | ||
na | 29 | cluster | 12 | 48 | 4 | 1/1/2009 | 4/15/2009 | current | node | 32 | 10 | 9 | 1 | compute | ||
na | 30 | cluster | 360 | 1440 | 4 | 10/15/2008 | 2/1/2009 | current | node | 32 | 10 | 9 | 1 | compute | ||
na | 31 | cluster | 12 | 48 | 4 | 10/15/2008 | 2/10/2009 | current | node | 32 | 10 | 9 | 1 | compute | ||
na | 32 | cluster | 12 | 48 | 4 | 7/1/2009 | 9/1/2009 | current | node | 32 | 10 | 9 | 1 | compute | ||
na | 33 | cluster | 3060 | 12240 | 4 | 10/15/2008 | 2/1/2009 | current | node | 32 | 10 | 9 | 1 | compute | machine was in one network for about 9 months before being moved to a different network | |
na | 34 | cluster | 272 | 4352 | 16 | 6/1/2008 | 12/15/2008 | current | node | 32 | 11 | 10 | 1 | compute | machine had SDC issues, took long time to certify | |
na | 35 | cluster | 64 | 1024 | 16 | 7/1/2008 | 6/1/2009 | current | node | 32 | 11 | 10 | 1 | compute | machine had SDC issues, took long time to certify | |
na | 36 | cluster | 408 | 5760 | 16 | 7/1/2008 | 6/1/2009 | current | node | 32 | 11 | 10 | 1 | compute | machine had SDC issues, took long time to certify | |
na | 37 | cluster | 416 | 13312 | 32 | 9/1/2010 | 2/15/2011 | current | node | 64 | 11 | 11 | 1 | compute | ||
na | 38 | cluster | 592 | 4736 | 8 | 11/1/2010 | 3/1/2011 | current | node | 24 | 12 | 11 | 1 | compute | ||
na | 39 | cluster | 620 | 4960 | 8 | 11/1/2010 | 3/1/2011 | current | node | 24 | 12 | 11 | 1 | compute | ||
na | 40 | cluster | 68 | 1088 | 16 | 10/1/2010 | 12/1/2010 | current | node | 32 | 13 | 11 | 1 | compute | ||
na | 41 | cluster | 6654 | 107264 | 16 | 10/1/2010 | 2/1/2011 | current | node | 32 | 13 | 11 | 1 | compute | ||
na | 46 | cluster | 256 | 512 | 2 | 10/1/2005 | 11/15/2005 | 10/5/2009 | node | 16 | 2 | 8 | 1 | compute | ||
na | 47 | cluster | 256 | 512 | 2 | 10/1/2005 | 11/15/2005 | 10/5/2009 | node | 8 | 2 | 8 | 1 | compute | ||
na | 48 | cluster | 256 | 512 | 2 | 10/1/2005 | 11/15/2005 | 10/5/2009 | node | 8 | 2 | 8 | 1 | compute | ||
na | 49 | cluster | 256 | 512 | 2 | 10/1/2005 | 11/15/2005 | 10/5/2009 | node | 8 | 2 | 8 | 1 | compute | ||
na | 50 | cluster | 256 | 512 | 2 | 10/1/2005 | 11/15/2005 | 10/5/2009 | node | 8 | 2 | 8 | 1 | compute | ||
na | 51 | cluster | 256 | 512 | 2 | 10/1/2005 | 11/15/2005 | 10/5/2009 | node | 8 | 2 | 8 | 1 | compute | ||
na | 52 | cluster | 256 | 1024 | 4 | 10/1/2005 | 11/15/2005 | 10/5/2009 | node | 8 | 9 | 8 | 1 | compute | ||
na | 53 | cluster | 300 | 600 | 2 | 8/1/2003 | 9/1/2003 | 10/5/2009 | node | 8 | 2 | 8 | 1 | compute | ||
na | 54 | cluster | 256 | 512 | 2 | 8/1/2005 | 10/1/2005 | 10/5/2009 | node | 8 | 2 | 8 | 1 | compute | ||
na | 55 | cluster | 256 | 512 | 2 | 8/1/2005 | 10/1/2005 | 10/5/2009 | node | 8 | 2 | 8 | 1 | compute | ||
na | 56 | cluster | 128 | 256 | 2 | 8/1/2005 | 11/1/2005 | 10/5/2009 | node | 16 | 2 | 8 | 1 | compute | ||
na | 57 | cluster | 258 | 516 | 2 | 12/1/2005 | 6/1/2006 | 3/15/2011 | node | 8 | 2 | 8 | 1 | compute | SDC issue = heat and voltage delayed prod | |
na | 58 | cluster | 258 | 516 | 2 | 12/1/2005 | 6/1/2006 | 3/15/2011 | node | 8 | 2 | 8 | 1 | compute | SDC issue = heat and voltage delayed prod | |
na | 59 | cluster | 258 | 516 | 2 | 12/1/2005 | 6/1/2006 | 3/15/2011 | node | 8 | 2 | 8 | 1 | compute | SDC issue = heat and voltage delayed prod | |
na | 60 | cluster | 258 | 516 | 2 | 12/1/2005 | 6/1/2006 | 3/15/2011 | node | 8 | 2 | 8 | 1 | compute | SDC issue = heat and voltage delayed prod | |
na | 61 | cluster | 258 | 516 | 2 | 12/1/2005 | 6/1/2006 | 3/15/2011 | node | 8 | 2 | 8 | 1 | compute | SDC issue = heat and voltage delayed prod | |
na | 62 | cluster | 32 | 64 | 2 | 8/1/2007 | 10/1/2007 | 3/1/2011 | node | 4 | 2 | 8 | 1 | compute | ||
na | 63 | cluster | 128 | 256 | 2 | 8/1/2006 | 10/1/2006 | 8/1/2010 | node | 8 | 2 | 8 | 1 | compute | ||
na | 64 | cluster | 180 | 720 | 4 | 10/15/2008 | 2/1/2009 | current | node | 32 | 10 | 9 | 1 | compute | ||
na | 65 | cluster | 136 | 2176 | 16 | 6/1/2008 | 12/15/2008 | current | node | 32 | 11 | 10 | 1 | compute | machine had SDC issues, took long time to certify | |
na | 67 | cluster | 12960 | 38400 | 2 and 4 | 2005 | 2 | 15 | 1 | compute | ||||||
na | 68 | cluster | 1536 | 12288 | 8 | Q3 2005 | 8/2005 | 8/2010 | 16 and 32 | 16 | 1 | compute | ||||
na | 69 | cluster | 36864 | 147456 | 4 | 12/2009 | current | 4 | 17 | 3 | compute |
Download system layout files here
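As a quick sanity check on rows of the table above, a small parser can verify that the node count times cores per node matches the total core count. This is an illustrative sketch, not part of the released data: the field order follows the table header above, and only the first six columns are parsed.

```python
# Illustrative sketch: parse a pipe-delimited system row (field order
# assumed from the table header above) and cross-check core counts.
def parse_system_row(row: str) -> dict:
    fields = [f.strip() for f in row.split("|")]
    return {
        "paper_num": fields[0],       # CMU paper number ("na" if none)
        "system_num": int(fields[1]),
        "type": fields[2],            # cluster / smp / numa
        "nodes": int(fields[3]),
        "total_cores": int(fields[4]),
        "cores_per_node": int(fields[5]),
    }

row = "9 | 3 | cluster | 128 | 512 | 4 | 8/1/2003 | 9/1/2003 | 1/6/2009 | part | 4 | 2 | 2 | 1 | compute"
sys3 = parse_system_row(row)
# For most rows, nodes * cores_per_node equals the total core count.
assert sys3["nodes"] * sys3["cores_per_node"] == sys3["total_cores"]
```

Note that a few rows do not satisfy this identity exactly (e.g. systems listed with front ends or varying sizes over time), so a check like this is a screen for transcription errors rather than a hard invariant.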
System 20 job exit status information
Status | Description |
---|---|
running | All processes are scheduled |
suspended | All processes are suspended |
finished | All processes have exited |
hung | One or more nodes is not responding |
aborted | User aborted job (Ctrl/C) |
failed | One or more of the nodes in use by this job crashed or was configured out |
killed | An application process was killed by a signal (this can be a user signal or other signal) |
expired | Time limit has expired |
syskill | Job was killed by an administrative user (can be because of a problem or because an admin needed to do maintenance, etc.) |
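For interrupt and availability analysis, these statuses are often grouped into coarser outcome classes (successful completion vs. user-initiated termination vs. system-caused failure). The grouping below is an illustrative assumption, not an official LANL taxonomy:

```python
# Hedged sketch: map System 20 job exit statuses to coarse outcome
# classes. The class assignments are an analysis choice, not part of
# the released data.
OUTCOME = {
    "finished": "success",
    "running": "active",
    "suspended": "active",
    "aborted": "user",      # user hit Ctrl/C
    "killed": "user",       # killed by a signal
    "expired": "user",      # hit its time limit
    "hung": "system",       # node(s) not responding
    "failed": "system",     # node crashed or configured out
    "syskill": "system",    # killed by an admin
}

def classify(status: str) -> str:
    """Return the coarse outcome class for a job status string."""
    return OUTCOME.get(status.strip().lower(), "unknown")
```

A tally of these classes over the usage files gives a quick first picture of how often jobs end for system reasons versus user reasons.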
We are happy to host these data to enable research.
If you have data that you can make available and need a place to host it, or if you know of this type of data available on the internet and want to share a link with us, feel free to email us: just click here.
Reliability/Interrupt/Failure/Usage Data sets available
These data were used in a paper at the Carnegie Mellon University Parallel Data Lab. That paper is an excellent example of how these data can be used in computer science research. The paper numbers the systems differently than the numbering in the data itself; the table below gives both the original system number from the data and the CMU paper number. This is a link to the CMU paper: CMU Failure Project Site
Additionally, other Los Alamos related supercomputer failure papers:
- Predicting the Number of Fatal Soft Errors in Los Alamos National Laboratory's ASC Q Supercomputer
- Estimating Reliability Trends for the World's Fastest Computer [Blue Mountain circa 2000]
- Excerpt from Cray report in 1976 concerning Cray 1 reliability/mean time to interrupt
This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. In most cases, these works may not be reposted without the explicit permission of the copyright holder.
Bianca Schroeder at CMU has been kind enough to provide a FAQ about the 1995-2005 failure data set. Click here to see that FAQ
Reliability/Interrupt/Failure/Usage Data Sets for the 1995-2005 timeframe
description | size | records | name | get readme | get file |
---|---|---|---|---|---|
all systems failure/interrupt data 1996-2005 | 2963538 | 23741 | LA-UR-05-7318-failure-data-1996-2005.csv | ||
system 20 usage with domain info | 51675641 | 489376 | LA-UR-06-0803-MX20_NODES_0_TO_255_DOM.TXT | ||
system 20 usage with node info nodes number from zero | 43926669 | 489376 | LA-UR-06-0803-MX20_NODES_0_TO_255_NODE-Z.TXT | ||
system 20 event info nodes number from zero | 33120015 | 433490 | LA-UR-06-0803-MX20_NODES_0_TO_255_EVENTS.csv | ||
system 20 node internal disk failure info nodes number from zero | 209 | 14 | LA-UR-06-6079-MX20_NODES_0_TO_255_NODEDISK-Z.TXT | ||
system 15 usage with node info nodes number from zero | 2416139 | 17823 | LA-UR-06-0999-MX15-NODE-Z.TXT | ||
system 16 usage with node info nodes number from one | 321293488 | 1630479 | LA-UR-06-1446-MX16-NODE-NOZ.TXT | ||
system 23 usage with node info nodes number from one | 60674531 | 654927 | LA-UR-06-1447-MX23-NODE-NOZ.TXT | ||
system 8 usage with node info nodes number from one | 67291020 | 763293 | LA-UR-06-3194-MX8-NODE-NOZ.TXT |
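When downloading, the record counts listed in the table above make a simple integrity check. The sketch below counts data records in a CSV file; it assumes the first row is a header, so consult the accompanying readme for the actual column layout before doing real analysis.

```python
# Sketch: count data records in one of the CSV files and compare
# against the record count listed in the table above. Assumes the
# first row is a header; see the readme for the real column layout.
import csv

def count_records(path: str) -> int:
    with open(path, newline="") as fh:
        reader = csv.reader(fh)
        next(reader, None)  # skip the first row, assumed to be a header
        return sum(1 for _ in reader)
```

For example, the all-systems failure file is listed above with 23741 records, so a mismatch after download suggests a truncated transfer.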
Reliability/Interrupt/Failure/Usage Data Sets for the 1995-2011 timeframe
Consolidated failure data covering 1995 through 09/2011 will be available here. Available soon.