

Platforms and System Software

Platforms

System Software


The History of Platforms at Sandia:

In the mid-eighties, Sandia began to take parallel computing seriously and in 1987 formed the Massively Parallel Computing Research Lab (MPCRL). In addition to pioneering much of the system software, many algorithmic advances, and computational methods for massively parallel computing, the MPCRL fielded the first large-scale massively parallel computers and demonstrated their potential for technical computing.

In 1987, we fielded the first 1000+ processor computer, the nCUBE-10, a 1024-processor system based on a 10-dimensional hypercube network. With the nCUBE-10, we were the first to achieve over 100-fold and over 1000-fold speedup (measured against the speed of a single processor of the same machine) on real scientific and technical applications. This work led to the only Karp Challenge ever awarded (for the first achievement of more than 100-fold speedup), to the first Gordon Bell Award (for three applications that achieved speedups of over 1000 on a 1024-processor system), to R&D-100 Awards, and to patents for MPP computing techniques.
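The speedup figures above are ratios of run times on the same machine: the time on one processor divided by the time on many. The following is a minimal sketch of that arithmetic, not code from the original work. The function names and the timings are hypothetical and chosen only for illustration.

```python
def speedup(t_single: float, t_parallel: float) -> float:
    """Speedup measured against one processor of the same machine."""
    if t_single <= 0 or t_parallel <= 0:
        raise ValueError("run times must be positive")
    return t_single / t_parallel


def efficiency(t_single: float, t_parallel: float, processors: int) -> float:
    """Fraction of ideal (linear) speedup actually achieved."""
    if processors < 1:
        raise ValueError("processors must be at least 1")
    return speedup(t_single, t_parallel) / processors


# Hypothetical timings (seconds), not measurements from the nCUBE-10.
t_one = 20000.0   # assumed run time on a single processor
t_many = 20.0     # assumed run time on 1024 processors

print(speedup(t_one, t_many))           # 1000.0
print(efficiency(t_one, t_many, 1024))  # 0.9765625
```

With these assumed timings, a 1000-fold speedup on 1024 processors corresponds to a parallel efficiency of just under 98 percent.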

In 1988, we fielded a 16384-way CM-2 from Thinking Machines Corporation that was later upgraded to a CM-200. This was a SIMD (single instruction, multiple data stream) system and we evaluated its performance against MIMD (multiple instruction, multiple data stream) systems like those we had from the nCUBE Corporation. The CM-2 had 16384 1-bit processing elements and 512 32-bit floating point units (32 single-bit elements were associated with each floating point unit).

In 1990, we fielded two 1024-processor nCUBE-2 supercomputers. The nCUBE-2 also had a 10-dimensional hypercube network. It was the first MPP competitive with the parallel vector machines from Cray Research. It had a peak floating point capability of 2 Gigaflops, and we achieved a large fraction of that peak speed on many real applications. Its outstanding performance was enabled by its highly balanced design (fast network) and by its highly efficient system software. This platform enabled us to develop our first light-weight kernel (LWK) operating system, SUNMOS.
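For readers unfamiliar with the topology, a 10-dimensional hypercube connects 2^10 = 1024 nodes so that two nodes are linked exactly when their binary addresses differ in one bit. The sketch below illustrates that general property only. It is not nCUBE system software, and the function names are hypothetical.

```python
from typing import List


def hypercube_neighbors(node: int, dimensions: int) -> List[int]:
    """Return the nodes directly linked to `node` in a hypercube.

    Flipping each of the `dimensions` address bits yields one neighbor.
    """
    if not 0 <= node < (1 << dimensions):
        raise ValueError("node address out of range")
    return [node ^ (1 << bit) for bit in range(dimensions)]


def hop_distance(a: int, b: int) -> int:
    """Minimum number of links between two nodes (differing address bits)."""
    return bin(a ^ b).count("1")


DIMENSIONS = 10
print(1 << DIMENSIONS)                          # 1024 nodes in total
print(len(hypercube_neighbors(0, DIMENSIONS)))  # 10 links per node
print(hop_distance(0, 1023))                    # 10 hops at most
```

Each node needs only 10 links, yet any two of the 1024 nodes are at most 10 hops apart.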

In 1992, we fielded a 64-processor Intel iPSC/860. This machine was an interesting research engine and prepared the way for a partnership with Intel that lasted until we jointly fielded the world's first terascale system in 1996.

In 1993, we fielded the 3800+ processor Intel Paragon with a peak speed in excess of 100 Gigaflops. It was the first MPP to be indisputably the fastest computer in the world. The operating system supplied by Intel for the Paragon, OSF-1, failed to scale well: with large numbers of processors the OS buffers took up the entire memory of the system, and OS overhead increased to huge levels as the machine grew. Intel eventually fixed these problems with the Paragon OS. However, we didn't wait. Within four months of installing the Paragon at Sandia we had ported our LWK, SUNMOS, to the Paragon, and it and its associated runtime software became the basis of operations on that machine. At the same time we began to develop a second-generation LWK called PUMA, which eventually replaced SUNMOS and which Intel and Sandia would later use as the basis for Cougar, the LWK that powered TFlops (also known as ASCI RED), the first machine to exceed a teraflops on Linpack and later on real applications.

In 1996, Intel and Sandia fielded a 9300+ processor MPP at Sandia that had a peak floating point rating of over 1.8 teraflops (TF). This machine achieved over 1 TF on Linpack as part of its acceptance process. It was later upgraded to a 3.1 TF peak rating and its memory was doubled to 1.2 terabytes (TB). It was the fastest machine in the world from early 1997 into 2002. It was also one of the most reliable machines ever built, based in part on reliability, availability, and serviceability (RAS) being built into every feature of the design and in part on its Sandia-provided partitioned operating environment, with most of the nodes running Sandia's third-generation, minimalist LWK operating system, Cougar. As of 2003, RED is still in full production at Sandia.

In 2002, Sandia and Cray, Inc. entered into a contract to develop and field RED STORM, a Sandia-architected, Cray-engineered MPP with over 10,000 processors. RED STORM will provide a highly balanced, cost-effective, and reliable MPP by drawing on the heritage of Sandia's nCUBEs, Paragon, and RED, and Cray's T3D and T3E systems. The RED STORM architecture is designed to scale to 30,000+ processors and up to a petaflops in later versions. The initial system will have a rated peak in the 50-TF range. It will be ready for use in late 2004.



Maintained by: Bernadette Watts
Modified on: May 6, 2008