DISC is operational and accepting users whose jobs involve large datasets and who would like to explore the possibilities offered by a cluster designed for data-intensive applications. If you are interested in running a job on DISC, contact Reese Baird.
The IS&T Data Exploratory is a unique data-intensive supercomputing facility designed to explore the challenges of working with large datasets. Sensors, Internet packet filters, telescopes, and satellites all generate massive amounts of data. To solve problems in areas like energy security, bio-security, and cosmology, we need effective ways to analyze this data. The Data Exploratory will provide a trial facility in an open environment where researchers can experiment with data-intensive methods.
The current cluster has 60 nodes, each with 6 GB of memory and 2 TB of disk space. A new 720-node data-intensive cluster is being constructed and will be available to preliminary users beginning in September 2010. The 720 nodes will be divided into three segments with different use plans. The first segment will have 16 GB of memory and 4 TB of disk space on each node and will be on the yellow network. The second segment will be brought up later and is planned for the turquoise network; it will have 8 GB of memory and 4 TB of disk space on each node. The third segment will have 128 nodes, each with a 128 GB solid-state drive. The remaining nodes will be used for testing and experimentation.
One of the challenges of large datasets is representing the data in a manageable format that scientists can learn from. The IS&T Data Exploratory will include a visualization environment, the DISC Visualization Collaboratory, to provide an alternative method for looking at results from data intensive applications. The goal is to find new ways of using visualization for information applications, to provide an abstraction of the data rather than a simulation of it.
Scientific computing has generally consisted of simulation workloads. However, an emerging trend is data-intensive scientific computing: applications that are I/O-bound rather than compute-bound.
Allegrograph is a software package for creating triple-stores which connect data in subject, predicate, object relationships. Allegrograph uses disk-based storage making it a good fit for the DISC cluster. The software has support for social network, geospatial, and semantic web applications. Since the data is stored in graph format it can be useful to look at the data graphically. The DISC Viz Collaboratory will provide this capability. We are putting a federated installation of Allegrograph on the DISC cluster and already have several interested users.
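In practice, AllegroGraph is accessed through its own client libraries; as a language-neutral illustration of the subject, predicate, object model a triple-store is built on, here is a minimal in-memory sketch in Python. All names here are hypothetical, and this is not the AllegroGraph API.

```python
# Minimal in-memory triple store: facts are (subject, predicate, object)
# tuples, and queries match patterns where None acts as a wildcard.
# Illustrative sketch only -- not the AllegroGraph API.

class TripleStore:
    def __init__(self):
        self.triples = set()

    def add(self, subject, predicate, obj):
        self.triples.add((subject, predicate, obj))

    def query(self, subject=None, predicate=None, obj=None):
        """Return all triples matching the pattern; None is a wildcard."""
        return [t for t in self.triples
                if (subject is None or t[0] == subject)
                and (predicate is None or t[1] == predicate)
                and (obj is None or t[2] == obj)]

# Social-network-style facts:
store = TripleStore()
store.add("alice", "knows", "bob")
store.add("bob", "knows", "carol")
store.add("alice", "worksAt", "lab")

# Whom does alice know?
print(store.query(subject="alice", predicate="knows"))
```

Because every fact is an edge from subject to object, a collection of triples is naturally a graph, which is what makes graphical exploration in the Viz Collaboratory a good fit.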
Cosmology simulations can generate petabytes, even exabytes, of data. Once the data is generated it must be processed to yield useful knowledge, which is a heavily I/O-bound operation. One such cosmology application uses a clustering technique to highlight interesting regions of the simulation space, as shown in the figure.
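The source does not specify the clustering algorithm; one common technique in cosmology is friends-of-friends, where particles closer than a linking length are grouped transitively. A simplified sketch over hypothetical 2-D points:

```python
# Friends-of-friends clustering sketch: points within a linking length
# of any cluster member join that cluster. Simplified, hypothetical
# example -- real cosmology codes work on billions of 3-D particles.
from math import dist

def friends_of_friends(points, linking_length):
    """Group point indices into clusters via transitive proximity."""
    unassigned = set(range(len(points)))
    clusters = []
    while unassigned:
        seed = unassigned.pop()
        cluster, frontier = {seed}, [seed]
        while frontier:
            i = frontier.pop()
            # Pull in every unassigned point within reach of point i.
            neighbors = {j for j in unassigned
                         if dist(points[i], points[j]) <= linking_length}
            unassigned -= neighbors
            cluster |= neighbors
            frontier.extend(neighbors)
        clusters.append(sorted(cluster))
    return clusters

points = [(0, 0), (0.5, 0), (10, 10), (10.4, 10.2)]
print(friends_of_friends(points, linking_length=1.0))
```

The pairwise-distance scan is quadratic; production codes replace it with spatial indexing, but the grouping logic is the same.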
A mounting concern at the lab is cybersecurity. One way to ensure lab security is to analyze all lab network traffic. With so much traffic generated daily, the analysis becomes very data-intensive.
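As a sketch of the kind of streaming aggregation such analysis involves, the following totals bytes per source address over flow records. The (src, dst, bytes) record format is a hypothetical simplification of real capture data.

```python
# Streaming aggregation over network flow records: total bytes sent
# per source address. Record format (src, dst, bytes) is hypothetical.
from collections import Counter

def bytes_by_source(flows):
    """Accumulate per-source byte counts in a single pass."""
    totals = Counter()
    for src, dst, nbytes in flows:
        totals[src] += nbytes
    return totals

flows = [
    ("10.0.0.1", "10.0.0.9", 1500),
    ("10.0.0.2", "10.0.0.9", 600),
    ("10.0.0.1", "10.0.0.7", 900),
]
totals = bytes_by_source(flows)
print(totals.most_common(1))  # the heaviest talker on the network
```

A single pass with constant memory per distinct source is what makes this style of analysis feasible when the flow log itself is far too large to hold in memory.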
This example of image processing takes a photo and identifies potential threats. While the example is simplistic, the idea can easily be extended to something more complex, such as threat analysis on GIS data or even real-time video, which becomes much more data-intensive.
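The source does not describe the detection method; a toy stand-in is simple thresholding, flagging unusually bright pixels in a grid of brightness values. The image and threshold below are hypothetical, and a real pipeline would use computer-vision libraries on far larger data.

```python
# Toy "threat detection" by thresholding: flag pixel coordinates whose
# brightness exceeds a threshold. Image data here is hypothetical.

def flag_bright_regions(image, threshold):
    """Return (row, col) coordinates of pixels above the threshold."""
    return [(r, c)
            for r, row in enumerate(image)
            for c, value in enumerate(row)
            if value > threshold]

# A 3x3 grayscale image with a bright patch in the upper right:
image = [
    [10,  12, 240],
    [11, 250, 245],
    [ 9,  10,  13],
]
print(flag_bright_regions(image, threshold=200))
```

Per-pixel operations like this parallelize trivially across nodes, which is why image and video analysis maps well onto a data-intensive cluster.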