DISC is operational and accepting users whose jobs involve large datasets and who would like to explore the possibilities offered by a cluster designed for data-intensive applications. If you are interested in running a job on DISC, contact Reese Baird.
The IS&T Data Exploratory is a unique data-intensive supercomputing facility designed to explore the challenges of working with large datasets. Sensors, Internet packet filters, telescopes, and satellites all generate massive amounts of data. To solve problems in areas like energy security, bio-security, and cosmology, we need effective ways to analyze this data. The Data Exploratory will provide a trial facility in an open environment where researchers can experiment with data-intensive methods.
The current cluster has 60 nodes, each with 6 GB of memory and 2 TB of disk space. A new 720-node data-intensive cluster is being constructed and will be available to preliminary users beginning in September 2010. The 720 nodes will be divided into three segments with different use plans. The first segment will have 16 GB of memory and 4 TB of disk space on each node and will be on the yellow network. The second segment will be brought up later and is planned for the turquoise network; it will have 8 GB of memory and 4 TB of disk space on each node. The third segment will have 128 nodes, each with a 128 GB solid-state drive. The remaining nodes will be used for testing and experimentation.
One of the challenges of large datasets is representing the data in a manageable format that scientists can learn from. The IS&T Data Exploratory will include a visualization environment, the DISC Visualization Collaboratory, to provide an alternative method for looking at results from data intensive applications. The goal is to find new ways of using visualization for information applications, to provide an abstraction of the data rather than a simulation of it.
Scientific computing has generally consisted of simulation workloads. However, an emerging trend is data-intensive scientific computing: applications that are I/O-bound rather than compute-bound.
Allegrograph is a software package for creating triple-stores which connect data in subject, predicate, object relationships. Allegrograph uses disk-based storage making it a good fit for the DISC cluster. The software has support for social network, geospatial, and semantic web applications. Since the data is stored in graph format it can be useful to look at the data graphically. The DISC Viz Collaboratory will provide this capability. We are putting a federated installation of Allegrograph on the DISC cluster and already have several interested users.
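In practice, AllegroGraph is accessed through its own client libraries; as a language-neutral illustration of the subject, predicate, object model a triple-store is built on, here is a minimal in-memory sketch in Python. All names here are hypothetical, and this is not the AllegroGraph API.

```python
# Minimal in-memory triple store: facts are (subject, predicate, object)
# tuples, and queries match patterns where None acts as a wildcard.
# Illustrative sketch only -- not the AllegroGraph API.

class TripleStore:
    def __init__(self):
        self.triples = set()

    def add(self, subject, predicate, obj):
        self.triples.add((subject, predicate, obj))

    def query(self, subject=None, predicate=None, obj=None):
        """Return all triples matching the pattern; None is a wildcard."""
        return [t for t in self.triples
                if (subject is None or t[0] == subject)
                and (predicate is None or t[1] == predicate)
                and (obj is None or t[2] == obj)]

# Social-network-style facts:
store = TripleStore()
store.add("alice", "knows", "bob")
store.add("bob", "knows", "carol")
store.add("alice", "worksAt", "lab")

# Whom does alice know?
print(store.query(subject="alice", predicate="knows"))
```

Because every fact is an edge from subject to object, a collection of triples is naturally a graph, which is what makes graphical exploration in the Viz Collaboratory a good fit.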
Cosmology simulations can generate petabytes, even exabytes, of data. Once the data is generated it must be processed to yield useful knowledge, which is a heavily I/O-bound operation. One such cosmology application uses a clustering technique to highlight interesting regions of the simulation space, as shown in the figure.
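The source does not specify the clustering algorithm; one common technique in cosmology is friends-of-friends, where particles closer than a linking length are grouped transitively. A simplified sketch over hypothetical 2-D points:

```python
# Friends-of-friends clustering sketch: points within a linking length
# of any cluster member join that cluster. Simplified, hypothetical
# example -- real cosmology codes work on billions of 3-D particles.
from math import dist

def friends_of_friends(points, linking_length):
    """Group point indices into clusters via transitive proximity."""
    unassigned = set(range(len(points)))
    clusters = []
    while unassigned:
        seed = unassigned.pop()
        cluster, frontier = {seed}, [seed]
        while frontier:
            i = frontier.pop()
            # Pull in every unassigned point within reach of point i.
            neighbors = {j for j in unassigned
                         if dist(points[i], points[j]) <= linking_length}
            unassigned -= neighbors
            cluster |= neighbors
            frontier.extend(neighbors)
        clusters.append(sorted(cluster))
    return clusters

points = [(0, 0), (0.5, 0), (10, 10), (10.4, 10.2)]
print(friends_of_friends(points, linking_length=1.0))
```

The pairwise-distance scan is quadratic; production codes replace it with spatial indexing, but the grouping logic is the same.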
A mounting concern at the lab is cybersecurity. One way to ensure lab security is to analyze all lab network traffic. With so much traffic generated daily, the analysis becomes very data-intensive.
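As a sketch of the kind of streaming aggregation such analysis involves, the following totals bytes per source address over flow records. The (src, dst, bytes) record format is a hypothetical simplification of real capture data.

```python
# Streaming aggregation over network flow records: total bytes sent
# per source address. Record format (src, dst, bytes) is hypothetical.
from collections import Counter

def bytes_by_source(flows):
    """Accumulate per-source byte counts in a single pass."""
    totals = Counter()
    for src, dst, nbytes in flows:
        totals[src] += nbytes
    return totals

flows = [
    ("10.0.0.1", "10.0.0.9", 1500),
    ("10.0.0.2", "10.0.0.9", 600),
    ("10.0.0.1", "10.0.0.7", 900),
]
totals = bytes_by_source(flows)
print(totals.most_common(1))  # the heaviest talker on the network
```

A single pass with constant memory per distinct source is what makes this style of analysis feasible when the flow log itself is far too large to hold in memory.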
This example of image processing takes a photo and identifies potential threats. While the example is simplistic, the idea can easily be extended to something more complex, such as threat analysis on GIS data or even real-time video, which becomes much more data-intensive.
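The source does not describe the detection method; a toy stand-in is simple thresholding, flagging unusually bright pixels in a grid of brightness values. The image and threshold below are hypothetical, and a real pipeline would use computer-vision libraries on far larger data.

```python
# Toy "threat detection" by thresholding: flag pixel coordinates whose
# brightness exceeds a threshold. Image data here is hypothetical.

def flag_bright_regions(image, threshold):
    """Return (row, col) coordinates of pixels above the threshold."""
    return [(r, c)
            for r, row in enumerate(image)
            for c, value in enumerate(row)
            if value > threshold]

# A 3x3 grayscale image with a bright patch in the upper right:
image = [
    [10,  12, 240],
    [11, 250, 245],
    [ 9,  10,  13],
]
print(flag_bright_regions(image, threshold=200))
```

Per-pixel operations like this parallelize trivially across nodes, which is why image and video analysis maps well onto a data-intensive cluster.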