Blackbox is a project to streamline the movement of large data sets between computational clusters and long-term archival storage. The current workflow is difficult and frustrating for researchers, and because supercomputer time is expensive and the data sets are extremely large, this is a critical issue.
The proposed goal is to let a user stay on the compute cluster while offloading filesystem operations to a dedicated cluster, designed with the users' perceptions of and responses to this model in mind. Compute nodes can then keep computing rather than being degraded by file transfers at the expense of computation.
Krishna Devarapalli
UCSC Student
Shashwat Kandadai
UCSC Student
David Laone
UCSC Student
Mark Wagner
UCSC Student
Scott Brandt
UCSC Instructor
Huahai Yang
UCSC Instructor
Jharrod LaFon
Mentor
David Sherrill
Mentor
David R. Montoya
Mentor
Jonathan Bringhurst
Mentor
Today's file systems use interfaces largely conforming to POSIX I/O, a standard codifying designs from the late 1960s and early 1970s, when high-end file systems stored less than 100 MB. Today's high-end file systems, in supercomputing environments and at search-engine and social-network companies, are 7 to 9 orders of magnitude larger, holding numbers of data items for which POSIX abstractions are quite inadequate, and with the advent of exascale systems this inadequacy will only grow worse. We propose to coalesce the functionality of data analysis, data management, and file systems by extending file systems with data management services, including declarative querying, distributed query planning and optimization, automatic indexing, a common data model for most scientific file formats, and provenance tracking that spans multiple layers of abstraction. These services are optimized for predictability, throughput, and low latency while minimizing data movement and power consumption. We will leverage important insights gained by both the file system and database communities; however, the sheer amount of data managed by file systems, and its associated access patterns, require a design very different from current database management systems.
SUBPROJECTS:
"SciHadoop" (Joe Buck)
"FLAMBES: Scalable Simulation of Parallel File Systems" (Adam Crume)
"HDF5-FS" (Latchesar Ionkov)
"Redundant Indexing" (Jeff LeFevre)
"Parallel Querying of Scientific Data" (Noah Watkins)
"QMDS:Queriable File System Metadata Services" (Sasha Ames, Funded by LLNL)
Sasha Ames
UCSC Student
Joe Buck
UCSC Student
Noah Watkins
UCSC Student
Latchesar Ionkov
UCSC Student
Jeff LeFevre
UCSC Student
Adam Crume
UCSC Student
Wang-Chiew Tan
UCSC Instructor
Neoklis Polyzotis
UCSC Instructor
Scott Brandt
UCSC Instructor
Carlos Maltzahn
UCSC Instructor
Maya Gokhale
LLNL Instructor
Kleoni Ioannidou
UCSC Instructor
Gary Grider
Mentor
John Bent
Mentor
James Ahrens
Mentor
Carolyn Connor
Mentor
Michael Lang
Mentor
We propose to apply information technology developed for personalized recommendation systems and collaborative filtering (e.g., Amazon's "Customers who bought this also bought" or Netflix's "You might also like") to ill-defined scientific problems where choosing the most relevant information from a large amount of slightly relevant information becomes literally a matter of vital importance. The scenario is an outbreak of an infectious pathogen whose nature is known from previous analysis. The question then becomes: what to do next? Quarantine, triage, treat, limit further exposure, locate drug and vaccine supplies, determine their likely efficacy, the most efficient way to use them, and the time they will take to arrive, find subject experts, and identify possible origins and countermeasures. Much information that could be highly relevant in this scenario exists on the internet, but locating it and assessing its reliability and relevance is extremely difficult. We believe that personalized recommendation systems can play a role in focusing these searches and finding the most important and urgently needed information.
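The collaborative-filtering idea above can be sketched with a toy item-based similarity computation. Everything here is invented for illustration: the relevance matrix, the document names, and the helper functions are assumptions, not part of any actual system described in this project.

```python
import math

def cosine(u, v):
    """Cosine similarity between two relevance vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Rows: documents; columns: whether each of four analysts marked the
# document relevant (1) or not (0). Purely hypothetical data.
relevance = {
    "quarantine-protocol": [1, 1, 0, 1],
    "vaccine-logistics":   [1, 1, 0, 0],
    "unrelated-memo":      [0, 0, 1, 0],
}

def most_similar(item):
    """'Analysts who found this relevant also found ... relevant'."""
    others = [(cosine(relevance[item], v), name)
              for name, v in relevance.items() if name != item]
    return max(others)[1]

print(most_similar("quarantine-protocol"))  # → vaccine-logistics
```

A real system would of course operate over far larger, sparser matrices and weigh source reliability, but the core "people like you found X useful" computation has this shape.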
Lanbo Zhang
UCSC Student
Yi Zhang
UCSC Instructor
Benjamin McMahon
Mentor
Reid Priedhorsky
Mentor
Rajan Vaish
UCSC Student
Neoklis Polyzotis
UCSC Instructor
James Davis
UCSC Instructor
Reid Porter
Mentor
Reid Priedhorsky
Mentor
This project examines a promising architecture that uses a limited number of SSDs (solid-state drives) to decrease power consumption while increasing either the performance or the reliability of RAID storage subsystems. We are researching the use of SSDs for parity storage in a hybrid disk/SSD RAID system.
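As a minimal sketch of what parity-on-SSD means, the following shows the XOR parity computation at the heart of RAID-4/5-style redundancy; in the hybrid layout the data units would live on spinning disks and the parity unit on the SSD. The function names and example stripe are illustrative assumptions, not the project's implementation.

```python
def xor_parity(stripe_units):
    """XOR equal-sized data units of one stripe to produce the parity unit."""
    parity = bytearray(len(stripe_units[0]))
    for unit in stripe_units:
        for i, byte in enumerate(unit):
            parity[i] ^= byte
    return bytes(parity)

def reconstruct(surviving_units, parity):
    """Rebuild one lost data unit by XORing the parity with the survivors."""
    return xor_parity(surviving_units + [parity])

data = [b"\x01\x02", b"\x03\x04", b"\x05\x06"]  # units on spinning disks
p = xor_parity(data)                            # parity unit, placed on the SSD
assert reconstruct(data[1:], p) == data[0]      # recovery after losing one disk
```

Placing the parity unit on flash means every small write updates the SSD rather than forcing an extra seek on a spinning disk, which is where the power and performance benefits are expected to come from.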
Rosie Wacha
UCSC Student
Scott Brandt
UCSC Instructor
Gary Grider
Mentor
James Nunez
Mentor
John Bent
Mentor
Meghan McClelland
Mentor
Uliana Popov
UCSC Student
Alex Pang
UCSC Instructor
Christopher Brislawn
Mentor
This project was first conceived as a storage system for the Long Wavelength Array (LWA) radio telescope, and was subsequently extended to incorporate several other problems with similar themes. The problem is characterized by ultra-high-bandwidth data streams in which most data is useless, but occasionally some is useful and not recognized as such until later. These traits combine into an odd set of requirements: there is far too much data to retain in the long term, yet it must be stored in the short term in order to determine whether or not a portion of it is useful.
In the case of the LWA, a practical example of this data pattern is a significant astronomical event detected at time T. To understand as much about the event as possible, scientists might declare that they need all data from the ten minutes prior to T, as well as the data for some amount of time after T. Similarly, in a cybersecurity example, an intrusion may be detected at time T coming from source A; to track the intrusion as closely as possible, network security personnel might declare that they need all IP packets from source A prior to T, as far back as possible. The two examples involve very similar access patterns, but differ in how the data is organized. In one case, a system must save and track large data elements without needing to search on anything more complicated than time. In the other, a system must save and track very small, highly structured data elements and search on many aspects of each element, for example origin, destination, and size. The disparity between these two types of data makes a general-purpose solution much more difficult.
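The "retain recent data, preserve a window around an event" pattern described above can be sketched as a time-indexed ring buffer. This is a toy model under obvious assumptions (everything fits in memory, one record per append); class and parameter names are hypothetical.

```python
from collections import deque

class StreamBuffer:
    """Retain only the most recent `capacity` (timestamp, record) pairs;
    on an event at time T, extract everything in [T - lookback, T]."""

    def __init__(self, capacity):
        # deque with maxlen silently discards the oldest entry on overflow,
        # mimicking short-term storage that cannot grow without bound.
        self.buf = deque(maxlen=capacity)

    def append(self, ts, record):
        self.buf.append((ts, record))

    def preserve(self, t_event, lookback):
        """Copy out the window preceding an event before it is overwritten."""
        return [(ts, r) for ts, r in self.buf
                if t_event - lookback <= ts <= t_event]

sb = StreamBuffer(capacity=5)
for t in range(10):
    sb.append(t, f"sample-{t}")
print(sb.preserve(t_event=9, lookback=3))  # entries for t = 6..9
```

The cybersecurity variant would replace the timestamp-only filter with a query over structured fields (origin, destination, size), which is exactly the indexing disparity the paragraph above describes.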
The primary focus of this research is providing and maintaining quality-of-service guarantees for real-time data on standard disk drives. The problem is complicated by additional requirements that must be integrated into those guarantees: other processes must be able to access the drive, reliability mechanisms must be implemented to ensure data resiliency, the data must be indexed and quickly searchable for when preservation commands are given, and the system must scale upward to meet increasing bandwidth requirements. None of these extra requirements can be allowed to disrupt the central need for quality-of-service guarantees on the real-time data stream. The project also has research potential for a number of related secondary goals. Due to the nature of the data collection, an opportunity exists to test how well hard drives perform under unrelenting high-bandwidth writing over months and years. Solid-state storage devices are unsuitable for managing this type of high-bandwidth data due to their limited write cycles, but may be extremely useful when applied only to specific indexing needs. Non-standard error-correction codes not normally used in standard storage systems may find a new place in a system where every single write is a "large write" and reading, even for repair work, is extremely rare.
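One common way to keep best-effort I/O from disrupting a real-time stream is admission control over a fixed bandwidth budget; the following is a minimal sketch of that idea, not the project's actual scheduler. The budget figure and function names are assumptions for illustration.

```python
# Hypothetical per-period disk bandwidth budget (MB/s) available for
# real-time reservations; best-effort I/O gets whatever slack remains.
DISK_BUDGET_MBPS = 100

def admit(reserved_mbps, requested_mbps):
    """Admit a new real-time stream only if the remaining budget covers it,
    so existing quality-of-service guarantees are never violated."""
    return sum(reserved_mbps) + requested_mbps <= DISK_BUDGET_MBPS

streams = [60]               # an existing real-time reservation
print(admit(streams, 30))    # True: 90 <= 100, guarantee still holds
print(admit(streams, 50))    # False: 110 > 100, would break the guarantee
```

A real disk scheduler must also account for seek overhead and reorder requests, but the invariant is the same: reservations for the real-time stream are checked before any competing work is allowed to consume bandwidth.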
David Bigelow
UCSC Student
Scott Brandt
UCSC Instructor
Gary Grider
Mentor
John Bent
Mentor
HB Chen
Mentor
Sarah Michalak
Mentor
Janelle Yong
UCSC Student
Don Wiberg
UCSC Instructor
Eric Raby
Mentor
John Galbraith
Mentor