Blackbox is a project to streamline the movement of large data sets between computational clusters and long-term archival storage. The current workflow is difficult and frustrating for researchers, and because supercomputer time is expensive and the data sets are extremely large, this is a critical issue.
The proposed goal is to let a user stay on the compute cluster while offloading filesystem operations to a dedicated cluster, designed with the users' perceptions of and responses to this model in mind. Compute nodes can then keep computing rather than being degraded by file transfers at the expense of computation.
Krishna Devarapalli
UCSC Student
Shashwat Kandadai
UCSC Student
David Laone
UCSC Student
Mark Wagner
UCSC Student
Scott Brandt
UCSC Instructor
Huahai Yang
UCSC Instructor
Jharrod LaFon
Mentor
David Sherrill
Mentor
David R. Montoya
Mentor
Jonathan Bringhurst
Mentor
Today's file systems use interfaces largely conforming to POSIX I/O, a standard codifying designs from the late 1960s and early 1970s, when high-end file systems stored less than 100 MB. Today's high-end file systems, in supercomputing environments and at search-engine and social-network companies, are 7 to 9 orders of magnitude larger, holding numbers of data items for which POSIX abstractions are quite inadequate, and with the advent of exascale systems this inadequacy will only grow worse. We propose to coalesce the functionality of data analysis, data management, and file systems by extending file systems with data management services, including declarative querying, distributed query planning and optimization, automatic indexing, a common data model for most scientific file formats, and provenance tracking that spans multiple layers of abstraction. These services are optimized for predictability, throughput, and low latency while minimizing data movement and power consumption. We will leverage important insights gained by both the file system and database communities; however, the sheer amount of data managed by file systems, and its associated access patterns, require a design very different from current database management systems.
SUBPROJECTS:
"SciHadoop" (Joe Buck)
"FLAMBES: Scalable Simulation of Parallel File Systems" (Adam Crume)
"HDF5-FS" (Latchesar Ionkov)
"Redundant Indexing" (Jeff LeFevre)
"Parallel Querying of Scientific Data" (Noah Watkins)
"QMDS:Queriable File System Metadata Services" (Sasha Ames, Funded by LLNL)
Sasha Ames
UCSC Student
Joe Buck
UCSC Student
Noah Watkins
UCSC Student
Latchesar Ionkov
UCSC Student
Jeff LeFevre
UCSC Student
Adam Crume
UCSC Student
Wang-Chiew Tan
UCSC Instructor
Neoklis Polyzotis
UCSC Instructor
Scott Brandt
UCSC Instructor
Carlos Maltzahn
UCSC Instructor
Maya Gokhale
LLNL Instructor
Kleoni Ioannidou
UCSC Instructor
Gary Grider
Mentor
John Bent
Mentor
James Ahrens
Mentor
Carolyn Connor
Mentor
Michael Lang
Mentor
We propose to apply information technology developed for personalized recommendation systems and collaborative filtering (e.g., Amazon's "Customers who bought this also bought" or Netflix's "You might also like") to ill-defined scientific problems where choosing the most relevant information from a large amount of slightly relevant information becomes literally a matter of vital importance. The scenario is an outbreak of an infectious pathogen whose nature is known from previous analysis. The question then becomes: what to do next? Quarantine, triage, treat, limit further exposure, locate drug and vaccine supplies, determine their likely efficacy, the most efficient way to use them, and the time they will take to arrive, find subject experts, and identify possible origins and countermeasures. Much information that could be highly relevant in this scenario exists on the internet, but locating it and assessing its reliability and relevance is extremely difficult. We believe that personalized recommendation systems can play a role in focusing these searches and finding the most important and urgently needed information.
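The collaborative-filtering idea above can be sketched with a toy item-based similarity computation. Everything here is invented for illustration: the relevance matrix, the document names, and the helper functions are assumptions, not part of any actual system described in this project.

```python
import math

def cosine(u, v):
    """Cosine similarity between two relevance vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Rows: documents; columns: whether each of four analysts marked the
# document relevant (1) or not (0). Purely hypothetical data.
relevance = {
    "quarantine-protocol": [1, 1, 0, 1],
    "vaccine-logistics":   [1, 1, 0, 0],
    "unrelated-memo":      [0, 0, 1, 0],
}

def most_similar(item):
    """'Analysts who found this relevant also found ... relevant'."""
    others = [(cosine(relevance[item], v), name)
              for name, v in relevance.items() if name != item]
    return max(others)[1]

print(most_similar("quarantine-protocol"))  # → vaccine-logistics
```

A real system would of course operate over far larger, sparser matrices and weigh source reliability, but the core "people like you found X useful" computation has this shape.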
Lanbo Zhang
UCSC Student
Yi Zhang
UCSC Instructor
Benjamin McMahon
Mentor
Reid Priedhorsky
Mentor
Rajan Vaish
UCSC Student
Neoklis Polyzotis
UCSC Instructor
James Davis
UCSC Instructor
Reid Porter
Mentor
Reid Priedhorsky
Mentor
This project examines a promising architecture that uses a limited number of SSDs (solid-state drives) to decrease power consumption while increasing either the performance or the reliability of RAID storage subsystems. We are researching the use of SSDs for parity storage in a hybrid disk/SSD RAID system.
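As a minimal sketch of what parity-on-SSD means, the following shows the XOR parity computation at the heart of RAID-4/5-style redundancy; in the hybrid layout the data units would live on spinning disks and the parity unit on the SSD. The function names and example stripe are illustrative assumptions, not the project's implementation.

```python
def xor_parity(stripe_units):
    """XOR equal-sized data units of one stripe to produce the parity unit."""
    parity = bytearray(len(stripe_units[0]))
    for unit in stripe_units:
        for i, byte in enumerate(unit):
            parity[i] ^= byte
    return bytes(parity)

def reconstruct(surviving_units, parity):
    """Rebuild one lost data unit by XORing the parity with the survivors."""
    return xor_parity(surviving_units + [parity])

data = [b"\x01\x02", b"\x03\x04", b"\x05\x06"]  # units on spinning disks
p = xor_parity(data)                            # parity unit, placed on the SSD
assert reconstruct(data[1:], p) == data[0]      # recovery after losing one disk
```

Placing the parity unit on flash means every small write updates the SSD rather than forcing an extra seek on a spinning disk, which is where the power and performance benefits are expected to come from.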
Rosie Wacha
UCSC Student
Scott Brandt
UCSC Instructor
Gary Grider
Mentor
James Nunez
Mentor
John Bent
Mentor
Meghan McClelland
Mentor
Uliana Popov
UCSC Student
Alex Pang
UCSC Instructor
Christopher Brislawn
Mentor
This project was first conceived as a storage system for the Long Wavelength Array (LWA) radio telescope, and was subsequently extended to incorporate several other problems with similar themes. The problem is characterized by ultra-high-bandwidth data streams in which most data is useless, but occasionally some is useful and not recognized as such until later. These traits combine into an odd set of requirements: there is far too much data to retain in the long term, yet it must be stored in the short term in order to determine whether or not a portion of it is useful.
In the case of the LWA, a practical example of this data pattern is a significant astronomical event detected at time T. To understand as much about the event as possible, scientists might declare that they need all data from the ten minutes prior to T, as well as the data for some amount of time after T. Similarly, in a cybersecurity example, an intrusion may be detected at time T coming from source A; to track the intrusion as closely as possible, network security personnel might declare that they need all IP packets from source A prior to T, as far back as possible. The two examples involve very similar access patterns, but differ in how the data is organized. In one case, a system must save and track large data elements without needing to search on anything more complicated than time. In the other, a system must save and track very small, highly structured data elements and search on many aspects of each element, for example origin, destination, and size. The disparity between these two types of data makes a general-purpose solution much more difficult.
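The "retain recent data, preserve a window around an event" pattern described above can be sketched as a time-indexed ring buffer. This is a toy model under obvious assumptions (everything fits in memory, one record per append); class and parameter names are hypothetical.

```python
from collections import deque

class StreamBuffer:
    """Retain only the most recent `capacity` (timestamp, record) pairs;
    on an event at time T, extract everything in [T - lookback, T]."""

    def __init__(self, capacity):
        # deque with maxlen silently discards the oldest entry on overflow,
        # mimicking short-term storage that cannot grow without bound.
        self.buf = deque(maxlen=capacity)

    def append(self, ts, record):
        self.buf.append((ts, record))

    def preserve(self, t_event, lookback):
        """Copy out the window preceding an event before it is overwritten."""
        return [(ts, r) for ts, r in self.buf
                if t_event - lookback <= ts <= t_event]

sb = StreamBuffer(capacity=5)
for t in range(10):
    sb.append(t, f"sample-{t}")
print(sb.preserve(t_event=9, lookback=3))  # entries for t = 6..9
```

The cybersecurity variant would replace the timestamp-only filter with a query over structured fields (origin, destination, size), which is exactly the indexing disparity the paragraph above describes.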
The primary focus of this research is providing and maintaining quality-of-service guarantees for real-time data on standard disk drives. The problem is complicated by additional requirements that must be integrated into those guarantees: other processes must be able to access the drive, reliability mechanisms must be implemented to ensure data resiliency, the data must be indexed and quickly searchable for when preservation commands are given, and the system must scale upward to meet increasing bandwidth requirements. None of these extra requirements can be allowed to disrupt the central need for quality-of-service guarantees on the real-time data stream. The project also has research potential for a number of related secondary goals. Due to the nature of the data collection, an opportunity exists to test how well hard drives perform under unrelenting high-bandwidth writing over months and years. Solid-state storage devices are unsuitable for managing this type of high-bandwidth data due to their limited write cycles, but may be extremely useful when applied only to specific indexing needs. Non-standard error-correction codes not normally used in standard storage systems may find a new place in a system where every single write is a "large write" and reading, even for repair work, is extremely rare.
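One common way to keep best-effort I/O from disrupting a real-time stream is admission control over a fixed bandwidth budget; the following is a minimal sketch of that idea, not the project's actual scheduler. The budget figure and function names are assumptions for illustration.

```python
# Hypothetical per-period disk bandwidth budget (MB/s) available for
# real-time reservations; best-effort I/O gets whatever slack remains.
DISK_BUDGET_MBPS = 100

def admit(reserved_mbps, requested_mbps):
    """Admit a new real-time stream only if the remaining budget covers it,
    so existing quality-of-service guarantees are never violated."""
    return sum(reserved_mbps) + requested_mbps <= DISK_BUDGET_MBPS

streams = [60]               # an existing real-time reservation
print(admit(streams, 30))    # True: 90 <= 100, guarantee still holds
print(admit(streams, 50))    # False: 110 > 100, would break the guarantee
```

A real disk scheduler must also account for seek overhead and reorder requests, but the invariant is the same: reservations for the real-time stream are checked before any competing work is allowed to consume bandwidth.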
David Bigelow
UCSC Student
Scott Brandt
UCSC Instructor
Gary Grider
Mentor
John Bent
Mentor
HB Chen
Mentor
Sarah Michalak
Mentor
Janelle Yong
UCSC Student
Don Wiberg
UCSC Instructor
Eric Raby
Mentor
John Galbraith
Mentor