Microwork in cybercafés is a promising tool for poverty alleviation. For those who cannot afford a computer, cybercafés can serve both as a simple payment channel and as a platform for work. However, there are open questions: are workers interested in working in cybercafés, are cybercafé owners willing to host such a setup, and are workers skilled enough to earn an acceptable pay rate? We designed experiments in cybercafés in India and Kenya to investigate these issues. We also investigated whether computers make workers more productive than mobile platforms. In surveys, we found that 99% of the users wanted to continue with the experiment in the cybercafé, and 8 of 9 cybercafé owners expressed interest in hosting the experiment. User typing speed was adequate to earn a pay rate comparable to their existing wages, and the fastest workers were approximately twice as productive on a computer platform.
Hadoop has become the de facto platform for large-scale data analysis in commercial applications, and increasingly so in scientific applications. However, Hadoop's byte-stream data model causes inefficiencies when used to process scientific data that is commonly stored in highly structured, array-based binary file formats, resulting in limited scalability of Hadoop applications in science. We introduce SciHadoop, a Hadoop plugin allowing scientists to specify logical queries over array-based data models. SciHadoop executes queries as map/reduce programs defined over the logical data model. We describe the implementation of a SciHadoop prototype for NetCDF data sets and quantify the performance of five separate optimizations that address the following goals for several representative aggregate queries: reduce total data transfers, reduce remote reads, and reduce unnecessary reads. Two optimizations allow holistic aggregate queries to be evaluated opportunistically during the map phase; two additional optimizations intelligently partition input data to increase read locality; and one optimization avoids block scans by examining the data dependencies of an executing query to prune input partitions. Experiments involving a holistic function show run-time improvements of up to 8x, with drastic reductions in I/O, both locally and over the network.
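The abstract describes queries posed over a logical, array-based data model rather than over Hadoop's byte streams; a key step in such a system is splitting the array's logical coordinate space into sub-array extents that map tasks can process. The sketch below is illustrative only, not the SciHadoop implementation; the chunk sizes and the `(corner, shape)` extent representation are assumptions.

```python
# Illustrative sketch (not SciHadoop itself): partition the logical coordinate
# space of an n-dimensional array into sub-array extents, so each map task
# evaluates the query over one extent instead of a raw byte range.
from itertools import product

def partition_logical_space(dims, chunk):
    """Split an array of shape `dims` into extents of at most `chunk` per axis.
    Each extent is (corner, shape) in array index space."""
    starts = [range(0, d, c) for d, c in zip(dims, chunk)]
    extents = []
    for corner in product(*starts):
        shape = tuple(min(c, d - s) for s, d, c in zip(corner, dims, chunk))
        extents.append((corner, shape))
    return extents

if __name__ == "__main__":
    # e.g. a 360x180x1460 (lon x lat x time) NetCDF variable split into
    # 90x90x365 extents; each extent would feed one map task.
    extents = partition_logical_space((360, 180, 1460), (90, 90, 365))
    print(len(extents), "extents; first:", extents[0])
```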
Computer-mediated human micro-labor markets allow human effort to be treated as a programmatic function call. We can characterize these platforms as Human Processing Units (HPUs). HPU computation can be more accurate than complex CPU-based algorithms on some important computer vision tasks. We also argue that HPU computation can be cheaper than state-of-the-art CPU-based computation. I'll give some examples of simple computer vision tasks that we have evaluated, and speculate on whether a finite computer vision instruction set is possible. Such an instruction set would allow most computer vision problems to be coded from the base instructions, and the instructions themselves would be made robust with the help of human computation.
We describe a framework for rapidly prototyping applications which require intelligent visual processing, but for which reliable algorithms do not yet exist, or for which engineering those algorithms is too costly. The framework, CrowdSight, leverages the power of crowdsourcing to offload intelligent processing to humans, and enables new applications to be built quickly and cheaply, affording system builders the opportunity to validate a concept before committing significant time or capital. Our service accepts requests from users either via email or simple mobile applications, and handles all the communication with a backend human computation platform. We build redundant requests and data aggregation into the system, freeing the user from managing these requirements. We validate our framework by building several test applications and verifying that prototypes can be built more easily and quickly than would be the case without the framework.
We introduce a new method for tuning the physical design of replicated databases. Our method installs a different (divergent) index configuration to each replica, thus specializing replicas for different subsets of the database workload. We analyze the space of divergent designs and show that there is a tension between the specialization of each replica and the ability to load-balance the database workload across different replicas. Based on our analysis, we develop an algorithm to compute good divergent designs that can balance this trade-off. Experimental results demonstrate the efficacy of our approach.
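The abstract does not spell out the tuning algorithm, so the following is only a hypothetical greedy sketch of the trade-off it describes: routing workload statements to replicas so each replica can specialize its index set, while refusing assignments that would skew per-replica load too far. The cost model, the `max_skew` bound, and the scoring heuristic are all assumptions.

```python
# Hypothetical greedy sketch of a divergent-design assignment (not the paper's
# algorithm): each replica accumulates the indexes needed by the statements
# routed to it, subject to a load-balance constraint.
def divergent_assignment(statements, n_replicas, max_skew=1.25):
    """statements: list of (cost, frozenset_of_required_indexes)."""
    replicas = [{"load": 0.0, "indexes": set()} for _ in range(n_replicas)]
    avg = sum(c for c, _ in statements) / n_replicas
    for cost, idxs in sorted(statements, key=lambda s: -s[0]):
        # Prefer replicas that already hold most of the needed indexes
        # (specialization), but skip replicas that are already overloaded.
        ok = [r for r in replicas if r["load"] + cost <= max_skew * avg] or replicas
        best = max(ok, key=lambda r: len(r["indexes"] & idxs) - r["load"] / avg)
        best["load"] += cost
        best["indexes"] |= idxs
    return replicas

# Usage: three query classes spread over two replicas.
print(divergent_assignment(
    [(10.0, frozenset({"idx_a"})), (8.0, frozenset({"idx_b"})),
     (6.0, frozenset({"idx_a", "idx_c"}))], n_replicas=2))
```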
Existing adaptive filtering systems learn user profiles based on users' relevance judgments on documents. In some cases, users have some prior knowledge about what features are important for a document to be relevant. For example, a Spanish speaker may only want news written in Spanish, and thus a relevant document should contain the feature "Language: Spanish"; a researcher working on HIV knows an article with the medical subject "MeSH: AIDS" is very likely to be interesting to him/her.
Semi-structured documents with rich faceted metadata are increasingly prevalent over the Internet. Motivated by the commonly used faceted search interface in e-commerce, we study whether users' prior knowledge about faceted features could be exploited for filtering semi-structured documents. We envision two faceted feedback solicitation mechanisms, and propose a novel user profile-learning algorithm that can incorporate user feedback on features. To evaluate the proposed work, we use two data sets from the TREC filtering track, and conduct a user study on Amazon Mechanical Turk. Our experimental results show that user feedback on faceted features is useful for filtering. The new user profile learning algorithm can effectively learn from user feedback on faceted features and performs better than several other methods adapted from the feature-based feedback techniques proposed for retrieval and text classification tasks in previous work.
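To make the idea of "feedback on faceted features" concrete, here is a minimal sketch of a linear filtering profile that mixes Rocchio-style updates from document judgments with direct user feedback on facet values such as ("Language", "Spanish"). This is a generic illustration under assumed learning rates, not the profile-learning algorithm proposed in the work.

```python
# Minimal sketch (not the paper's algorithm): a linear profile over terms and
# facet features, updated from both document judgments and facet feedback.
from collections import defaultdict

class FacetedProfile:
    def __init__(self, doc_lr=0.5, facet_boost=1.0, threshold=0.0):
        self.w = defaultdict(float)     # weights over terms and facet features
        self.doc_lr = doc_lr            # learning rate for document feedback
        self.facet_boost = facet_boost  # weight added per facet judgment
        self.threshold = threshold

    def update_from_document(self, features, relevant):
        """features: dict of feature -> value for a judged document."""
        sign = 1.0 if relevant else -1.0
        for f, v in features.items():
            self.w[f] += sign * self.doc_lr * v

    def update_from_facet_feedback(self, facet, value, positive=True):
        """Direct prior-knowledge feedback, e.g. ('Language', 'Spanish')."""
        self.w[f"{facet}:{value}"] += self.facet_boost if positive else -self.facet_boost

    def accept(self, features):
        score = sum(self.w.get(f, 0.0) * v for f, v in features.items())
        return score > self.threshold
```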
Large clusters and supercomputers are simulated to aid in design. Many devices, such as hard drives, are slow to simulate. Our approach is to use a genetic algorithm to fit parameters for an analytical model of a device. Fitting focuses on aggregate accuracy rather than request-level accuracy since individual request times are irrelevant in large simulations. The model is fitted to traces from a physical device or a known device-accurate model. This is done once, offline, before running the simulation. Execution of the model is fast, since it only requires a modest amount of floating point math and no event queuing. Only a few floating-point numbers are needed for state. Compared to an event-driven model, this trades a little accuracy for a large gain in performance.
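As a concrete (and heavily simplified) illustration of the fitting loop described above, the sketch below fits the parameters of an assumed analytic service-time model, t(request) = seek + size/bandwidth, to a trace using a basic genetic algorithm, scoring candidates on aggregate statistics rather than per-request error. The model form, parameter ranges, and GA settings are all assumptions, not the authors' model.

```python
# Illustrative GA fit of an assumed analytic device model to a trace,
# scoring on aggregate accuracy (total and mean service time).
import random

def model_time(params, size):
    seek, bandwidth = params
    return seek + size / bandwidth

def aggregate_error(params, trace):
    # trace: list of (request_size_bytes, measured_time_seconds)
    pred = [model_time(params, s) for s, _ in trace]
    meas = [t for _, t in trace]
    return abs(sum(pred) - sum(meas)) + abs(sum(pred) / len(pred) - sum(meas) / len(meas))

def fit_ga(trace, pop_size=48, generations=200, seed=0):
    rng = random.Random(seed)
    pop = [(rng.uniform(1e-4, 2e-2), rng.uniform(1e7, 2e8)) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda p: aggregate_error(p, trace))
        parents = pop[:pop_size // 4]                          # selection
        children = []
        while len(children) < pop_size - len(parents):
            a, b = rng.sample(parents, 2)                      # crossover
            child = tuple((x + y) / 2 for x, y in zip(a, b))
            child = tuple(v * rng.uniform(0.9, 1.1) for v in child)  # mutation
            children.append(child)
        pop = parents + children
    return min(pop, key=lambda p: aggregate_error(p, trace))
```

Once fitted offline, evaluating `model_time` during a simulation costs only a few floating-point operations, which is the performance gain the abstract refers to.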
Multi-streaming events are of great interest to astrophysics because they are associated with the formation of large-scale structures (LSS) such as halos, filaments, and sheets. Until recently, these events were studied using only the scalar density field. In this talk, we present a new approach that takes the velocity field information into account in finding these multi-streaming events. Six different velocity-based feature extractors are defined, and their findings are compared to the results of a halo finder.
Gostor is an experimental platform for testing new file storage ideas for post-POSIX usage. Gostor provides greater flexibility for manipulating the data within a file, including inserting and deleting data anywhere in the file, creating and removing holes in the data, etc. Each modification of the data creates a new file. Gostor doesn't implement any way of organizing files into hierarchical structures or mapping them to strings. Thus Gostor can be used to implement standard file systems as well as to experiment with new ways of storing and accessing users' data.
In an information-driven world, the ability to capture and store data in real time is of the utmost importance. The scope and intent of such data capture, however, varies widely. Individuals record television programs for later viewing, governments maintain vast sensor networks to warn against calamity, scientists conduct experiments requiring immense data collection, and automated monitoring tools supervise a host of processes which human hands rarely touch. All such tasks have the same basic requirement -- guaranteed capture of streaming real-time data -- but with greatly differing parameters. Our ability to process and interpret data has grown faster than our ability to store and manage it, which has led to the curious condition of being able to recognize the importance of data without being able to store it, and hence being unable to later profit by it.
Traditional storage mechanisms are not well suited to manage this type of data and we have developed a large-scale ring buffer storage architecture to handle it. Our system is well suited to both large and small data elements, has a native indexing mechanism, and can maintain reliability in the face of hardware failure. Strong performance guarantees can be made and kept, and quality of service requirements maintained.
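As a toy analogue of the architecture described above, the sketch below shows the core ring-buffer behavior: a fixed capacity, overwrite-oldest policy with time-range lookups. It is an in-memory simplification under the assumption that records arrive in timestamp order; the real system's disk layout, native indexing, replication, and quality-of-service machinery are out of scope here.

```python
# In-memory toy analogue of a ring-buffer store: fixed capacity, oldest data
# overwritten first, time-range queries answered by binary search over the
# time-ordered contents (assumes timestamps are appended in order).
import bisect

class RingStore:
    def __init__(self, capacity):
        self.capacity = capacity
        self.records = [None] * capacity   # slots hold (timestamp, payload)
        self.head = 0                      # next slot to overwrite
        self.count = 0

    def append(self, timestamp, payload):
        self.records[self.head] = (timestamp, payload)
        self.head = (self.head + 1) % self.capacity
        self.count = min(self.count + 1, self.capacity)

    def query(self, t_start, t_end):
        # Reconstruct oldest-to-newest order: once full, the oldest record
        # sits at `head`.
        start = self.head if self.count == self.capacity else 0
        ordered = [self.records[(start + i) % self.capacity] for i in range(self.count)]
        times = [t for t, _ in ordered]
        lo = bisect.bisect_left(times, t_start)
        hi = bisect.bisect_right(times, t_end)
        return ordered[lo:hi]
```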
Large-scale scientific data is collected through experiment and produced by simulation. This data in turn is commonly interrogated using ad-hoc analysis queries and visualized with differing interactivity requirements. At extreme scale this data can be too large to store multiple copies of, or may be easily accessible for only a short period of time. In either case, multiple consumers must be able to interact with the data. Unfortunately, as the number of concurrent users accessing storage media increases, throughput can decrease significantly. This performance degradation is due to the induced random access pattern that results from uncoordinated I/O streams. One common approach to this problem is to use collective I/O; unfortunately, this is difficult to do for many independent computations. We are investigating a data-centric, push-based approach inspired by work within the database community that has achieved an order-of-magnitude increase in throughput for concurrent query processing. A push-based approach to query processing uses a single data stream originating from storage media rather than allowing multiple requests to compete, and utilizes work- and data-sharing opportunities exposed through query semantics. Many challenges exist in this work, notably supporting a distributed execution environment, providing a mix of access performance requirements (throughput vs. latency), and supporting multiple data models, including relational and array-based.
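The core push-based idea admits a very small sketch: one sequential pass over storage pushes each chunk to every registered query, so concurrent consumers share a single I/O stream instead of issuing competing random reads. This is a deliberately simplified illustration of the concept, not the system under investigation; the consumer interface is an assumption.

```python
# Sketch of a shared, push-based scan: a single sequential stream feeds all
# registered consumers (queries), sharing work and I/O across them.
class SharedScan:
    def __init__(self):
        self.consumers = []            # callables: consume(chunk)

    def register(self, consumer):
        self.consumers.append(consumer)

    def run(self, chunks):
        for chunk in chunks:           # one sequential pass over storage
            for consume in self.consumers:
                consume(chunk)         # push the same chunk to every query

# Usage: two aggregate "queries" sharing one scan.
if __name__ == "__main__":
    totals = {"sum": 0, "max": float("-inf")}
    scan = SharedScan()
    scan.register(lambda c: totals.__setitem__("sum", totals["sum"] + sum(c)))
    scan.register(lambda c: totals.__setitem__("max", max(totals["max"], max(c))))
    scan.run([[1, 5, 2], [9, 3], [4, 4]])
    print(totals)                      # {'sum': 28, 'max': 9}
```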
File system metadata management has become a bottleneck for many data-intensive applications that rely on high-performance file systems. Part of the bottleneck is due to the limitations of an almost 50-year-old interface standard with metadata abstractions that were designed at a time when high-end file systems managed less than 100 MB. Today's high-performance file systems store 7 to 9 orders of magnitude more data, resulting in numbers of data items for which these metadata abstractions are inadequate, such as directory hierarchies unable to handle complex relationships among data.
Users of file systems have attempted to work around these inadequacies by moving application-specific metadata management to relational databases to make metadata searchable. Splitting file system metadata management into two separate systems introduces inefficiencies and systems management problems.
To address this problem, we propose QMDS: a file system metadata management service that integrates all file system metadata and uses a graph data model with attributes on nodes and edges. Our service uses a query language interface for file identification and attribute retrieval. We present our metadata management service design and architecture and study its performance using a text analysis benchmark application. Results from our QMDS prototype show the effectiveness of this approach. Compared to the use of a file system and relational database, the QMDS prototype shows superior performance for both ingest and query workloads.
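To make the data model concrete, here is a toy illustration of file-system metadata as a graph with attributes on both nodes and edges, queried programmatically. It illustrates only the modeling idea; the node/edge schema and the query helpers are assumptions, and QMDS's actual query language is not reproduced here.

```python
# Toy attributed-graph metadata model (illustration of the data model only,
# not QMDS): attributes live on both nodes (files) and edges (relationships).
class MetadataGraph:
    def __init__(self):
        self.nodes = {}                # node_id -> attribute dict
        self.edges = []                # (src, dst, attribute dict)

    def add_node(self, node_id, **attrs):
        self.nodes[node_id] = attrs

    def add_edge(self, src, dst, **attrs):
        self.edges.append((src, dst, attrs))

    def find_nodes(self, **conditions):
        return [n for n, a in self.nodes.items()
                if all(a.get(k) == v for k, v in conditions.items())]

    def neighbors(self, node_id, **edge_conditions):
        return [dst for src, dst, a in self.edges
                if src == node_id
                and all(a.get(k) == v for k, v in edge_conditions.items())]

# e.g. "which files were derived from input.nc?"
g = MetadataGraph()
g.add_node("input.nc", type="dataset", format="NetCDF")
g.add_node("mean.png", type="plot")
g.add_edge("input.nc", "mean.png", relation="derived-from", tool="plotter")
print(g.neighbors("input.nc", relation="derived-from"))   # ['mean.png']
```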
Real-time systems and applications are becoming increasingly complex and often comprise multiple communicating tasks. The management of the individual tasks is well understood, but the interaction of communicating tasks with different timing characteristics is less well understood. We discuss several representative inter-task communication flows via reserved memory buffers (possibly interconnected via a real-time network) and present RAD-Flows, a model for managing these interactions. We provide proofs and simulation results demonstrating the correctness and effectiveness of RAD-Flows, allowing system designers to determine the amount of memory required based upon the characteristics of the interacting tasks and to guarantee real-time operation of the system as a whole.
Parity-based RAID techniques improve data reliability and availability, but at a significant performance cost, especially for small writes. Flash-based solid state drives (SSDs) provide faster random I/O and use less power than hard drives, but are too expensive to substitute for all of the drives in most large-scale storage systems. We present RAID4S, a cost-effective, high-performance technique for improving RAID small-write performance using SSDs for parity storage in a disk-based RAID array. Our results show that a 4HDD+1SSD RAID4S array achieves throughput 3.3X better than a similar 4+1 RAID4 array and 1.75X better than a 4+1 RAID5 array on small-write-intensive workloads. RAID4S has no performance penalty on disk workloads consisting of up to 90% reads, and its benefits are enhanced by the effects of file systems and caches.
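The sketch below shows why placing parity on the SSD targets the small-write penalty: a RAID-4-style small write is a read-modify-write involving the old data block, the old parity, and both new values, and in RAID4S the two parity I/Os land on the fast device. Devices are stubbed and the layout is simplified; this is an illustration of the standard parity-update math, not the RAID4S implementation.

```python
# Simplified RAID-4-style small write with the parity device on an SSD.
def xor_blocks(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

class Disk:                                # stub device
    def __init__(self, blocks=8, block_size=16):
        self.blocks = [bytes(block_size) for _ in range(blocks)]
    def read(self, i):
        return self.blocks[i]
    def write(self, i, data):
        self.blocks[i] = data

def small_write(data_hdds, parity_ssd, stripe, disk, new_block):
    old_data = data_hdds[disk].read(stripe)       # 1 HDD read
    old_parity = parity_ssd.read(stripe)          # 1 SSD read  (fast)
    new_parity = xor_blocks(xor_blocks(old_parity, old_data), new_block)
    data_hdds[disk].write(stripe, new_block)      # 1 HDD write
    parity_ssd.write(stripe, new_parity)          # 1 SSD write (fast)

hdds = [Disk() for _ in range(4)]
ssd = Disk()
small_write(hdds, ssd, stripe=3, disk=1, new_block=bytes(range(16)))
```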
This talk will provide an overview of the SCOOP project, whose broad theme is to leverage people as processing units to achieve some global objective. A primary focus of SCOOP is to optimize the usage of human computation in order to use as few resources (e.g., time, money) as possible while maximizing the quality of the final output. Our approach is based on the principle of declarative languages that has been applied very successfully in database systems. The talk will describe the main research thrusts in SCOOP and some of our recent accomplishments.
The “HIV database and analysis platform” has been maintained at Los Alamos for 22 years and has grown to be an internationally renowned resource for HIV data analysis. It is in the process of expanding to include hepatitis C virus and hemorrhagic fever viruses; the eventual goal is to make it a universal viral resource. This expansion necessitates much greater reliance on external data and information sources. These resources rarely use the same identifiers and frequently contain annotator- and submitter-specific language. While efforts have been underway for some time to standardize and cross-link biological information on the web, there is still a long way to go. I will describe the current status of the “Viral data analysis platform”, the (semantic) problems we have grappled with, and the local and global efforts at amelioration.
Gaussian process (GP) models provide non-parametric methods to fit continuous curves observed with noise.
Motivated by our investigation of dark energy, we develop a GP-based inverse method that allows for the direct estimation of the derivative of a curve. In principle, a GP model may be fit to the data directly, with the derivatives obtained by means of differentiation of the correlation function. However, it is known that this approach can be inadequate due to loss of information when differentiating. We present a new method of obtaining the derivative process by viewing this procedure as an inverse problem. We use the properties of a GP to obtain a computationally efficient fit. We illustrate our method with simulated data as well as apply it to our cosmological application.
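For background on why derivatives can be obtained from a GP at all (the part of the argument the abstract assumes), the standard identities below show that derivatives of a GP are jointly Gaussian with the process, with covariances given by derivatives of the kernel. These are textbook facts about GPs under a differentiable kernel; they are not the authors' inverse formulation.

```latex
% Standard GP derivative identities (assumes a twice-differentiable kernel k).
\[
y(x) \sim \mathcal{GP}\!\big(0,\, k(x,x')\big)
\;\;\Longrightarrow\;\;
\operatorname{Cov}\!\big(y(x),\, y'(x')\big) = \frac{\partial k(x,x')}{\partial x'},
\qquad
\operatorname{Cov}\!\big(y'(x),\, y'(x')\big) = \frac{\partial^2 k(x,x')}{\partial x\,\partial x'} .
\]
% Posterior mean of the derivative given noisy observations \mathbf{y} at inputs X:
\[
\mathbb{E}\big[y'(x_*) \mid \mathbf{y}\big]
  = \partial_{x_*} k(x_*, X)\,\big[K(X,X) + \sigma^2 I\big]^{-1}\,\mathbf{y}.
\]
```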
The talk will provide an unclassified overview of the Los Alamos National Laboratory, its people, programs, and capabilities. The talk touches on much of the diverse science going on at the laboratory in areas such as materials, biology, cosmology, energy, and climate. A drill-down into the areas of information science, computer science, and high-performance computing is also provided.
This talk will describe the anticipated DOE Exascale initiative, a prospective very large extreme-scale supercomputing program being formulated by the DOE Office of Science and the DOE NNSA. Motivations for the program, as well as how the program may proceed, will be presented. Anticipated Exascale machine dimensions will be provided as well. The costs of providing scalable file systems and I/O for these future very large supercomputers will be examined in detail.
Date: 07/30/2008 2:00 pm - 3:00 pm
CNLS Conference Room (TA-3, Bldg 1690)
Learning Permutations with Exponential Weights
David Helmbold
UC Santa Cruz
We consider learning permutations in the on-line setting. In this setting the algorithm's goal is to have small regret, i.e., the cumulative loss of the algorithm on any sequence of trials should not be too much greater than the loss of the best permutation in hindsight. We present a new algorithm that maintains a doubly stochastic weight matrix representing its uncertainty about the best permutation and makes predictions by decomposing this weight matrix into a convex combination of permutations. After receiving feedback information, the algorithm updates its weight matrix for the next trial. A new insight allows us to prove an optimal (up to small constant factors) bound on the regret of our algorithm despite the fact that there is no closed form for the re-balanced weight matrix. This regret bound is significantly better than the bounds on either Kalai and Vempala's more computationally efficient "Follow the Perturbed Leader" algorithm or the straightforward but expensive method that explicitly tracks each permutation.
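A minimal sketch of the scheme described above, under stated simplifications: the weight matrix receives a multiplicative exponential update from a per-trial loss matrix and is then re-balanced toward doubly stochastic with Sinkhorn iterations (the "re-balanced weight matrix" mentioned in the abstract), while the prediction step here uses a maximum-weight matching instead of sampling from a full Birkhoff decomposition of the matrix. The learning rate and iteration counts are illustrative assumptions.

```python
# Sketch of exponential-weights learning over permutations (simplified).
import numpy as np
from scipy.optimize import linear_sum_assignment

def sinkhorn(W, iters=50):
    """Re-balance a positive matrix toward doubly stochastic."""
    for _ in range(iters):
        W = W / W.sum(axis=1, keepdims=True)   # normalize rows
        W = W / W.sum(axis=0, keepdims=True)   # normalize columns
    return W

def predict(W):
    """Pick a permutation from the weight matrix. A full implementation would
    sample from a Birkhoff (convex-combination) decomposition of W; here we
    simply take a maximum-weight matching."""
    _, cols = linear_sum_assignment(-W)
    return cols                                # cols[i] = item placed in slot i

def update(W, loss, eta=0.5):
    """loss[i, j] = loss incurred for mapping i to j on this trial."""
    return sinkhorn(W * np.exp(-eta * loss))

n = 5
W = np.full((n, n), 1.0 / n)                   # uniform doubly stochastic start
loss = np.random.rand(n, n)                    # stand-in for one trial's feedback
perm = predict(W)
W = update(W, loss)
```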