Microwork in cybercafés is a promising tool for poverty alleviation. For those who cannot afford a computer, cybercafés can serve both as a simple payment channel and as a platform for work. However, there are open questions: are workers interested in working in cybercafés, are cybercafé owners willing to host such a setup, and are workers skilled enough to earn an acceptable pay rate? We designed experiments in cybercafés in India and Kenya to investigate these issues. We also investigated whether computers make workers more productive than mobile platforms. In surveys, we found that 99% of the users wanted to continue with the experiment in the cybercafé, and 8 of 9 cybercafé owners expressed interest in hosting the experiment. User typing speed was adequate to earn a pay rate comparable to their existing wages, and the fastest workers were approximately twice as productive on a computer platform.
Hadoop has become the de facto platform for large-scale data analysis in commercial applications, and increasingly so in scientific applications. However, Hadoop's byte-stream data model causes inefficiencies when used to process scientific data that is commonly stored in highly structured, array-based binary file formats, resulting in limited scalability of Hadoop applications in science. We introduce SciHadoop, a Hadoop plugin allowing scientists to specify logical queries over array-based data models. SciHadoop executes queries as map/reduce programs defined over the logical data model. We describe the implementation of a SciHadoop prototype for NetCDF data sets and quantify the performance of five separate optimizations that address the following goals for several representative aggregate queries: reduce total data transfers, reduce remote reads, and reduce unnecessary reads. Two optimizations allow holistic aggregate queries to be evaluated opportunistically during the map phase; two additional optimizations intelligently partition input data to increase read locality; and one optimization avoids block scans by examining the data dependencies of an executing query to prune input partitions. Experiments involving a holistic function show run-time improvements of up to 8x, with drastic reductions in I/O, both locally and over the network.
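The abstract describes queries posed over a logical, array-based data model rather than over Hadoop's byte streams; a key step in such a system is splitting the array's logical coordinate space into sub-array extents that map tasks can process. The sketch below is illustrative only, not the SciHadoop implementation; the chunk sizes and the `(corner, shape)` extent representation are assumptions.

```python
# Illustrative sketch (not SciHadoop itself): partition the logical coordinate
# space of an n-dimensional array into sub-array extents, so each map task
# evaluates the query over one extent instead of a raw byte range.
from itertools import product

def partition_logical_space(dims, chunk):
    """Split an array of shape `dims` into extents of at most `chunk` per axis.
    Each extent is (corner, shape) in array index space."""
    starts = [range(0, d, c) for d, c in zip(dims, chunk)]
    extents = []
    for corner in product(*starts):
        shape = tuple(min(c, d - s) for s, d, c in zip(corner, dims, chunk))
        extents.append((corner, shape))
    return extents

if __name__ == "__main__":
    # e.g. a 360x180x1460 (lon x lat x time) NetCDF variable split into
    # 90x90x365 extents; each extent would feed one map task.
    extents = partition_logical_space((360, 180, 1460), (90, 90, 365))
    print(len(extents), "extents; first:", extents[0])
```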
Computer-mediated human micro-labor markets allow human effort to be treated as a programmatic function call. We can characterize these platforms as Human Processing Units (HPUs). HPU computation can be more accurate than complex CPU-based algorithms on some important computer vision tasks. We also argue that HPU computation can be cheaper than state-of-the-art CPU-based computation. I'll give some examples of simple computer vision tasks that we have evaluated, and speculate on whether a finite computer vision instruction set is possible. Such an instruction set would allow most computer vision problems to be coded from the base instructions, and the instructions themselves would be made robust with the help of human computation.
We describe a framework for rapidly prototyping applications which require intelligent visual processing, but for which reliable algorithms do not yet exist, or for which engineering those algorithms is too costly. The framework, CrowdSight, leverages the power of crowdsourcing to offload intelligent processing to humans, and enables new applications to be built quickly and cheaply, affording system builders the opportunity to validate a concept before committing significant time or capital. Our service accepts requests from users either via email or simple mobile applications, and handles all the communication with a backend human computation platform. We build redundant requests and data aggregation into the system, freeing the user from managing these requirements. We validate our framework by building several test applications and verifying that prototypes can be built more easily and quickly than would be the case without the framework.
We introduce a new method for tuning the physical design of replicated databases. Our method installs a different (divergent) index configuration to each replica, thus specializing replicas for different subsets of the database workload. We analyze the space of divergent designs and show that there is a tension between the specialization of each replica and the ability to load-balance the database workload across different replicas. Based on our analysis, we develop an algorithm to compute good divergent designs that can balance this trade-off. Experimental results demonstrate the efficacy of our approach.
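The abstract does not spell out the tuning algorithm, so the following is only a hypothetical greedy sketch of the trade-off it describes: routing workload statements to replicas so each replica can specialize its index set, while refusing assignments that would skew per-replica load too far. The cost model, the `max_skew` bound, and the scoring heuristic are all assumptions.

```python
# Hypothetical greedy sketch of a divergent-design assignment (not the paper's
# algorithm): each replica accumulates the indexes needed by the statements
# routed to it, subject to a load-balance constraint.
def divergent_assignment(statements, n_replicas, max_skew=1.25):
    """statements: list of (cost, frozenset_of_required_indexes)."""
    replicas = [{"load": 0.0, "indexes": set()} for _ in range(n_replicas)]
    avg = sum(c for c, _ in statements) / n_replicas
    for cost, idxs in sorted(statements, key=lambda s: -s[0]):
        # Prefer replicas that already hold most of the needed indexes
        # (specialization), but skip replicas that are already overloaded.
        ok = [r for r in replicas if r["load"] + cost <= max_skew * avg] or replicas
        best = max(ok, key=lambda r: len(r["indexes"] & idxs) - r["load"] / avg)
        best["load"] += cost
        best["indexes"] |= idxs
    return replicas

# Usage: three query classes spread over two replicas.
print(divergent_assignment(
    [(10.0, frozenset({"idx_a"})), (8.0, frozenset({"idx_b"})),
     (6.0, frozenset({"idx_a", "idx_c"}))], n_replicas=2))
```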
Existing adaptive filtering systems learn user profiles based on users' relevance judgments on documents. In some cases, users have some prior knowledge about what features are important for a document to be relevant. For example, a Spanish speaker may only want news written in Spanish, and thus a relevant document should contain the feature "Language: Spanish"; a researcher working on HIV knows an article with the medical subject "MeSH: AIDS" is very likely to be interesting to him/her.
Semi-structured documents with rich faceted metadata are increasingly prevalent over the Internet. Motivated by the commonly used faceted search interface in e-commerce, we study whether users' prior knowledge about faceted features could be exploited for filtering semi-structured documents. We envision two faceted feedback solicitation mechanisms, and propose a novel user profile-learning algorithm that can incorporate user feedback on features. To evaluate the proposed work, we use two data sets from the TREC filtering track, and conduct a user study on Amazon Mechanical Turk. Our experimental results show that user feedback on faceted features is useful for filtering. The new user profile learning algorithm can effectively learn from user feedback on faceted features and performs better than several other methods adapted from the feature-based feedback techniques proposed for retrieval and text classification tasks in previous work.
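To make the idea of "feedback on faceted features" concrete, here is a minimal sketch of a linear filtering profile that mixes Rocchio-style updates from document judgments with direct user feedback on facet values such as ("Language", "Spanish"). This is a generic illustration under assumed learning rates, not the profile-learning algorithm proposed in the work.

```python
# Minimal sketch (not the paper's algorithm): a linear profile over terms and
# facet features, updated from both document judgments and facet feedback.
from collections import defaultdict

class FacetedProfile:
    def __init__(self, doc_lr=0.5, facet_boost=1.0, threshold=0.0):
        self.w = defaultdict(float)     # weights over terms and facet features
        self.doc_lr = doc_lr            # learning rate for document feedback
        self.facet_boost = facet_boost  # weight added per facet judgment
        self.threshold = threshold

    def update_from_document(self, features, relevant):
        """features: dict of feature -> value for a judged document."""
        sign = 1.0 if relevant else -1.0
        for f, v in features.items():
            self.w[f] += sign * self.doc_lr * v

    def update_from_facet_feedback(self, facet, value, positive=True):
        """Direct prior-knowledge feedback, e.g. ('Language', 'Spanish')."""
        self.w[f"{facet}:{value}"] += self.facet_boost if positive else -self.facet_boost

    def accept(self, features):
        score = sum(self.w.get(f, 0.0) * v for f, v in features.items())
        return score > self.threshold
```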
Large clusters and supercomputers are simulated to aid in design. Many devices, such as hard drives, are slow to simulate. Our approach is to use a genetic algorithm to fit parameters for an analytical model of a device. Fitting focuses on aggregate accuracy rather than request-level accuracy since individual request times are irrelevant in large simulations. The model is fitted to traces from a physical device or a known device-accurate model. This is done once, offline, before running the simulation. Execution of the model is fast, since it only requires a modest amount of floating point math and no event queuing. Only a few floating-point numbers are needed for state. Compared to an event-driven model, this trades a little accuracy for a large gain in performance.
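As a concrete (and heavily simplified) illustration of the fitting loop described above, the sketch below fits the parameters of an assumed analytic service-time model, t(request) = seek + size/bandwidth, to a trace using a basic genetic algorithm, scoring candidates on aggregate statistics rather than per-request error. The model form, parameter ranges, and GA settings are all assumptions, not the authors' model.

```python
# Illustrative GA fit of an assumed analytic device model to a trace,
# scoring on aggregate accuracy (total and mean service time).
import random

def model_time(params, size):
    seek, bandwidth = params
    return seek + size / bandwidth

def aggregate_error(params, trace):
    # trace: list of (request_size_bytes, measured_time_seconds)
    pred = [model_time(params, s) for s, _ in trace]
    meas = [t for _, t in trace]
    return abs(sum(pred) - sum(meas)) + abs(sum(pred) / len(pred) - sum(meas) / len(meas))

def fit_ga(trace, pop_size=48, generations=200, seed=0):
    rng = random.Random(seed)
    pop = [(rng.uniform(1e-4, 2e-2), rng.uniform(1e7, 2e8)) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda p: aggregate_error(p, trace))
        parents = pop[:pop_size // 4]                          # selection
        children = []
        while len(children) < pop_size - len(parents):
            a, b = rng.sample(parents, 2)                      # crossover
            child = tuple((x + y) / 2 for x, y in zip(a, b))
            child = tuple(v * rng.uniform(0.9, 1.1) for v in child)  # mutation
            children.append(child)
        pop = parents + children
    return min(pop, key=lambda p: aggregate_error(p, trace))
```

Once fitted offline, evaluating `model_time` during a simulation costs only a few floating-point operations, which is the performance gain the abstract refers to.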
Multi-streaming events are of great interest to astrophysics because they are associated with the formation of large-scale structures (LSS) such as halos, filaments, and sheets. Until recently, these events were studied using only the scalar density field. In this talk, we present a new approach that takes the velocity field information into account in finding these multi-streaming events. Six different velocity-based feature extractors are defined, and their findings are compared to the results of a halo finder.
Gostor is an experimental platform for testing new file storage ideas for post-POSIX usage. Gostor provides greater flexibility for manipulating the data within a file, including inserting and deleting data anywhere in the file, creating and removing holes in the data, etc. Each modification of the data creates a new file. Gostor doesn't implement any way of organizing files into hierarchical structures or mapping them to strings. Thus Gostor can be used to implement standard file systems as well as to experiment with new ways of storing and accessing users' data.
In an information-driven world, the ability to capture and store data in real time is of the utmost importance. The scope and intent of such data capture, however, varies widely. Individuals record television programs for later viewing, governments maintain vast sensor networks to warn against calamity, scientists conduct experiments requiring immense data collection, and automated monitoring tools supervise a host of processes which human hands rarely touch. All such tasks have the same basic requirement -- guaranteed capture of streaming real-time data -- but with greatly differing parameters. Our ability to process and interpret data has grown faster than our ability to store and manage it, which has led to the curious condition of being able to recognize the importance of data without being able to store it, and hence being unable to later profit by it.
Traditional storage mechanisms are not well suited to manage this type of data and we have developed a large-scale ring buffer storage architecture to handle it. Our system is well suited to both large and small data elements, has a native indexing mechanism, and can maintain reliability in the face of hardware failure. Strong performance guarantees can be made and kept, and quality of service requirements maintained.
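As a toy analogue of the architecture described above, the sketch below shows the core ring-buffer behavior: a fixed capacity, overwrite-oldest policy with time-range lookups. It is an in-memory simplification under the assumption that records arrive in timestamp order; the real system's disk layout, native indexing, replication, and quality-of-service machinery are out of scope here.

```python
# In-memory toy analogue of a ring-buffer store: fixed capacity, oldest data
# overwritten first, time-range queries answered by binary search over the
# time-ordered contents (assumes timestamps are appended in order).
import bisect

class RingStore:
    def __init__(self, capacity):
        self.capacity = capacity
        self.records = [None] * capacity   # slots hold (timestamp, payload)
        self.head = 0                      # next slot to overwrite
        self.count = 0

    def append(self, timestamp, payload):
        self.records[self.head] = (timestamp, payload)
        self.head = (self.head + 1) % self.capacity
        self.count = min(self.count + 1, self.capacity)

    def query(self, t_start, t_end):
        # Reconstruct oldest-to-newest order: once full, the oldest record
        # sits at `head`.
        start = self.head if self.count == self.capacity else 0
        ordered = [self.records[(start + i) % self.capacity] for i in range(self.count)]
        times = [t for t, _ in ordered]
        lo = bisect.bisect_left(times, t_start)
        hi = bisect.bisect_right(times, t_end)
        return ordered[lo:hi]
```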
Large-scale scientific data is collected through experiment and produced by simulation. This data in turn is commonly interrogated using ad-hoc analysis queries and visualized with differing interactivity requirements. At extreme scale this data can be too large to store multiple copies of, or may be easily accessible for only a short period of time. In either case, multiple consumers must be able to interact with the data. Unfortunately, as the number of concurrent users accessing storage media increases, throughput can decrease significantly. This performance degradation is due to the induced random access pattern that results from uncoordinated I/O streams. One common approach to this problem is to use collective I/O; unfortunately, this is difficult to do for many independent computations. We are investigating a data-centric, push-based approach inspired by work within the database community that has achieved an order-of-magnitude increase in throughput for concurrent query processing. A push-based approach to query processing uses a single data stream originating from storage media rather than allowing multiple requests to compete, and utilizes work- and data-sharing opportunities exposed through query semantics. Many challenges exist in this work, notably supporting a distributed execution environment, providing a mix of access performance requirements (throughput vs. latency), and supporting multiple data models, including relational and array-based.
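The core push-based idea admits a very small sketch: one sequential pass over storage pushes each chunk to every registered query, so concurrent consumers share a single I/O stream instead of issuing competing random reads. This is a deliberately simplified illustration of the concept, not the system under investigation; the consumer interface is an assumption.

```python
# Sketch of a shared, push-based scan: a single sequential stream feeds all
# registered consumers (queries), sharing work and I/O across them.
class SharedScan:
    def __init__(self):
        self.consumers = []            # callables: consume(chunk)

    def register(self, consumer):
        self.consumers.append(consumer)

    def run(self, chunks):
        for chunk in chunks:           # one sequential pass over storage
            for consume in self.consumers:
                consume(chunk)         # push the same chunk to every query

# Usage: two aggregate "queries" sharing one scan.
if __name__ == "__main__":
    totals = {"sum": 0, "max": float("-inf")}
    scan = SharedScan()
    scan.register(lambda c: totals.__setitem__("sum", totals["sum"] + sum(c)))
    scan.register(lambda c: totals.__setitem__("max", max(totals["max"], max(c))))
    scan.run([[1, 5, 2], [9, 3], [4, 4]])
    print(totals)                      # {'sum': 28, 'max': 9}
```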
File system metadata management has become a bottleneck for many data-intensive applications that rely on high-performance file systems. Part of the bottleneck is due to the limitations of an almost 50-year-old interface standard with metadata abstractions that were designed at a time when high-end file systems managed less than 100 MB. Today's high-performance file systems store 7 to 9 orders of magnitude more data, resulting in numbers of data items for which these metadata abstractions are inadequate, such as directory hierarchies unable to handle complex relationships among data.
Users of file systems have attempted to work around these inadequacies by moving application-specific metadata management to relational databases to make metadata searchable. Splitting file system metadata management into two separate systems introduces inefficiencies and systems management problems.
To address this problem, we propose QMDS: a file system metadata management service that integrates all file system metadata and uses a graph data model with attributes on nodes and edges. Our service uses a query language interface for file identification and attribute retrieval. We present our metadata management service design and architecture and study its performance using a text analysis benchmark application. Results from our QMDS prototype show the effectiveness of this approach. Compared to the use of a file system and relational database, the QMDS prototype shows superior performance for both ingest and query workloads.
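To make the data model concrete, here is a toy illustration of file-system metadata as a graph with attributes on both nodes and edges, queried programmatically. It illustrates only the modeling idea; the node/edge schema and the query helpers are assumptions, and QMDS's actual query language is not reproduced here.

```python
# Toy attributed-graph metadata model (illustration of the data model only,
# not QMDS): attributes live on both nodes (files) and edges (relationships).
class MetadataGraph:
    def __init__(self):
        self.nodes = {}                # node_id -> attribute dict
        self.edges = []                # (src, dst, attribute dict)

    def add_node(self, node_id, **attrs):
        self.nodes[node_id] = attrs

    def add_edge(self, src, dst, **attrs):
        self.edges.append((src, dst, attrs))

    def find_nodes(self, **conditions):
        return [n for n, a in self.nodes.items()
                if all(a.get(k) == v for k, v in conditions.items())]

    def neighbors(self, node_id, **edge_conditions):
        return [dst for src, dst, a in self.edges
                if src == node_id
                and all(a.get(k) == v for k, v in edge_conditions.items())]

# e.g. "which files were derived from input.nc?"
g = MetadataGraph()
g.add_node("input.nc", type="dataset", format="NetCDF")
g.add_node("mean.png", type="plot")
g.add_edge("input.nc", "mean.png", relation="derived-from", tool="plotter")
print(g.neighbors("input.nc", relation="derived-from"))   # ['mean.png']
```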
Real-time systems and applications are becoming increasingly complex and often comprise multiple communicating tasks. The management of the individual tasks is well understood, but the interaction of communicating tasks with different timing characteristics is less well understood. We discuss several representative inter-task communication flows via reserved memory buffers (possibly interconnected via a real-time network) and present RAD-Flows, a model for managing these interactions. We provide proofs and simulation results demonstrating the correctness and effectiveness of RAD-Flows, allowing system designers to determine the amount of memory required based upon the characteristics of the interacting tasks and to guarantee real-time operation of the system as a whole.
Parity-based RAID techniques improve data reliability and availability, but at a significant performance cost, especially for small writes. Flash-based solid state drives (SSDs) provide faster random I/O and use less power than hard drives, but are too expensive to substitute for all of the drives in most large-scale storage systems. We present RAID4S, a cost-effective, high-performance technique for improving RAID small-write performance using SSDs for parity storage in a disk-based RAID array. Our results show that a 4HDD+1SSD RAID4S array achieves throughput 3.3X better than a similar 4+1 RAID4 array and 1.75X better than a 4+1 RAID5 array on small-write-intensive workloads. RAID4S has no performance penalty on disk workloads consisting of up to 90% reads, and its benefits are enhanced by the effects of file systems and caches.
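The sketch below shows why placing parity on the SSD targets the small-write penalty: a RAID-4-style small write is a read-modify-write involving the old data block, the old parity, and both new values, and in RAID4S the two parity I/Os land on the fast device. Devices are stubbed and the layout is simplified; this is an illustration of the standard parity-update math, not the RAID4S implementation.

```python
# Simplified RAID-4-style small write with the parity device on an SSD.
def xor_blocks(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

class Disk:                                # stub device
    def __init__(self, blocks=8, block_size=16):
        self.blocks = [bytes(block_size) for _ in range(blocks)]
    def read(self, i):
        return self.blocks[i]
    def write(self, i, data):
        self.blocks[i] = data

def small_write(data_hdds, parity_ssd, stripe, disk, new_block):
    old_data = data_hdds[disk].read(stripe)       # 1 HDD read
    old_parity = parity_ssd.read(stripe)          # 1 SSD read  (fast)
    new_parity = xor_blocks(xor_blocks(old_parity, old_data), new_block)
    data_hdds[disk].write(stripe, new_block)      # 1 HDD write
    parity_ssd.write(stripe, new_parity)          # 1 SSD write (fast)

hdds = [Disk() for _ in range(4)]
ssd = Disk()
small_write(hdds, ssd, stripe=3, disk=1, new_block=bytes(range(16)))
```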
This talk will provide an overview of the SCOOP project, whose broad theme is to leverage people as processing units to achieve some global objective. A primary focus of SCOOP is to optimize the usage of human computation in order to use as few resources (e.g., time, money) as possible while maximizing the quality of the final output. Our approach is based on the principle of declarative languages that has been applied very successfully in database systems. The talk will describe the main research thrusts in SCOOP and some of our recent accomplishments.
The “HIV database and analysis platform” has been maintained at Los Alamos for 22 years and has grown to be an internationally renowned resource for HIV data analysis. It is in the process of expanding to include hepatitis C virus and hemorrhagic fever viruses; the eventual goal is to make it a universal viral resource. This expansion necessitates much greater reliance on external data and information sources. These resources rarely use the same identifiers and frequently contain annotator- and submitter-specific language. While efforts have been underway for some time to standardize and cross-link biological information on the web, there is still a long way to go. I will describe the current status of the “Viral data analysis platform”, the (semantic) problems we have grappled with, and the local and global efforts at amelioration.
Gaussian process (GP) models provide non-parametric methods to fit continuous curves observed with noise.
Motivated by our investigation of dark energy, we develop a GP-based inverse method that allows for the direct estimation of the derivative of a curve. In principle, a GP model may be fit to the data directly, with the derivatives obtained by means of differentiation of the correlation function. However, it is known that this approach can be inadequate due to loss of information when differentiating. We present a new method of obtaining the derivative process by viewing this procedure as an inverse problem. We use the properties of a GP to obtain a computationally efficient fit. We illustrate our method with simulated data as well as apply it to our cosmological application.
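For background on why derivatives can be obtained from a GP at all (the part of the argument the abstract assumes), the standard identities below show that derivatives of a GP are jointly Gaussian with the process, with covariances given by derivatives of the kernel. These are textbook facts about GPs under a differentiable kernel; they are not the authors' inverse formulation.

```latex
% Standard GP derivative identities (assumes a twice-differentiable kernel k).
\[
y(x) \sim \mathcal{GP}\!\big(0,\, k(x,x')\big)
\;\;\Longrightarrow\;\;
\operatorname{Cov}\!\big(y(x),\, y'(x')\big) = \frac{\partial k(x,x')}{\partial x'},
\qquad
\operatorname{Cov}\!\big(y'(x),\, y'(x')\big) = \frac{\partial^2 k(x,x')}{\partial x\,\partial x'} .
\]
% Posterior mean of the derivative given noisy observations \mathbf{y} at inputs X:
\[
\mathbb{E}\big[y'(x_*) \mid \mathbf{y}\big]
  = \partial_{x_*} k(x_*, X)\,\big[K(X,X) + \sigma^2 I\big]^{-1}\,\mathbf{y}.
\]
```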
The talk will provide an unclassified overview of the Los Alamos National Laboratory, its people, programs, and capabilities. The talk touches on much of the diverse science going on at the laboratory in areas such as materials, biology, cosmology, energy, and climate. A drill-down into the areas of information science, computer science, and high-performance computing is also provided.
This talk will describe the anticipated DOE Exascale initiative, a prospective very large extreme-scale supercomputing program being formulated by the DOE Office of Science and the DOE NNSA. Motivations for the program, as well as how the program may proceed, will be presented. Anticipated Exascale machine dimensions will be provided as well. The costs of providing scalable file systems and I/O for these future very large supercomputers will be examined in detail.
Date: 07/30/2008 2:00 pm - 3:00 pm
CNLS Conference Room (TA-3, Bldg 1690)
Learning Permutations with Exponential Weights
David Helmbold
UC Santa Cruz
We consider learning permutations in the on-line setting. In this setting the algorithm's goal is to have small regret, i.e., the cumulative loss of the algorithm on any sequence of trials should not be too much greater than the loss of the best permutation in hindsight. We present a new algorithm that maintains a doubly stochastic weight matrix representing its uncertainty about the best permutation and makes predictions by decomposing this weight matrix into a convex combination of permutations. After receiving feedback information, the algorithm updates its weight matrix for the next trial. A new insight allows us to prove an optimal (up to small constant factors) bound on the regret of our algorithm despite the fact that there is no closed form for the re-balanced weight matrix. This regret bound is significantly better than the bounds on either Kalai and Vempala's more computationally efficient "Follow the Perturbed Leader" algorithm or the straightforward but expensive method that explicitly tracks each permutation.
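A minimal sketch of the scheme described above, under stated simplifications: the weight matrix receives a multiplicative exponential update from a per-trial loss matrix and is then re-balanced toward doubly stochastic with Sinkhorn iterations (the "re-balanced weight matrix" mentioned in the abstract), while the prediction step here uses a maximum-weight matching instead of sampling from a full Birkhoff decomposition of the matrix. The learning rate and iteration counts are illustrative assumptions.

```python
# Sketch of exponential-weights learning over permutations (simplified).
import numpy as np
from scipy.optimize import linear_sum_assignment

def sinkhorn(W, iters=50):
    """Re-balance a positive matrix toward doubly stochastic."""
    for _ in range(iters):
        W = W / W.sum(axis=1, keepdims=True)   # normalize rows
        W = W / W.sum(axis=0, keepdims=True)   # normalize columns
    return W

def predict(W):
    """Pick a permutation from the weight matrix. A full implementation would
    sample from a Birkhoff (convex-combination) decomposition of W; here we
    simply take a maximum-weight matching."""
    _, cols = linear_sum_assignment(-W)
    return cols                                # cols[i] = item placed in slot i

def update(W, loss, eta=0.5):
    """loss[i, j] = loss incurred for mapping i to j on this trial."""
    return sinkhorn(W * np.exp(-eta * loss))

n = 5
W = np.full((n, n), 1.0 / n)                   # uniform doubly stochastic start
loss = np.random.rand(n, n)                    # stand-in for one trial's feedback
perm = predict(W)
W = update(W, loss)
```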