Los Alamos National Laboratory
 
 

National Security Education Center
Information Science and Technology Institute

    2012 Papers

  • ACMDEV12: Exploring employment opportunities through microtasks via cybercafes
    James Davis, Rajan Vaish

    Microwork in cybercafés is a promising tool for poverty alleviation. For those who cannot afford a computer, cybercafés can serve both as a simple payment channel and as a platform to work. However, there are open questions: are workers interested in working in cybercafés, are cybercafé owners willing to host such a setup, and are workers skilled enough to earn an acceptable pay rate? We designed experiments in cybercafés in India and Kenya to investigate these issues, and also investigated whether computers make workers more productive than mobile platforms. In surveys, we found that 99% of users wanted to continue with the experiment in the cybercafé, and 8 of 9 cybercafé owners expressed interest in hosting it. User typing speed was adequate to earn a pay rate comparable to their existing wages, and the fastest workers were approximately twice as productive on a computer platform.


  • DRepl: Optimizing Access to Application Data for Analysis and Visualization
    Carlos Maltzahn, Latchesar Ionkov, Michael Lang

    Conference: IEEE International Parallel & Distributed Processing Symposium ; 2012-05-21 - 2012-05-25 ; Shanghai, China

    Until recently, most scientific applications produced data that was saved, analyzed, and visualized at a later time. In recent years, with the large increase in the amount of data and computational power available, there is demand for applications to support data access in situ, or close to the simulation, to provide application steering, analytics, and visualization. The data access patterns required for these activities are usually different from the data layout produced by the application. In most large HPC clusters, scientific data is stored in parallel file systems instead of locally on the cluster nodes. To increase reliability, the data is replicated, usually using one of the standard RAID schemes. Parallel file server nodes usually have more processing power than they need, so it is feasible to offload some of the data-intensive processing to them. The DRepl project replaces the standard methods of data replication with replicas having different layouts, optimized for the most commonly used access patterns. Replicas can be complete (i.e., any other replica can be reconstructed from them) or incomplete. DRepl consists of a language to describe the dataset and the necessary data layouts, and tools to create a user-space file server that provides the data and keeps it consistent and up to date in all optimized layouts.
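
    To make the layout idea above concrete, here is a minimal, hypothetical sketch (not DRepl's language or implementation) of keeping two replicas of the same 2-D dataset in different layouts and serving each access pattern from the replica that matches it; writes must update every layout to keep them consistent:

        # Toy illustration only: two "complete" replicas of one dataset,
        # one row-major and one column-major, kept consistent on writes.
        import numpy as np

        class DualLayoutReplica:
            def __init__(self, data):
                self.row_major = np.ascontiguousarray(data)  # optimized for row reads
                self.col_major = np.asfortranarray(data)     # optimized for column reads

            def write(self, i, j, value):
                # every optimized layout is kept up to date
                self.row_major[i, j] = value
                self.col_major[i, j] = value

            def read_row(self, i):
                return self.row_major[i, :]   # contiguous in this replica

            def read_col(self, j):
                return self.col_major[:, j]   # contiguous in this replica

        replica = DualLayoutReplica(np.arange(12.0).reshape(3, 4))
        replica.write(1, 2, 42.0)
        print(replica.read_row(1), replica.read_col(2))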


  • GHTC12: Exploring employment opportunities through microtasks via cybercafes
    James Davis, Rajan Vaish

    Microwork in cybercafés is a promising tool for poverty alleviation. For those who cannot afford a computer, cybercafés can serve both as a simple payment channel and as a platform to work. However, there are open questions: are workers interested in working in cybercafés, are cybercafé owners willing to host such a setup, and are workers skilled enough to earn an acceptable pay rate? We designed experiments in cybercafés in India and Kenya to investigate these issues, and also investigated whether computers make workers more productive than mobile platforms. In surveys, we found that 99% of users wanted to continue with the experiment in the cybercafé, and 8 of 9 cybercafé owners expressed interest in hosting it. User typing speed was adequate to earn a pay rate comparable to their existing wages, and the fastest workers were approximately twice as productive on a computer platform.


  • MSST2012—On the Role of Burst Buffers in Leadership-Class Storage Systems
    Carlos Maltzahn, Adam Crume, Gary Grider

    28th IEEE Conference on Massive Data Storage (MSST 2012)


  • MSST2012—Valmar: High-Bandwidth Real-Time Streaming Data Management
    Scott Brandt, David Bigelow, John Bent, HB Chen

    28th IEEE Conference on Massive Data Storage (MSST 2012)


  • PacificVis2012—Analyzing the Evolution of Large Scale Structures in the Universe with Velocity Based Methods
    Alex Pang, Eddy Chandra, Uliana Popov, James Ahrens

    PacificVis 2012 : IEEE Pacific Visualization Symposium


  • QMDS: A File System Metadata Management Service Supporting a Graph Data Model-based Query Language
    Carlos Maltzahn, Sasha Ames

    File system metadata management has become a bottleneck for many data-intensive applications that rely on high-performance file systems. Part of the bottleneck is due to the limitations of an almost 50-year-old interface standard with metadata abstractions that were designed at a time when high-end file systems managed less than 100 MB. Today's high-performance file systems store 7 to 9 orders of magnitude more data, resulting in numbers of data items for which these metadata abstractions are inadequate; directory hierarchies, for example, cannot express complex relationships among data. Users of file systems have attempted to work around these inadequacies by moving application-specific metadata management to relational databases to make metadata searchable. Splitting file system metadata management into two separate systems introduces inefficiencies and systems management problems. To address this problem, we propose QMDS: a file system metadata management service that integrates all file system metadata and uses a graph data model with attributes on nodes and edges. Our service uses a query language interface for file identification and attribute retrieval. We present our metadata management service design and architecture and study its performance using a text analysis benchmark application. Results from our QMDS prototype show the effectiveness of this approach. Compared to the use of a file system and relational database, the QMDS prototype shows superior performance for both ingest and query workloads.
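
    As a rough illustration of the graph data model described above (attributes on both nodes and edges), the following toy Python sketch stores file metadata as an attributed graph and answers simple attribute and relationship queries; it is hypothetical and does not reflect QMDS's actual query language or implementation:

        # Toy attributed-graph metadata store: nodes are files, edges are relationships,
        # both carry attribute dictionaries.
        class MetadataGraph:
            def __init__(self):
                self.nodes = {}    # node id -> {attr: value}
                self.edges = []    # (src, dst, {attr: value})

            def add_file(self, path, **attrs):
                self.nodes[path] = attrs

            def link(self, src, dst, **attrs):
                self.edges.append((src, dst, attrs))

            def find(self, **attrs):
                """Return node ids whose attributes match all given key/value pairs."""
                return [n for n, a in self.nodes.items()
                        if all(a.get(k) == v for k, v in attrs.items())]

            def neighbors(self, node, **edge_attrs):
                """Follow edges out of `node` whose edge attributes match."""
                return [dst for src, dst, a in self.edges
                        if src == node and all(a.get(k) == v for k, v in edge_attrs.items())]

        g = MetadataGraph()
        g.add_file("/data/run1.nc", format="netcdf", experiment="run1")
        g.add_file("/analysis/run1.png", format="png")
        g.link("/analysis/run1.png", "/data/run1.nc", relation="derived_from")
        print(g.find(format="netcdf"))
        print(g.neighbors("/analysis/run1.png", relation="derived_from"))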


  • RTAS12—Handling Small Data Elements and Real-Time Indexing Methods
    David Bigelow

    Paper Rejected


  • SIDR12: Efficient Structure-Aware Intelligent Data Routing in SciHadoop
    Neoklis Polyzotis, Scott Brandt, Carlos Maltzahn, Joe Buck, Noah Watkins, Greg Levin, Adam Crume, Kleoni Ioannidou
  • SIGIR12: Summarizing Highly Structured Documents for Effective Search Interaction
    Yi Zhang, Yunfei Chen, Lanbo Zhang

    As highly structured documents with rich metadata (such as products, movies, etc.) become increasingly prevalent, searching those documents has become an important IR problem. Unfortunately, existing work on document summarization, especially in the context of search, has mainly focused on unstructured documents, and little attention has been paid to highly structured documents. Due to the different characteristics of structured and unstructured documents, the ideal approaches for document summarization might be different. In this paper, we study the problem of summarizing highly structured documents in a search context. We propose a new summarization approach based on query-specific facet selection. Our approach aims to discover the important facets hidden behind a query using a machine learning approach, and summarizes retrieved documents based on those important facets. In addition, we propose to evaluate summarization approaches based on a utility function that measures how well the summaries assist users in interacting with the search results. Furthermore, we develop a game on Mechanical Turk to evaluate different summarization approaches. The experimental results show that the new summarization approach significantly outperforms two existing ones.


  • UCSC-SOE-12-07: DataMods: Programmable File System Services
    Scott Brandt, Carlos Maltzahn, Noah Watkins, Adam Manzanares

    2011 Papers

  • CIKM11—On Bias Problem in Relevance Feedback
    Yi Zhang, Lanbo Zhang

    Relevance feedback is an effective approach to improving retrieval quality over the initial query. Typical relevance feedback methods usually select top-ranked documents for relevance judgments; query expansion or model updating is then carried out based on the feedback documents. However, the number of feedback documents is usually limited due to expensive human labeling. Thus, relevant documents in the feedback set are hardly representative of all relevant documents, and the feedback set is actually biased. As a result, the performance of relevance feedback suffers. In this paper, we first show how and where the bias problem exists through experiments. Then we study how the bias can be reduced by utilizing unlabeled documents. After analyzing the usefulness of a document to relevance feedback, we propose an approach that extends the feedback set with carefully selected unlabeled documents chosen by heuristics. Our experimental results show that the extended feedback set has less bias than the original feedback set and that better performance can be achieved when the extended feedback set is used for relevance feedback.



  • DRepl: Optimizing Access to Application Data for Analysis and Visualization
    Carlos Maltzahn, Michael Lang, Latchesar Ionkov

    Until recently, most scientific applications produced data that was saved, analyzed, and visualized at a later time. In recent years, with the large increase in the amount of data and computational power available, there is demand for applications to support data access in situ, or close to the simulation, to provide application steering, analytics, and visualization. The data access patterns required for these activities are usually different from the data layout produced by the application. In most large HPC clusters, scientific data is stored in parallel file systems instead of locally on the cluster nodes. To increase reliability, the data is replicated, usually using one of the standard RAID schemes. Parallel file server nodes usually have more processing power than they need, so it is feasible to offload some of the data-intensive processing to them. The DRepl project replaces the standard methods of data replication with replicas having different layouts, optimized for the most commonly used access patterns. Replicas can be complete (i.e., any other replica can be reconstructed from them) or incomplete. DRepl consists of a language to describe the dataset and the necessary data layouts, and tools to create a user-space file server that provides the data and keeps it consistent and up to date in all optimized layouts.


  • Gaussian process modeling of derivative curves
    Herbie Lee, Bruno Sanso, Tracy Holsclaw, David Higdon, Salman Habib, Katrin Heitmann, Ujjaini Alam
  • HCOMP'11—CrowdSight: Rapidly Prototyping Intelligent Visual Processing Apps
    James Davis, Mario Rodriguez, Reid Porter
  • ICIP2011—Eye Tracking Based Saliency for Automatic Content Aware Image Processing
    James Davis, Steve Scher

    Photography provides tangible and visceral mementos of important experiences. Recent research in content-aware image processing to automatically improve photos relies heavily on automatically identifying salient areas in images. While automatic saliency estimation has achieved estimable success, it will always face inherent challenges. Tracking the photographer's eyes allows a direct, passive means to estimate scene saliency. We show that saliency estimation is sometimes an ill-posed problem for automatic algorithms, made well-posed by the availability of recorded eye tracks. We instrument several content-aware image processing algorithms with eye-track-based saliency estimation, producing photos that accentuate the parts of the image originally viewed.


  • ICS2011—On the Role of NVRAM in Data Intensive HPC Architectures
    Maya Gokhale, Sasha Ames
  • NAS2011—QMDS: A File System Metadata Management Service Supporting a Graph Data Model-based Query Language
    Carlos Maltzahn, Sasha Ames

    File system metadata management has become a bottleneck for many data-intensive applications that rely on high-performance file systems. Part of the bottleneck is due to the limitations of an almost 50-year-old interface standard with metadata abstractions that were designed at a time when high-end file systems managed less than 100 MB. Today's high-performance file systems store 7 to 9 orders of magnitude more data, resulting in numbers of data items for which these metadata abstractions are inadequate; directory hierarchies, for example, cannot express complex relationships among data. Users of file systems have attempted to work around these inadequacies by moving application-specific metadata management to relational databases to make metadata searchable. Splitting file system metadata management into two separate systems introduces inefficiencies and systems management problems. To address this problem, we propose QMDS: a file system metadata management service that integrates all file system metadata and uses a graph data model with attributes on nodes and edges. Our service uses a query language interface for file identification and attribute retrieval. We present our metadata management service design and architecture and study its performance using a text analysis benchmark application. Results from our QMDS prototype show the effectiveness of this approach. Compared to the use of a file system and relational database, the QMDS prototype shows superior performance for both ingest and query workloads.


  • Nonparametric Reconstruction of the Dark Energy Equation of State from Diverse Data Sets
    Herbie Lee, Bruno Sanso, Tracy Holsclaw, David Higdon, Salman Habib, Katrin Heitmann, Ujjaini Alam

    The cause of the accelerated expansion of the Universe poses one of the most fundamental questions in physics today. In the absence of a compelling theory to explain the observations, a first task is to develop a robust phenomenology. If the acceleration is driven by some form of dark energy, then the phenomenology is determined by the dark energy equation of state w. A major aim of ongoing and upcoming cosmological surveys is to measure w and its time dependence at high accuracy. Since w(z) is not directly accessible to measurement, powerful reconstruction methods are needed to extract it reliably from observations. We have recently introduced a new reconstruction method for w(z) based on Gaussian process modeling. This method can capture nontrivial time-dependences in w(z) and, most importantly, it yields controlled and unbiased error estimates. In this paper we extend the method to include a diverse set of measurements: baryon acoustic oscillations, cosmic microwave background measurements, and supernova data. We analyze currently available datasets and present the resulting constraints on w(z), finding that current observations are in very good agreement with a cosmological constant. In addition we explore how well our method captures nontrivial behavior of w(z) by analyzing simulated data assuming high-quality observations from future surveys. We find that the baryon acoustic oscillation measurements by themselves already lead to remarkably good reconstruction results and that the combination of different high-quality probes allows us to reconstruct w(z) very reliably with small error bounds.
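
    For background on why w(z) is only indirectly accessible, the standard flat-FRW relations below (textbook results, not the paper's statistical model) show how w(z) enters the observables only through nested integrals:

        \[
          \frac{\rho_{\mathrm{DE}}(z)}{\rho_{\mathrm{DE}}(0)}
            = \exp\!\left( 3 \int_0^z \frac{1+w(z')}{1+z'}\, dz' \right),
        \]
        \[
          H^2(z) = H_0^2 \left[ \Omega_m (1+z)^3
                   + (1-\Omega_m)\, \frac{\rho_{\mathrm{DE}}(z)}{\rho_{\mathrm{DE}}(0)} \right],
          \qquad
          d_L(z) = (1+z)\, c \int_0^z \frac{dz'}{H(z')},
        \]

    and supernovae constrain the distance modulus \(\mu(z) = 5 \log_{10}\!\big(d_L(z)/10\,\mathrm{pc}\big)\), so any reconstruction of w(z) from data is necessarily a statistical inverse problem.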


  • Physical Review D: Nonparametric Reconstruction of the Dark Energy Equation of State from Diverse Datasets
    Herbie Lee, Bruno Sanso, Tracy Holsclaw
  • REDfISh - REsilient Dynamic dIstributed Scalable System Services for Exascale
    Latchesar Ionkov, Sean Blanchard, Hugh Greenberg, Michael Lang

    A dramatic change is needed for system services to address the challenges of exascale. Due to the constraints required to build an exascale-class system, it is imperative that system services be designed differently than they are today. Services need to be resilient, dynamic, distributed, and scalable. That is, they must respond to and recover from failures; be self-healing; recruit and relinquish helper cores based on demand; function without access to global system state, which is too large and too fluid for one process to contain; and scale arbitrarily by exploiting hierarchical domains of peers. To address these requirements for future system services, we present a DHCP replacement and compare it to existing DHCP. We show that dynamic allocation of services and the ability to absorb errors make our approach superior to standard services. We then describe a novel path to creating exascale-ready services by focusing on the key tenets of resilience, dynamic adaptation, fully distributed processes, and scalability.


  • RTAS11—On the Role of NVRAM in Data Intensive HPC Architectures
    Scott Brandt, Carlos Maltzahn, Kleoni Ioannidou, Roberto Pineiro
  • RTAS2012—HBDM: High-Bandwidth Real-Time Streaming Data Management
    Scott Brandt, David Bigelow, Meghan McClelland, HB Chen, John Bent, James Nunez

    Paper submission rejected.


  • SC11—SciHadoop: Array-based Query Processing in Hadoop
    Neoklis Polyzotis, Scott Brandt, Carlos Maltzahn, Kleoni Ioannidou, Jeff LeFevre, Noah Watkins, Joe Buck

    Hadoop has become the de-facto standard platform for large-scale analysis in commercial applications and increasingly also in scientific applications. However, applying Hadoop's byte stream data model to scientific data that is commonly stored according to highly structured, abstract data models causes a number of inefficiencies that significantly limit the scalability of Hadoop applications in science. In this paper we introduce SciHadoop, a modification of Hadoop which allows scientists to specify abstract queries using a logical, array-based data model and which executes these queries as map/reduce programs defined on the logical data model. We describe the implementation of a SciHadoop prototype and use it to quantify the performance of three levels of accumulative optimizations over the Hadoop baseline, where a NetCDF data set is managed in a default Hadoop configuration: the first optimization avoids remote reads by subdividing the input space of mappers on the logical level and instantiates mapper tasks with subqueries against the logical data model, the second optimization avoids full file scans by taking advantage of metadata available in the scientific data, and the third optimization further minimizes data transfers by pulling holistic functions (i.e., functions that cannot compute partial results) to mappers whenever possible.
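
    As a rough sketch of the first optimization mentioned above, the toy function below subdivides a logical, array-based query region into per-mapper subqueries rather than splitting the underlying byte stream; names and interfaces are illustrative, not SciHadoop's actual API:

        # Toy sketch: split a hyper-rectangular query on a logical array along its
        # first dimension so each mapper receives a subquery against the logical model.
        from typing import List, Tuple

        Region = List[Tuple[int, int]]   # per-dimension (start, stop) in the logical array

        def split_query(region: Region, num_mappers: int) -> List[Region]:
            start, stop = region[0]
            extent = stop - start
            chunk = -(-extent // num_mappers)          # ceiling division
            subqueries = []
            for lo in range(start, stop, chunk):
                hi = min(lo + chunk, stop)
                subqueries.append([(lo, hi)] + region[1:])
            return subqueries

        # e.g. a 1000 x 360 x 720 logical region handed to 4 mappers:
        print(split_query([(0, 1000), (0, 360), (0, 720)], 4))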


  • SciHadoop: Array-based Query Processing in Hadoop
    Neoklis Polyzotis, Scott Brandt, Carlos Maltzahn, Joe Buck, Noah Watkins, Jeff LeFevre, Kleoni Ioannidou

    Hadoop has become the de-facto standard platform for large-scale analysis in commercial applications and increasingly also in scientific applications. However, applying Hadoop's byte stream data model to scientific data that is commonly stored according to highly structured, abstract data models causes a number of inefficiencies that significantly limit the scalability of Hadoop applications in science. In this paper we introduce SciHadoop, a modification of Hadoop which allows scientists to specify abstract queries using a logical, array-based data model and which executes these queries as map/reduce programs defined on the logical data model. We describe the implementation of a SciHadoop prototype and use it to quantify the performance of three levels of accumulative optimizations over the Hadoop baseline, where a NetCDF data set is managed in a default Hadoop configuration: the first optimization avoids remote reads by subdividing the input space of mappers on the logical level and instantiates mapper tasks with subqueries against the logical data model, the second optimization avoids full file scans by taking advantage of metadata available in the scientific data, and the third optimization further minimizes data transfers by pulling holistic functions (i.e., functions that cannot compute partial results) to mappers whenever possible.


  • SIGGRAPH2011—Printing Reflectance Functions
    James Davis, Adam Crume, Steve Scher
  • SIGIR11: Filtering Semi-Structured Documents Based on Faceted Feedback
    Yi Zhang, Lanbo Zhang

    Existing adaptive filtering systems learn user profiles based on users' relevance judgments on documents. In some cases, users have some prior knowledge about what features are important for a document to be relevant. For example, a Spanish speaker may only want news written in Spanish, and thus a relevant document should contain the feature "Language: Spanish"; a researcher working on HIV knows an article with the medical subject "MeSH: AIDS" is very likely to be interesting to him/her.

    Semi-structured documents with rich faceted metadata are increasingly prevalent over the Internet. Motivated by the commonly used faceted search interface in e-commerce, we study whether users' prior knowledge about faceted features could be exploited for filtering semi-structured documents. We envision two faceted feedback solicitation mechanisms, and propose a novel user profile learning algorithm that can incorporate user feedback on features. To evaluate the proposed work, we use two data sets from the TREC filtering track, and conduct a user study on Amazon Mechanical Turk. Our experimental results show that user feedback on faceted features is useful for filtering. The new user profile learning algorithm can effectively learn from user feedback on faceted features and performs better than several other methods adapted from the feature-based feedback techniques proposed for retrieval and text classification tasks in previous work.


  • SIGMOD 2011—On-line Index Selection for Physical Database Tuning
    Neoklis Polyzotis, Karl Schnaitter

    SIGMOD Jim Gray Doctoral Dissertation Honorable Mention


  • SIGMOD 2012—Divergent Physical Design Tuning
    Neoklis Polyzotis, Jeff LeFevre, Kleoni Ioannidou
  • Technometrics: Gaussian Process Modeling of Derivative Curves
    Herbie Lee, Bruno Sanso, Tracy Holsclaw
  • UCSC-SOE-11-02: Gaussian Process Modeling of Derivative Curves
    Herbie Lee, Bruno Sanso, Tracy Holsclaw, Salman Habib, Ujjaini Alam, Katrin Heitmann, David Higdon

    Gaussian process (GP) models provide non-parametric methods to fit continuous curves observed with noise. In this paper, we develop a GP based inverse method that allows for the estimation of the derivative of a curve, avoiding direct estimation from the data. A GP model can be fit to the data directly, then the derivatives obtained by means of differentiation of the correlation function. However, it is known that this approach can be inadequate due to loss of information when differentiating. We present a new method of obtaining the derivative process by viewing this as an inverse problem. We use the properties of a GP to obtain a computationally efficient fit. We illustrate our method with simulated data as well as with an important cosmological application. We include a discussion on model comparison techniques for assessing the fit of this alternative method.
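
    For reference, the direct-differentiation baseline mentioned above rests on the standard fact that the derivative of a GP is again a GP, with covariances given by derivatives of the kernel; for the squared-exponential kernel (used here purely as an example, not necessarily the paper's choice):

        \[
          k(x,x') = \sigma^2 \exp\!\left(-\frac{(x-x')^2}{2\ell^2}\right), \qquad
          \operatorname{Cov}\!\big(f(x), f'(x')\big) = \frac{\partial k}{\partial x'}
            = \sigma^2\,\frac{x-x'}{\ell^2}\, e^{-\frac{(x-x')^2}{2\ell^2}},
        \]
        \[
          \operatorname{Cov}\!\big(f'(x), f'(x')\big) = \frac{\partial^2 k}{\partial x\,\partial x'}
            = \sigma^2 \left(\frac{1}{\ell^2} - \frac{(x-x')^2}{\ell^4}\right)
              e^{-\frac{(x-x')^2}{2\ell^2}}.
        \]

    The paper's inverse formulation is motivated by the information loss incurred when relying on these differentiated kernels directly.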


  • UCSC-SOE-11-08: RAD-FETCH: Modeling Prefetching for Hard Real-Time Tasks
    Scott Brandt, Carlos Maltzahn, Kleoni Ioannidou, Roberto Pineiro

    Real-time systems and applications are becoming increasingly complex and large, often requiring them to process more data than can fit in main memory. The management of individual tasks is well understood, but the interaction of communicating tasks with different timing characteristics is less well understood. We discuss how to model prefetching across a series of real-time tasks/components communicating flows via reserved memory buffers (possibly interconnected via a real-time network) and present RAD-FETCH, a model for characterizing and managing these interactions. We provide proofs demonstrating the correctness of RAD-FETCH, allowing system designers to determine the amount of memory required and latency bounds based upon the characteristics of the interacting real-time tasks and of the system as a whole.


  • UCSC-SOE-11-09: SNS: A Simple Model for Understanding Optimal Hard Real-Time Multiprocessor Scheduling
    Scott Brandt, Greg Levin, Ian Pye, Caitlin Sadowski

    We consider the problem of optimal real-time scheduling of periodic tasks for multiprocessors. A number of recent papers have used the notion of fluid scheduling to guarantee optimality and improve performance. In this paper, we examine exactly how fluid scheduling techniques overcome inherent problems faced by greedy scheduling algorithms such as EDF. Based on these foundations, we describe a simple and clear optimal scheduling algorithm which serves as a least common ancestor to other recent algorithms. We provide a survey of the various fluid scheduling algorithms in this novel context.


  • UCSC-SOE-11-11: Nonparametric Reconstruction of the Dark Energy Equation of State from Diverse Data Sets
    Herbie Lee, Bruno Sanso, Tracy Holsclaw, Ujjaini Alam, Katrin Heitmann, Salman Habib, David Higdon

    The cause of the accelerated expansion of the Universe poses one of the most fundamental questions in physics today. In the absence of a compelling theory to explain the observations, we first have to characterize the phenomenon. If we assume that the acceleration is caused by some form of dark energy, it can be described by its dark energy equation of state w. It is a major aim of ongoing and upcoming cosmological surveys to measure the dark energy equation of state and a possible time dependence at high accuracy. Since we cannot measure w(z) directly, we have to develop powerful reconstruction methods to extract w(z) reliably with controlled error bars. We have recently introduced a new reconstruction method for w(z) based on Gaussian process modeling. This method can capture non-trivial time dependencies in w(z) and, most importantly, yields controlled and unbiased error estimates. In this paper we extend the method to include a diverse set of measurements: baryon acoustic oscillations, cosmic microwave background measurements, and supernova data. We analyze currently available data and show constraints on w(z). We find that current observations are in very good agreement with a cosmological constant. In addition we explore how well our method captures nontrivial behavior of w(z) by analyzing simulated data assuming high-quality data from future surveys. We find that baryon acoustic oscillation measurements by themselves already lead to remarkably good reconstruction results and that the combination of different high-quality probes allows us to reconstruct w(z) very reliably with small error bounds.


  • USENIX FAST12—RAID4S: Supercharging RAID Small Writes with SSD
    Scott Brandt, Carlos Maltzahn, Rosie Wacha, John Bent
  • VIS2011: The Evolution of Multistreaming Events in the Formation of Large Scale Structures
    Alex Pang, Uliana Popov, Salman Habib, James Ahrens, Katrin Heitmann

    This paper describes the analysis and application of visualization techniques to identify, track, and characterize multistreaming events and flows in the evolution of the Universe. Multistreaming is associated with the formation of visually striking large scale structure (LSS) comprised of elements such as halos, filaments, and sheets which have been theoretically predicted and observed in cosmological surveys. Many aspects in LSS theory still remain to be understood; it is therefore of great interest to study the role of multistreaming in the formation and evolution of cosmic structure. This problem is now being attacked with the aid of high accuracy cosmological simulations. In this paper, we describe new methods of identifying multistreaming regions based on various velocity based feature extractors and perform particle and region tracking of multistreaming events. We find that incorporating particle velocity information in the analysis reveals new insights about the evolution of LSS.

    Index Terms—Cosmology, multistreaming, feature detection, particle tracking, region tracking, velocity field.


  • VLDB2011—CoPhy: A Scalable, Portable, and Interactive Index Advisor for Large Workloads
    Neoklis Polyzotis

    Index tuning, i.e., selecting the indexes appropriate for a workload, is a crucial problem in database system tuning. In this paper, we solve index tuning for large problem instances that are common in practice, e.g., thousands of queries in the workload, thousands of candidate indexes and several hard and soft constraints. Our work is the first to reveal that the index tuning problem has a well structured space of solutions, and this space can be explored efficiently with well known techniques from linear optimization. Experimental results demonstrate that our approach outperforms state-of-the-art commercial and research techniques by a significant margin (up to an order of magnitude).
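
    To illustrate the kind of structure that linear optimization can exploit, a generic binary-program formulation of index selection looks like the following (a common textbook-style formulation, not necessarily CoPhy's exact model): each query q is assigned one candidate plan p, a plan can be used only if every index it requires is built, and the built indexes must fit a storage budget B:

        \[
          \min \; \sum_{q} f_q \sum_{p \in P_q} c_{q,p}\, x_{q,p}
          \quad \text{s.t.} \quad
          \sum_{p \in P_q} x_{q,p} = 1 \;\; \forall q, \qquad
          x_{q,p} \le y_i \;\; \forall q,\; p \in P_q,\; i \in \mathrm{idx}(p), \qquad
          \sum_i s_i\, y_i \le B, \qquad
          x_{q,p},\, y_i \in \{0,1\},
        \]

    where f_q is the frequency of query q, c_{q,p} its estimated cost under plan p, and s_i the size of index i.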


  • VLDB2011—Eirene: Interactive Design and Refinement of Schema Mappings via Data Examples
    Wang-Chiew Tan

    One of the first steps in the process of integrating information from multiple sources into a desired target format is to specify the relationships, called schema mappings, between the source schemas and the target schema. In this demonstration, we showcase a new methodology for designing schema mappings. Our system Eirene interactively solicits data examples from the mapping designer in order to design a schema mapping between a source schema and a target schema. A data example, in this context, is a pair consisting of a source instance and a target instance showing the desired outcome of performing data exchange using the schema mapping being designed. One of the central parts of the system is a module that, given a set of data examples, either returns a “best” fitting schema mapping, or reports that no fitting schema mapping exists.


    2010 Papers

  • CIKM10—Discriminative Factored Prior Model for Personalized Content-Based Recommendation
    Yi Zhang, Lanbo Zhang

    Most existing content-based filtering approaches including Rocchio, Language Models, SVM, Logistic Regression, Neural Networks, etc. learn user profiles independently without capturing the similarity among users. The Bayesian hierarchical models learn user profiles jointly and have the advantage of being able to borrow information from other users through a Bayesian prior. The standard Bayesian hierarchical model assumes all user profiles are generated from the same prior. However, considering the diversity of user interests, this assumption might not be optimal. Besides, most existing content-based filtering approaches implicitly assume that each user profile corresponds to exactly one user interest and fail to capture a user's multiple interests (information needs).

    In this paper, we present a flexible Bayesian hierarchical modeling approach to model both commonality and diversity among users as well as individual users' multiple interests. We propose two models each with different assumptions, and the proposed models are called Discriminative Factored Prior Models (DFPM). In our models, each user profile is modeled as a discriminative classifier with a factored model as its prior, and different factors contribute in different levels to each user profile. Compared with existing content-based filtering models, DFPM are interesting because they can 1) borrow discriminative criteria of other users while learning a particular user profile through the factored prior; 2) trade off well between diversity and commonality among users; and 3) handle the challenging classification situation where each class contains multiple concepts. The experimental results on a dataset collected from real users on digg.com show that our models significantly outperform the baseline models of L-2 regularized logistic regression and the standard Bayesian hierarchical model with logistic regression.


  • ECRTS 2010: DP-FAIR: A Simple Model for Understanding Optimal Multiprocessor Scheduling
    Scott Brandt, Ian Pye

    We consider the problem of optimal real-time scheduling of periodic and sporadic tasks for identical multiprocessors. A number of recent papers have used the notions of fluid scheduling and deadline partitioning to guarantee optimality and improve performance. In this paper, we develop a unifying theory with the DP-FAIR scheduling policy and examine how it overcomes problems faced by greedy scheduling algorithms. We then present a simple DP-FAIR scheduling algorithm, DP-WRAP, which serves as a least common ancestor to many recent algorithms. We also show how to extend DP-FAIR to the scheduling of sporadic tasks with arbitrary deadlines.

    Euromicro Conference on Real-time Systems (ECRTS), 2010. ECRTS Best Paper Award.  ISBN: 978-0-7695-4111-2
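
    As a concrete illustration of the fluid/deadline-partitioning idea, the following hypothetical Python sketch performs a McNaughton-style wrap-around assignment within a single time slice of length L, giving each task u_i * L units of execution; it is a toy model, not the DP-WRAP implementation from the paper:

        # Toy wrap-around assignment within one slice: lay per-task allocations end to
        # end on a line of length num_procs * L and cut it at multiples of L.
        def wrap_slice(utilizations, num_procs, L):
            schedule = [[] for _ in range(num_procs)]   # per-processor (task, start, end)
            pos = 0.0                                   # position on the concatenated timeline
            for task, u in enumerate(utilizations):
                remaining = u * L
                while remaining > 1e-12:
                    proc = int(pos // L)
                    start = pos - proc * L              # offset within this processor's slice
                    run = min(remaining, L - start)     # a task may split at the wrap boundary
                    schedule[proc].append((task, start, start + run))
                    pos += run
                    remaining -= run
            return schedule

        # Three tasks with utilizations summing to 2.0 on 2 processors, slice length 1.0:
        for proc, jobs in enumerate(wrap_slice([0.8, 0.7, 0.5], 2, 1.0)):
            print(proc, jobs)

    A task split at the wrap boundary runs at the end of one processor's slice and the beginning of the next processor's slice, so its two segments never overlap in time.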


  • EuroSys 2010—RAID4S: Adding SSDs to RAID Arrays
    Scott Brandt, Carlos Maltzahn, Rosie Wacha, John Bent
  • FAST10—InfoGarden: A Casual-Game Approach to Digital Archive Management
    Carlos Maltzahn, Michael Mateas, Jim Whitehead
  • Mahanaxar: Quality of Service Guarantees in High-Bandwidth, Real-Time Streaming Data Storage
    Scott Brandt, David Bigelow, HB Chen, John Bent

    Large radio telescopes, cyber-security systems monitoring real-time network traffic, and others have specialized data storage needs: guaranteed capture of an ultra-high-bandwidth data stream, retention of the data long enough to determine what is "interesting," retention of interesting data indefinitely, and concurrent read/write access to determine what data is interesting, without interrupting the ongoing capture of incoming data. Mahanaxar addresses this problem. Mahanaxar guarantees streaming real-time data capture at (nearly) the full rate of the raw device, allows concurrent read and write access to the device on a best-effort basis without interrupting the data capture, and retains data as long as possible given the available storage. It has built-in mechanisms for reliability and indexing, can scale to meet arbitrary bandwidth requirements, and handles both small and large data elements equally well. Results from our prototype implementation show that Mahanaxar provides both better guarantees and better performance than traditional file systems.
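
    A minimal sketch of the retention behavior described above (capture everything into a fixed-size buffer, let readers mark records as interesting, keep marked records indefinitely) might look like the following; this is purely illustrative and not Mahanaxar's design or code:

        # Toy retention sketch: fixed-capacity capture buffer plus a pinned set.
        from collections import deque

        class RingCapture:
            def __init__(self, capacity):
                self.capacity = capacity
                self.buffer = deque()   # recent (record_id, data), oldest first
                self.pinned = {}        # records marked "interesting", kept indefinitely

            def capture(self, record_id, data):
                if len(self.buffer) >= self.capacity:
                    self.buffer.popleft()          # oldest entry ages out of the capture buffer
                self.buffer.append((record_id, data))

            def mark_interesting(self, record_id):
                for rid, data in self.buffer:
                    if rid == record_id:
                        self.pinned[rid] = data    # retained even after it ages out
                        return True
                return False

        cap = RingCapture(capacity=3)
        for i in range(5):
            cap.capture(i, "payload-%d" % i)
            if i == 1:
                cap.mark_interesting(1)
        print(list(cap.buffer), cap.pinned)        # record 1 survives in `pinned`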


  • Nonparametric Dark Energy Reconstruction from Supernova Data
    Herbie Lee, Bruno Sanso, Tracy Holsclaw, Ujjaini Alam, Katrin Heitmann, Salman Habib, David Higdon

    Understanding the origin of the accelerated expansion of the Universe poses one of the greatest challenges in physics today. Lacking a compelling fundamental theory to test, observational efforts are targeted at a better characterization of the underlying cause. If a new form of mass-energy, dark energy, is driving the acceleration, the redshift evolution of the equation of state parameter w(z) will hold essential clues as to its origin. To best exploit data from observations it is necessary to develop a robust and accurate reconstruction approach, with controlled errors, for w(z). We introduce a new, nonparametric method for solving the associated statistical inverse problem based on Gaussian Process modeling and Markov chain Monte Carlo sampling. Applying this method to recent supernova measurements, we reconstruct the continuous history of w out to redshift z=1.5.


  • Nonparametric Reconstruction of the Dark Energy Equation of State
    Herbie Lee, Bruno Sanso, Tracy Holsclaw, David Higdon, Ujjaini Alam, Salman Habib, Katrin Heitmann

    A basic aim of ongoing and upcoming cosmological surveys is to unravel the mystery of dark energy. In the absence of a compelling theory to test, a natural approach is to better characterize the properties of dark energy in search of clues that can lead to a more fundamental understanding. One way to view this characterization is the improved determination of the redshift-dependence of the dark energy equation of state parameter, w(z). To do this requires a robust and bias-free method for reconstructing w(z) from data that does not rely on restrictive expansion schemes or assumed functional forms for w(z). We present a new nonparametric reconstruction method that solves for w(z) as a statistical inverse problem, based on a Gaussian process representation. This method reliably captures nontrivial behavior of w(z) and provides controlled error bounds. We demonstrate the power of the method on different sets of simulated supernova data; the approach can be easily extended to include diverse cosmological probes.


  • Nonparametric Reconstruction of the Dark Energy Equation of State
    Herbie Lee, Bruno Sanso, Tracy Holsclaw, David Higdon, Katrin Heitmann, Salman Habib, Ujjaini Alam

    A basic aim of ongoing and upcoming cosmological surveys is to unravel the mystery of dark energy. In the absence of a compelling theory to test, a natural approach is to better characterize the properties of dark energy in search of clues that can lead to a more fundamental understanding. One way to view this characterization is the improved determination of the redshift-dependence of the dark energy equation of state parameter, w(z). To do this requires a robust and bias-free method for reconstructing w(z) from data that does not rely on restrictive expansion schemes or assumed functional forms for w(z). We present a new nonparametric reconstruction method that solves for w(z) as a statistical inverse problem, based on a Gaussian Process representation. This method reliably captures nontrivial behavior of w(z) and provides controlled error bounds. We demonstrate the power of the method on different sets of simulated supernova data; the approach can be easily extended to include diverse cosmological probes.

    Comments: 16 pages, 11 figures, accepted for publication in Physical Review D
    Subjects: Cosmology and Extragalactic Astrophysics (astro-ph.CO)
    Journal reference: Phys.Rev.D82:103502,2010
    DOI: 10.1103/PhysRevD.82.103502
    Report number: LA-UR-09-05888
    Cite as: arXiv:1009.5443v1 [astro-ph.CO]

  • PAN 2010: Detecting Wikipedia Vandalism using WikiTrust
    Luca de Alfaro , Bo Adler, Ian Pye

    WikiTrust is a reputation system for Wikipedia authors and content. WikiTrust computes three main quantities: edit quality, author reputation, and content reputation. The edit quality measures how well each edit, that is, each change introduced in a revision, is preserved in subsequent revisions. Authors who perform good quality edits gain reputation, and text which is revised by several high-reputation authors gains reputation. Since vandalism on the Wikipedia is usually performed by anonymous or new users (not least because long-time vandals end up banned), and is usually reverted in a reasonably short span of time, edit quality, author reputation, and content reputation are obvious candidates as features to identify vandalism on the Wikipedia. Indeed, using the full set of features computed by WikiTrust, we have been able to construct classifiers that identify vandalism with a recall of 83.5%, a precision of 48.5%, and a false positive rate of 8%, for an area under the ROC curve of 93.4%. If we limit ourselves to the set of features available at the time an edit is made (when the edit quality is still unknown), the classifier achieves a recall of 77.1%, a precision of 36.9%, and a false positive rate of 12.2%, for an area under the ROC curve of 90.4%.

    Using these classifiers, we have implemented a simple Web API that provides the vandalism estimate for every revision of the English Wikipedia. The API can be used both to identify vandalism that needs to be reverted, and to select high-quality, non-vandalized recent revisions of any given Wikipedia article. These recent high-quality revisions can be included in static snapshots of the Wikipedia, or they can be used whenever tolerance to vandalism is low (as in a school setting, or whenever the material is widely disseminated).


  • SIGIR10—Interactive Retrieval Based on Faceted Feedback
    Yi Zhang, Lanbo Zhang

    Motivated by the commonly used faceted search interface in e-commerce, this paper investigates an interactive relevance feedback mechanism based on faceted document metadata. In this mechanism, the system recommends a group of document facet-value pairs, and lets users select relevant ones to restrict the returned documents. We propose four facet-value pair recommendation approaches and two retrieval models that incorporate user feedback on document facets. Evaluated based on user feedback collected through Amazon Mechanical Turk, our experimental results show that the Boolean filtering approach, which is widely used in faceted search in e-commerce, doesn't work well for text document retrieval, due to the incompleteness (low recall) of metadata assignment in semi-structured text documents. Instead, a soft model performs more effectively. The faceted feedback mechanism can also be combined with document-based relevance feedback and pseudo relevance feedback to further improve retrieval performance.


  • SIGMOD/PODS 2010—An Automated, yet Interactive and Portable DB designer
    Neoklis Polyzotis, Karl Schnaitter

    Tuning tools attempt to configure a database to achieve optimal performance for a given workload. Selecting an optimal set of physical structures is computationally hard since it involves searching a vast space of possible configurations. Commercial DBMSs offer tools that can address this problem. The usefulness of such tools, however, is limited by their dependence on greedy heuristics, the need for a priori (offline) knowledge of the workload, and lack of an optimal materialization schedule to get the best out of suggested design features. Moreover, the open source DBMSs do not provide any automated tuning tools. This demonstration introduces a comprehensive physical designer for the PostgreSQL open source DBMS. The tool suggests design features for both offline and online workloads. It provides close-to-optimal suggestions for indexes for a given workload by modeling the problem as a combinatorial optimization problem and solving it with sophisticated and mature solvers. It also determines the interaction between indexes to suggest an effective materialization strategy for the selected indexes. The tool is interactive as it allows the database administrator (DBA) to suggest a set of candidate features and shows their benefits and interactions visually. For the demonstration we use large real-world scientific datasets and query workloads.


  • SIGMOD/PODS 2010—Computing Query Probability with Incidence Algebras
    Karl Schnaitter

    We describe an algorithm that evaluates queries over probabilistic databases using Möbius' inversion formula in incidence algebras. The queries we consider are unions of conjunctive queries (equivalently: existential, positive first-order sentences), and the probabilistic databases are tuple-independent structures. Our algorithm runs in PTIME on a subset of queries called "safe" queries, and is complete, in the sense that every unsafe query is hard for the class FP^{#P}. The algorithm is very simple and easy to implement in practice, yet it is non-obvious. Möbius' inversion formula, which is in essence inclusion-exclusion, plays a key role for completeness, by allowing the algorithm to compute the probability of some safe queries even when they have some subqueries that are unsafe. We also apply the same lattice-theoretic techniques to analyze an algorithm based on lifted conditioning, and prove that it is incomplete.
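
    For intuition, the inclusion-exclusion identity that Möbius inversion generalizes is the familiar one; for a union of subqueries Q_1, ..., Q_n it reads

        \[
          P\!\left(\bigvee_{i=1}^{n} Q_i\right)
            = \sum_{\emptyset \neq S \subseteq \{1,\dots,n\}} (-1)^{|S|+1}\,
              P\!\left(\bigwedge_{i \in S} Q_i\right),
        \]

    and, as the abstract notes, the Möbius function on the lattice of subqueries supplies the coefficients in general, so some terms cancel and certain intractable subqueries never need to be evaluated.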


  • The Scientist: Redesigning Scientific Reputation
    Luca de Alfaro , Bo Adler, Ian Pye

    2009 Papers

  • A Flexible Scheduling Framework Supporting Multiple Programming Models with Arbitrary Semantics in Linux
    Noah Watkins
  • A Benchmark for On-line Index Selection: SMDB 2009
    Neoklis Polyzotis, Karl Schnaitter

    Online approaches to physical design tuning have received considerable attention in the recent literature, with a focus on the problem of online index selection. However, it is difficult to draw conclusions on the relative merits of the proposed techniques, as they have been evaluated in isolation using different methodologies. In this paper, we make two concrete contributions to address this issue. First, we propose a benchmark for evaluating the performance of an online tuning algorithm in a principled fashion. Second, using the benchmark, we present a comparison of two representative online tuning algorithms that are implemented in the same database system. The results provide interesting insights on the behavior of these algorithms and validate the usefulness of the proposed benchmark.


  • A Metadata-Rich File System
    Carlos Maltzahn, Sasha Ames
  • Abstract Storage: Moving File Format-Specific Abstractions into Petabyte-Scale Storage Systems
    Scott Brandt, Carlos Maltzahn, Joe Buck, Noah Watkins

    Second International Workshop on Data-Aware Distributed Computing (in conjunction with HPDC-18), Munich, Germany, June 9, 2009.


  • BMVC 2009—Grammar-Guided Feature Extraction for Location-Based Object Detection
    David Helmbold, Damian Eads, Edward Rosten
  • Building a Parallel File System Simulator
    Scott Brandt, Carlos Maltzahn, Esteban Molina-Estolano, John Bent

    Parallel file systems are gaining in popularity in high-end computing centers as well as commercial data centers. High-end computing systems are expected to scale exponentially and to pose new challenges to their storage scalability in terms of cost and power. To address these challenges, scientists and file system designers will need a thorough understanding of the design space of parallel file systems. Yet there exist few systematic studies of parallel file system behavior at petabyte and exabyte scale. An important reason is the significant cost of getting access to large-scale hardware to test parallel file systems. To contribute to this understanding we are building a parallel file system simulator that can simulate parallel file systems at very large scale. Our goal is to simulate petabyte-scale parallel file systems on a small cluster or even a single machine in reasonable time and fidelity. With this simulator, file system experts will be able to tune existing file systems for specific workloads, scientists and file system deployment engineers will be able to better communicate workload requirements, file system designers and researchers will be able to try out design alternatives and innovations at scale, and instructors will be able to study very large-scale parallel file system behavior in the classroom. In this paper we describe our approach and provide preliminary results that are encouraging both in terms of fidelity and simulation scalability.


  • Comparing the Performance of Different Parallel Filesystem Placement Strategies
    Scott Brandt, Carlos Maltzahn, Esteban Molina-Estolano, John Bent

    Work-in-Progress Session, Conference on File and Storage Technology (FAST), San Francisco, CA, February 24-27, 2009


  • Cosmic Calibration - Statistical Modeling for Dark Energy and the Cosmological Constants
    Tracy Holsclaw
  • Data Reliability Techniques for Specialized Storage Environments
    Scott Brandt, Rosie Wacha, James Nunez, John Bent, Gary Grider
  • Depth Estimation for Ranking Query Optimization
    Neoklis Polyzotis, Karl Schnaitter
  • Exploring Multistreaming in the Universe (IEEE Vis 2009)
    Alex Pang, Eddy Chandra
  • Fusing Data Management Services with File Systems
    Neoklis Polyzotis, Scott Brandt, Carlos Maltzahn, Wang-Chiew Tan

    File systems are the backbone of large-scale data processing for scientific applications. Motivated by the need to provide an extensible and flexible framework beyond the abstractions provided by API libraries for files to manage and analyze large-scale data, we are developing Damasc, an enhanced file system where rich data management services for scientific computing are provided as a native part of the file system.

    This paper presents our vision for Damasc, a performant file system that would allow scientists or even casual users to pose declarative queries and updates over views of underlying files that are stored in their native bytestream format. In Damasc, a configurable layer is added on top of the file system to expose the contents of files in a logical data model through which views can be defined and used for queries and updates. The logical data model and views are leveraged to optimize access to files through caching and self-organizing indexing. In addition, provenance capture and analysis of file access is also built into Damasc. We describe the salient features of our proposal and discuss how it can benefit the development of scientific code.

    --

    Accepted for publication at 4th Petascale Data Storage Workshop (PDSW 09), November 15, 2009


  • Index Interactions in Physical Design Tuning: Modeling, Analysis, and Applications
    Neoklis Polyzotis, Lise Getoor, Karl Schnaitter
  • Learning Object Location Predictors with Boosting and Grammar-Guided Feature Extraction
    David Helmbold, Damian Eads
  • Material Classification with BRDF Slices (Computer Vision and Pattern Recognition, CVPR)
    James Davis, Steve Scher
  • Measuring Contributions to Email-Based Discussion Groups
    Luca de Alfaro , Ian Pye

    Email-based discussion groups are a vast source of non-canonical crowd-sourced information. However, due to their open nature (anyone can post), evaluating the quality of answers is challenging. In this work, we develop a framework for analyzing author contributions to email-based discussion groups.

    Sentiment analysis is the process of extracting the overall feeling from a body of text. We present a novel technique which applies sentiment analysis to evaluate the quality of answers. We present six novel algorithms and compare their results to a manually calculated baseline, two machine learning algorithms, and two algorithms based on link analysis. We find that by using sentiment analysis, our algorithms outperform both the machine learning and link analysis approaches in most experiments. We also find that a simple text-based approach without sentiment analysis is surprisingly powerful.


  • Mixing Hadoop and HPC Workloads on Parallel Filesystems
    Scott Brandt, Carlos Maltzahn, Esteban Molina-Estolano, John Bent

    MapReduce-tailored distributed filesystems—such as HDFS for Hadoop MapReduce—and parallel high-performance computing filesystems are tailored for considerably different workloads. The purpose of our work is to examine the performance of each filesystem when both sorts of workload run on it concurrently. We examine two workloads on two filesystems. For the HPC workload, we use the IOR checkpointing benchmark and the Parallel Virtual File System, Version 2 (PVFS); for Hadoop, we use an HTTP attack classifier and the CloudStore filesystem. We analyze the performance of each file system when it concurrently runs its “native” workload as well as the non-native workload.

    --

    Accepted for publication at 4th Petascale Data Storage Workshop (PDSW 09), November 15, 2009


  • Optimized Image Sampling for View and Light Interpolation (International Symposium on Virtual Reality, Archaeology and Cultural Heritage, VAST)
    James Davis, Steve Scher
  • PostgreSQL (To appear as a chapter in Database System Concepts, 6th ed.)
    Karl Schnaitter
  • Semantic Web for Search
    Jessica Gronski
  • SMDB 2009 A Benchmark for Online Index Selection
    Neoklis Polyzotis, Karl Schnaitter
  • SNS: A Simple Model for Understanding Optimal Hard Real-Time Multiprocessor Scheduling
    Scott Brandt, Greg Levin, Ian Pye
  • TREC2009—UCSC at Relevance Feedback Track
    Yi Zhang, Lanbo Zhang
  • You Find What You’re Looking For: Tracking with Detailed Models (Advancement to Candidacy)
    James Davis, Steve Scher


Theses and Dissertations

Investigating Efficient Real-time Performance Guarantees on Storage Networks
Andrew Shewmaker, MS Thesis, March 2009
University Advisor: Darrell Long
LANL Mentors: Gary Grider, James Nunez, John Bent

On-line Index Selection for Physical Database Tuning
Karl Schnaitter, Ph.D. Dissertation, June 2010
University Advisor: Neoklis Polyzotis
LANL Mentors: Gary Grider, James Nunez, John Bent

Data Reliability Techniques for Specialized Storage Environments
Rosie Wacha, MS Thesis, December 2008
University Advisor: Scott Brandt
LANL Mentors: Gary Grider, James Nunez, John Bent

Exploring Multistreaming in the Universe
Eddy Chandra, MS Thesis, June 2009
University Advisor: Alex Pang
LANL Mentors: Katrin Heitmann, James Ahrens

Entropy Regularization and Soft Margin Maximization
Karen Glocer, Ph.D. Dissertation, December 2009
University Advisor: Manfred Warmuth
LANL Mentors: James Theiler, Simon Perkins

Quality-of-Service Issues in Storage Systems
Joel Wu, Ph.D. Dissertation, June 2009
University Advisor: Scott Brandt
LANL Mentors:

Information-driven Cooperative Sampling Strategies for Spatial Estimation by Robotic Sensor Networks
Rishi Graham, Ph.D. Dissertation, June 2010
University Advisor:
LANL Mentors: David Higdon, Katrin Heitmann, Salman Habib

Secure, Energy-Efficient, Evolvable, Long-Term Archival Storage
Mark Storer, Ph.D. Dissertation, February 2009
University Advisor: Ethan Miller
LANL Mentors:

Reliability and Power-Efficiency in Erasure-Coded Storage Systems
Kevin Greenan, Ph.D. Dissertation, December 2009
University Advisor: Ethan Miller
LANL Mentors:

Reliability Mechanisms for File Systems Using Non-Volatile Memory as a Metadata Store
Kevin Greenan, MS Thesis, March 2006
University Advisor: Ethan Miller
LANL Mentors:

Organizing, Indexing, and Searching Large-Scale File Systems
Andrew Leung, Ph.D. Dissertation, June 2009
University Advisor: Ethan Miller
LANL Mentors:

Medium Access Control in Ad Hoc Networks with Omni-Directional Antennas
Caixue Lin, Ph.D. Dissertation, June 2006
University Advisor: Scott Brandt
LANL Mentors:

Predictable High Performance Data Management - Leveraging system resource characteristics to efficiently improve performance and predictability
Tim Kaldewey, Ph.D. Dissertation, March 2010
University Advisor: Scott Brandt
LANL Mentors:

Efficient Guaranteed Disk I/O Performance Management
Anna Povzner, Ph.D. Dissertation, June 2010
University Advisor: Scott Brandt
LANL Mentors:

Ringer: Distributed Naming on a Global Scale
Ian Pye, MS Thesis, December 2008
University Advisor: Luca de Alfaro
LANL Mentors: Shelly Spearing, Jorge Roman

Efficient Performance Guarantees on Storage Networks
Andrew Shewmaker, Proposed Ph.D. Dissertation, December 2010
University Advisor: Scott Brandt
LANL Mentors: Gary Grider, James Nunez, John Bent

XQuery over Scientific Data Formats
Richa Khandelwal, MS Thesis, June 2010
University Advisor: Scott Brandt
LANL Mentors: Gary Grider, John Bent, James Ahrens, Carolyn Connor

Boosting in Location Space
Damian Eads, Ph.D. Dissertation, May 2011
University Advisor: David Helmbold
LANL Mentors: James Theiler

Statistical Modeling for Dark Energy and Associated Cosmological Constants
Tracy Holsclaw, Ph.D. Dissertation, May 2011
University Advisor: Herbie Lee
LANL Mentors: David Higdon, Katrin Heitmann

Managing High-Bandwidth Real-Time Streaming Data
David Bigelow, MS Thesis, May 2011
University Advisor: Scott Brandt
LANL Mentors: Gary Grider, James Nunez, John Bent

Defense: Management of High-Volume Real-Time Streaming Data in Transient Environments
David Bigelow, Ph.D. Dissertation, August 2012
University Advisor: Scott Brandt
LANL Mentors: Gary Grider, John Bent, HB Chen, Sarah Michalak

Defense: Modthresh Improvements Over Standard RAID4S
Rosie Wacha, Ph.D. Dissertation, December 2012
University Advisor: Scott Brandt
LANL Mentors: Gary Grider, James Nunez, John Bent, Meghan McClelland
