Image and signal sensors collect data at ever-increasing rates, making it a great challenge to search and organize data in archives. Work is underway to engage the sensor, data management, and machine learning expertise of LANL and UCSC to tackle adaptive, content-based search in large remote sensing archives. We demonstrate the utility of a new method for extracting features from imagery and signals to aid the archival search problem.
CASCC is a new algorithm for classifying time series. It is highly competitive in terms of speed and accuracy compared to many other algorithms. It is inspired by another leading algorithm, DTW-1NN, but does not suffer the same computational limitations when applying the model to new time series.
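For context, the expensive step in DTW-1NN is the dynamic-programming distance computed against every stored training series. A minimal illustrative sketch in plain Python (this makes no claims about CASCC's internals):

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two sequences: the
    O(len(a) * len(b)) dynamic program at the heart of DTW-1NN."""
    INF = float("inf")
    n, m = len(a), len(b)
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j],      # advance in a only
                                    cost[i][j - 1],      # advance in b only
                                    cost[i - 1][j - 1])  # advance in both
    return cost[n][m]
```

Classifying one new series with DTW-1NN means running this against every training series, which is the computational limitation referred to above.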
RAID systems have traditionally offered increased performance and data security in small storage systems. An opportunity now exists to extend traditional RAID principles into the area of large-scale object-based storage devices in order to offer greater data security and space efficiency. In a system where component failures can be expected on a daily basis, the importance of redundancy mechanisms is obvious, and RAID principles offer an appropriate model. Ceph is an excellent platform with which to explore these ideas.
We propose to do fundamental research in the development of self-organizing systems that automatically configure their physical schema, in an on-line fashion.
Pseudorandom placement in distributed storage systems offers scalability benefits, but it also makes load balancing harder, so new techniques are required. We explore different load balancing techniques using Ceph, an object-based storage system developed at UCSC.
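One simple flavor of pseudorandom placement, sketched here as rendezvous (highest-random-weight) hashing rather than Ceph's actual CRUSH algorithm, shows why it scales: any client can compute an object's location with no central placement table.

```python
import hashlib

def place(object_id, devices, replicas=2):
    """Rendezvous-style pseudorandom placement (illustrative, not CRUSH):
    rank every device by a per-object hash and keep the top `replicas`.
    Every client computes the same answer independently."""
    ranked = sorted(
        devices,
        key=lambda dev: hashlib.sha1(f"{object_id}:{dev}".encode()).hexdigest(),
    )
    return ranked[:replicas]
```

The load balancing difficulty is also visible here: the hash is oblivious to current device load, so some devices inevitably run hotter than others.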
We propose to work on the problem of calibrating the parameters of computer code used for simulation of physical phenomena. We will explore statistical methods based on a Bayesian approach implemented with Sampling Importance Resampling (SIR).
The ISIS team in ISR-2 primarily uses supervised learning techniques to solve classification problems in imagery and therefore has a strong interest in finding linear classification algorithms that are both robust and efficient. Boosting algorithms take a principled approach to finding linear classifiers, and they have been shown to be so effective in practice that they are widely used in a variety of domains. In this proposal we present evidence that smoothing is not necessarily the optimal approach.
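As an illustration of how boosting builds a linear classifier, here is a textbook AdaBoost sketch over 1-D threshold stumps (not the proposal's algorithm): each round reweights examples the current ensemble misclassifies, and the final classifier is a weighted linear vote over stumps.

```python
import math

def adaboost_stumps(xs, ys, rounds=10):
    """AdaBoost over 1-D threshold stumps: each round fits the best stump
    under the current example weights, then boosts the weight of
    misclassified examples. Labels are +1/-1."""
    n = len(xs)
    w = [1.0 / n] * n
    ensemble = []  # (alpha, threshold, polarity)
    thresholds = sorted(set(xs))
    for _ in range(rounds):
        best = None
        for t in thresholds:
            for pol in (1, -1):
                preds = [pol if x > t else -pol for x in xs]
                err = sum(wi for wi, p, y in zip(w, preds, ys) if p != y)
                if best is None or err < best[0]:
                    best = (err, t, pol, preds)
        err, t, pol, preds = best
        err = min(max(err, 1e-10), 1.0 - 1e-10)  # guard the log below
        alpha = 0.5 * math.log((1.0 - err) / err)
        ensemble.append((alpha, t, pol))
        # Reweight: misclassified examples (y * p = -1) gain weight.
        w = [wi * math.exp(-alpha * y * p) for wi, y, p in zip(w, ys, preds)]
        total = sum(w)
        w = [wi / total for wi in w]

    def classify(x):
        score = sum(a * (p if x > t else -p) for a, t, p in ensemble)
        return 1 if score >= 0 else -1

    return classify
```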
Both the size of the data sets and the uncertainty in them stem from the fact that we are dealing with ensemble data sets. These usually come from Monte Carlo simulations in which each output (out of many runs) represents a possible solution. The degree of agreement (or disagreement) among runs provides some indication of certainty (or uncertainty) about the results. Because Monte Carlo simulations can involve a large number of repetitions, the total data size can very quickly become very large. This project will explore uncertainty and how visualization can be used as a tool to help deal with it.
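A minimal sketch of the summary such a visualization starts from: per-location mean and spread across ensemble members, where large spread flags regions of disagreement (the function name is illustrative).

```python
import statistics

def ensemble_summary(runs):
    """Per-location mean and standard deviation across ensemble members
    (e.g. Monte Carlo runs). Each run is a list of values on the same
    grid; high stdev marks locations where the runs disagree."""
    means, stdevs = [], []
    for values in zip(*runs):  # values at one location, across all runs
        means.append(statistics.fmean(values))
        stdevs.append(statistics.stdev(values))
    return means, stdevs
```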
The research objective of this proposal is to measure human body shape and motion without augmenting the subject. The hypothesis is that replacing traditional cameras with high accuracy 3D shape measurement devices and utilizing a carefully constructed prior model of human surface shape are the critical factors that have been missing from prior attempts to meet this goal. The long term accuracy targets are shape to 1mm and motion to 1deg.
To help users make sense of collaboratively-generated information, we are developing algorithmic notions of information trust.
We focus on developing intelligent interactive search and browsing techniques to help users find the information they are looking for among billions of non-relevant files.
In large tightly coupled parallel systems, computation proceeds only as fast as the slowest part. For this reason it is necessary to pursue deterministic behavior in all parts of the system. Quality of Service is one way to assist in providing deterministic behavior. This project will explore providing Quality of Service on networks of interest to high performance computing.
We are planning to develop a web-based system to help 'decision makers' quickly identify and process relevant web-based information in the case of a disease outbreak. We will work on identifying the pathogen based on sequence information. We will also develop an adaptive information filtering system to find, filter and condense the information available on the web.
This project explores using Open Speed Shop framework plugins to provide scalable memory usage and correctness tools.
This project explores how well data-intensive programming and run-time paradigms, such as MapReduce and other data-flow graph models, apply to scientific applications.
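The MapReduce paradigm reduces to three phases: map each record to key-value pairs, shuffle (group by key), and reduce each group. A minimal in-process sketch with the canonical word-count example (input strings are illustrative):

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Minimal in-process MapReduce: map each record to (key, value)
    pairs, group values by key (the 'shuffle'), then reduce each group.
    A real framework distributes each phase across nodes."""
    shuffled = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            shuffled[key].append(value)
    return {key: reducer(key, values) for key, values in shuffled.items()}

# Word count, the canonical example:
counts = map_reduce(
    ["halo finder", "halo mass"],
    mapper=lambda line: [(word, 1) for word in line.split()],
    reducer=lambda key, values: sum(values),
)
```

The open question for scientific codes is whether their communication patterns fit this map/shuffle/reduce shape as cleanly as text processing does.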
The concept is to move HPC file system storage toward supporting file formats within the file system itself. It is quite possible that many file types, such as N processes writing to one file with small strided writes, might be served well by special handling at the file system level. Decades ago, file types and access methods were supported within a single file system: the IBM MVS storage systems allowed for many different file types, including partitioned data sets, indexed sequential, virtual sequential, and plain sequential, to name a few. Storage for modern HPC systems may benefit from a new parallel/scalable version of file types. Much research remains to determine the usefulness of this concept and how it would work with modern supercomputers and future HPC languages and operating environments.
Data users need effective means to locate their files in huge and increasingly scaled-out file systems containing millions or even billions of files. Simple directories are not sufficient for managing this number of files. Users already have search tools for file systems, but they are based on directory-structure-oriented searches. We propose to explore how indexing of metadata about files could give users more utility when searching for files in extremely large parallel file systems. We will concentrate on the interfaces users would use to search indexed metadata about files.
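The core idea, querying an index instead of walking directories, can be sketched as an inverted index from (attribute, value) pairs to file paths (names and the metadata schema here are illustrative):

```python
def build_index(files):
    """Invert per-file metadata into (key, value) -> {paths}, so a
    search never has to traverse the directory tree."""
    index = {}
    for path, metadata in files.items():
        for item in metadata.items():
            index.setdefault(item, set()).add(path)
    return index

def query(index, **attrs):
    """Files matching every given attribute: intersect posting sets."""
    postings = [index.get(item, set()) for item in attrs.items()]
    return set.intersection(*postings) if postings else set()
```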
Storage system design and optimization require usage data in order to determine where in the system to focus resources. Typically this usage data is created by observing the response of the system while running synthetic benchmarks designed to place high stress on particular aspects of the system. The problem with many benchmarks is that they encourage improvements to areas of the system that do not improve actual user performance on typical workloads. Ideally, benchmarks should resemble actual applications so that improvements to storage systems translate into real gains. User applications are often proprietary or secure, so using them directly as benchmarks is often impractical. This project will explore how synthetic benchmark applications can be generated from traces of real parallel user applications.
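One hedged sketch of the idea: reduce a trace to an empirical model of its request mix, then draw a synthetic workload from that model. The trace format and function names are assumptions for illustration only.

```python
import random
from collections import Counter

def model_from_trace(trace):
    """Reduce an I/O trace of (op, size) records to an empirical
    distribution of request types."""
    return Counter(trace)

def synthesize(model, n, seed=0):
    """Draw a synthetic workload of n requests matching the traced
    request mix, without revealing the original application."""
    rng = random.Random(seed)
    requests, weights = zip(*model.items())
    return rng.choices(requests, weights=weights, k=n)
```

A real trace-driven generator would also need to preserve temporal ordering and inter-request dependencies, which a simple frequency model like this discards.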
In large scale storage and file systems, multiple parallel applications may request service simultaneously. In parallel workloads it is vital that deterministic service be provided, as an application proceeds only as fast as its slowest process. With mixed workloads this is very difficult to do in a parallel setting. We will explore how QoS support could be added to portions of the I/O stack in object-based parallel file systems to enable determinism for multiple parallel applications in a mixed parallel workload environment.
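A standard building block for this kind of I/O QoS is a per-class token bucket, sketched below as an illustration rather than the proposed design: each application class accumulates tokens at its reserved rate, and a request is admitted only when its class has budget.

```python
class TokenBucket:
    """Token-bucket rate limiter (illustrative): give each application
    class its own bucket so one bursty workload cannot starve the rest."""

    def __init__(self, rate, capacity):
        self.rate = rate          # tokens replenished per unit of time
        self.capacity = capacity  # maximum burst allowance
        self.tokens = capacity    # start with a full bucket
        self.last = 0.0

    def allow(self, now, cost=1):
        """Admit a request of the given cost at time `now` if this class
        still has budget; otherwise the request must wait."""
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```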
The Cell is powerful, but challenging to program. Concurrent programming models for the Cell Broadband Engine may help. Everything old is new again: Cilk can provide a smooth transition of older C codes to the Cell.
One of the major challenges facing a globally distributed enterprise is storing and retrieving documents. This task requires features from distributed storage systems, databases and the Internet. However, because of the unique nature of the documents to be stored and the access pattern of the documents, none of these alone suffice to provide truly reliable and searchable distributed document management. We present Ringer, an overlay network which combines elements of peer-to-peer storage, distributed query processing and indexing to create an Internet-scale content distribution and retrieval network.
Large scale storage systems often hold massive, highly dimensional data sets and massive numbers of files in directory structures. This project explores solutions for indexing and managing these data sets.