Turning up the volume on big data
Scale, immediacy, & continuous improvement
The datacenter as a computer
Leveraging human intelligence and activity
Working at the intersection of three massive trends (powerful machine learning, cloud computing, and crowdsourcing), the AMPLab is creating a new Big Data analytics platform that combines Algorithms, Machines, and People to make sense of data at scale.
Machine learning (ML) turns data into information and knowledge. While it is useful to view ML as a toolbox that can be deployed for many data-centric problems, our long-term goal is more ambitious—we are developing ML as a full-fledged engineering discipline.
Many claim that the “datacenter is the new computer,” but datacenters do not provide the key services needed for managing and understanding massive data. We are developing a scalable software platform that makes using a datacenter for analytics as easy as using an individual computer is today.
People will play a key role in Big Data applications – not simply as passive consumers of results, but as active providers and gatherers of data, and as solvers of ML-hard problems that algorithms on their own cannot solve. The AMPLab is building tools that include people, both as individuals and as crowds, in all phases of the analytics lifecycle.
News
- [O’Reilly Data Show] Podcast: Ben Lorica interviews Michael Franklin on the lasting legacy of AMPLab. - 11.17.16
- CACM November 2016: “Apache Spark: A Unified Engine for Big Data Processing” - 11.02.16
- The Computing Research Association Releases Statement on Data Science - 10.10.16
- [Science Daily] Data-cleaning tool for building better prediction models - 09.15.16
- Sanjay Krishnan et al. Win SIGMOD 2016 Best Demo Award - 06.30.16
Featured Project: Succinct – Fast, Memory-Efficient Storage
In-memory query execution is one of the keys to building low-latency, high-throughput distributed data stores for interactive queries (e.g., search, random access, regular expressions). However, as web services scale to larger data sizes, executing queries in memory becomes increasingly challenging. Past research has left a wide gap between systems that use fast but memory-intensive techniques such as secondary indexes, and slow but memory-efficient techniques such as data scans. To fill this gap, we have developed Succinct, a fast and memory-efficient distributed data store for interactive queries.
Unlike existing systems for interactive queries, Succinct simultaneously achieves three desirable properties: (a) support for sophisticated queries; (b) scalability with increasing data sizes; and (c) query interactivity. Succinct achieves these properties by executing a wide range of queries (search, count, random access, range queries, regular expressions) directly on a compressed representation of the input data.
Succinct uses a compression technique that empirically achieves compression close to that of gzip while supporting the above queries without the need for secondary indexes, data scans or data decompression. Succinct is thus able to execute queries in-memory for a much larger range of input sizes than systems that store indexes. For instance, on an Amazon EC2 cluster with 256GB main memory, Succinct is able to store more than half a terabyte of raw data in main memory while executing search queries in tens of milliseconds.
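The core idea above – answering search queries directly against a representation that doubles as both the data and the index, with no full scan – can be illustrated with a toy sketch. Succinct's actual technique is built on entropy-compressed suffix-array structures; the simplified version below keeps the text in plain form and uses an ordinary suffix array purely to show how substring search reduces to binary search rather than a scan. The function names here are illustrative, not Succinct's API.

```python
def build_suffix_array(text):
    """Offsets of all suffixes of `text`, sorted lexicographically.

    (Succinct stores a compressed form of this structure; the plain
    sorted list here is only for illustration.)
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def search(text, sa, query):
    """Return all offsets where `query` occurs, via two binary searches
    over the suffix array -- no linear scan of the data."""
    m = len(query)
    # Leftmost suffix whose length-m prefix is >= query.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < query:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    # Leftmost suffix whose length-m prefix is > query.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= query:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


text = "banana"
sa = build_suffix_array(text)
print(search(text, sa, "ana"))  # offsets 1 and 3
```

Both lookups cost O(m log n) string comparisons regardless of how many bytes the store holds, which is the property that lets this style of structure answer interactive queries without decompressing or scanning the underlying data.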
We recently announced the release of Succinct Spark, a Spark package that enables queries on compressed RDDs. To read more about Succinct and Succinct Spark, see the Succinct webpage, the NSDI ’15 and NSDI ’16 papers on Succinct, and our blog post on Succinct Spark. We at the AMPLab are working on several exciting follow-up projects on Succinct, and we will write about them soon on the Succinct webpage. Stay tuned!