DataHub: A Collaborative Data Analytics and Visualization Platform

Thursday, June 19, 2014 - 7:00pm
Sam Madden, Professor of Electrical Engineering and Computer Science in MIT CSAIL

In this talk, I will describe a new system we are building at MIT, called DataHub. DataHub is a hosted interactive data processing, sharing, and visualization system for large-scale data analytics. Key features of DataHub include:

(i) Flexible ingest and data cleaning tools that help massage data into a form that users' programs can operate on. This includes both removing irregularities and exposing structure in unstructured data such as text files and images.

(ii) A scalable, parallel, SQL-based analytic data processing engine optimized for extremely low-latency operation on large data sets, exploiting the massive parallelism available in modern GPUs and upcoming manycore CPUs.

(iii) An interactive visualization system that is tightly coupled to the data processing and lineage engine. Specifically, DataHub provides a workflow-based visualization engine where users can choose from a library of pre-built visualizations, or define their own visualizations via a simple API. Analysis and visualization steps may run on either CPUs or manycore/GPU devices.

(iv) Finally, DataHub is a hosted data platform, designed to eliminate the need for users to manage their own databases. It includes features that allow users to selectively share their data with other users, using complex context-sensitive predicates (e.g., that data about particular times or locations should be visible only to particular users).
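
As a rough illustration of the predicate-based sharing described in (iv), the sketch below filters rows by per-viewer rules over time and location. It is not DataHub's API: the `Reading` record, the `visible_rows` function, the `rules` mapping, and the sample data are all hypothetical names and values chosen for this example.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, Dict, List


# Hypothetical record type; the field names are illustrative only.
@dataclass(frozen=True)
class Reading:
    owner: str
    timestamp: datetime
    location: str
    value: float


# A sharing rule is simply a predicate over a record.
Predicate = Callable[[Reading], bool]


def visible_rows(rows: List[Reading], viewer: str,
                 rules: Dict[str, List[Predicate]]) -> List[Reading]:
    """Return rows the viewer owns, plus rows matched by any rule granted to the viewer."""
    granted = rules.get(viewer, [])
    return [r for r in rows
            if r.owner == viewer or any(rule(r) for rule in granted)]


rows = [
    Reading("alice", datetime(2014, 6, 19, 9, 0), "Cambridge", 1.0),
    Reading("alice", datetime(2014, 6, 19, 22, 0), "Cambridge", 2.0),
    Reading("alice", datetime(2014, 6, 19, 10, 0), "Boston", 3.0),
]

# Assumed rule: Bob may see Alice's Cambridge data recorded between 09:00 and 17:00.
rules = {
    "bob": [lambda r: r.location == "Cambridge" and 9 <= r.timestamp.hour < 17],
}

shared_with_bob = visible_rows(rows, "bob", rules)
```

Here the owner always sees her own rows, a viewer with no rules sees nothing, and Bob sees only the rows that satisfy his time-and-location predicate. A hosted system would presumably evaluate such predicates inside the query engine rather than in client code.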

This joint meeting of the Boston Chapter of the IEEE Computer Society and GBC/ACM will be held in MIT Room E51-325. E51 is the Tang Center on the corner of Wadsworth and Amherst Sts and Memorial Dr.; it's mostly used by the Sloan School. You can see it on this map of the MIT campus. Room 325 is on the 3rd floor.

Samuel Madden is a Professor of Electrical Engineering and Computer Science in MIT's Computer Science and Artificial Intelligence Laboratory. His research interests include databases, distributed computing, and networking. His research projects include the C-Store column-oriented database system, the CarTel mobile sensor network system, and the Relational Cloud "database-as-a-service". Madden is a leader in the emerging field of "Big Data", heading the Intel Science and Technology Center (ISTC) for Big Data, a multi-university collaboration on developing new tools for processing massive quantities of data. He also leads BigData@CSAIL, an industry-backed initiative to unite researchers at MIT and leaders from industry to investigate issues related to systems and algorithms for data that is high-rate, massive, or very complex.

Madden received his Ph.D. from the University of California at Berkeley in 2003, where he worked on the TinyDB system for data collection from sensor networks. He was named one of Technology Review's Top 35 Under 35 in 2005 and is the recipient of several awards, including an NSF CAREER Award in 2004, a Sloan Foundation Fellowship in 2007, best paper awards at VLDB 2004 and 2007, and a best paper award at MobiCom 2006.