DataHub: A Collaborative Data Analytics and Visualization Platform
In this talk, I will describe a new system we are building at MIT, called DataHub. DataHub is a hosted interactive data processing, sharing, and visualization system for large-scale data analytics. Key features of DataHub include:
(i) Flexible ingest and data cleaning tools that help massage data into a form that users' programs can operate on. This includes both removing irregularities and exposing structure in unstructured data such as text files and images.
(ii) A scalable, parallel, SQL-based analytic data processing engine optimized for extremely low-latency operation on large data sets by exploiting the massive parallelism available in modern GPUs and upcoming manycore CPUs.
(iii) An interactive visualization system that is tightly coupled to the data processing and lineage engine. Specifically, DataHub provides a workflow-based visualization engine where users can choose from a library of pre-built visualizations, or define their own visualizations via a simple API. Analysis and visualization steps may run on either CPUs or manycore/GPU devices.
(iv) Finally, DataHub is a hosted data platform, designed to eliminate the need for users to manage their own databases. It includes features that allow users to selectively share their data with other users, using complex context-sensitive predicates (e.g., that data about particular times or locations should be visible only to particular users).
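As a rough illustration of the sharing predicates in (iv), the sketch below models each sharing rule as a function of a viewer and a row, and shows a row to a viewer if any rule grants access. Every name here (Row, time_window_rule, location_rule, visible_rows) is hypothetical and chosen only for this example; none of it is DataHub's actual API.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, Iterable, List

@dataclass(frozen=True)
class Row:
    """Hypothetical record carrying the context a rule can inspect."""
    timestamp: datetime
    location: str
    value: float

# A sharing rule is a predicate over (viewer, row); assumed design, not DataHub's.
Rule = Callable[[str, Row], bool]

def time_window_rule(user: str, start: datetime, end: datetime) -> Rule:
    """Grant `user` access to rows with start <= timestamp < end."""
    return lambda viewer, row: viewer == user and start <= row.timestamp < end

def location_rule(user: str, location: str) -> Rule:
    """Grant `user` access to rows recorded at `location`."""
    return lambda viewer, row: viewer == user and row.location == location

def visible_rows(viewer: str, rows: Iterable[Row], rules: Iterable[Rule]) -> List[Row]:
    """Return the rows that at least one rule makes visible to `viewer`."""
    rules = list(rules)  # allow a generator to be reused for every row
    return [row for row in rows if any(rule(viewer, row) for rule in rules)]
```

Under this model a viewer with no matching rule sees nothing, so data stays private unless its owner explicitly shares it.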
This joint meeting of the Boston Chapter of the IEEE Computer Society and GBC/ACM will be held in MIT Room E51-325. E51 is the Tang Center on the corner of Wadsworth and Amherst Sts. and Memorial Dr.; it's mostly used by the Sloan School. Room 325 is on the 3rd floor.