|TUESDAY, SEPTEMBER 10, 2013, 3:45-4:25PM SESSIONS|
|A||Dynamic Reduction of Query Result Sets for Interactive Visualization||Leilani Battle and Michael Stonebraker / MIT; Remco Chang / Tufts University|
|B||Delivering Big Data Technologies as a Service||Dj Das and a technical team / Third Eye Consulting Services & Solutions LLC|
|WEDNESDAY, SEPTEMBER 11, 2013, 3:40-4:20PM SESSIONS|
|C||Analyzing 30 Million Patient Lives Across a Million Variable Combinations||Arijit Sengupta, CEO / BeyondCore|
|D||On Creating a Scalable In-Memory Database for Real-time Analytics||Eric Frenkiel, CEO and Nikita Shamgunov, CTO / MemSQL|
Dynamic Reduction of Query Result Sets for Interactive Visualization
Leilani Battle and Michael Stonebraker / MIT
Remco Chang / Tufts University
The physical and biological sciences are increasingly data driven, often relying on massive quantities of data collected from imagers, sequencers, and other sensors. One key issue for scientists working with massive datasets is visualizing their data to extract patterns, observe anomalies, and debug workflows. Various systems today offer easy and fast ways to visualize scientific data: Tableau transforms data into detailed interactive visualizations, and analysis tools like Matlab and R are frequently used to visualize data. In addition, these systems can connect directly to databases for easy data access.

However, many of these systems do not scale seamlessly to massive scientific datasets. They often transfer data from the DBMS to the front-end visualization system, placing the burden of efficiently drawing massive query results on the visualizer. Many large-scale visualization systems instead rely on fitting entire datasets in memory, tying data analytics and visualization directly to the management of big data. This forces front-end visualizers to become data management systems themselves, limiting their use within the scientific community. In addition, many systems do not plot massive datasets effectively, resulting in severe over-plotting.

We address these problems for massive datasets stored in a DBMS by performing resolution reduction on query results. We insert a middle layer between the front-end and the DBMS that dynamically determines when the result of a query is too large to be effectively rendered by the front-end. The middle layer reduces the result accordingly by inserting aggregation, sampling, and/or filtering operations into the original query. Our prototype, ScalaR, gives the user a Google Maps-style front-end interface and connects to SciDB for data management. Our demo lets users freely and interactively explore a multi-terabyte database of satellite imagery, reducing resolution when necessary using the above techniques.
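As a rough illustration of the middle-layer idea, the sketch below checks a query's estimated result size and rewrites the query with a sampling clause when it exceeds the front-end's rendering budget. This is not ScalaR's actual code: the function names (`estimate_rows`, `reduce_query`) and the row budget are hypothetical, and a real system could instead inject aggregation (e.g. a SciDB regrid) or filtering.

```python
# Hypothetical sketch of a resolution-reducing middle layer between a
# visualization front-end and a DBMS; names and thresholds are illustrative.

MAX_RENDERABLE_ROWS = 100_000  # assumed front-end rendering budget

def estimate_rows(conn, query):
    """Ask the DBMS how many rows the query would return."""
    (count,) = conn.execute(f"SELECT COUNT(*) FROM ({query}) AS q").fetchone()
    return count

def reduce_query(query, estimated, budget=MAX_RENDERABLE_ROWS):
    """Pass small results through unchanged; sample large ones down
    to the rendering budget by rewriting the original query."""
    if estimated <= budget:
        return query
    return f"SELECT * FROM ({query}) AS q ORDER BY random() LIMIT {budget}"
```

The key design point the abstract describes is that the reduction happens in the query layer, so the front-end never has to receive more rows than it can draw.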
Delivering Big Data Technologies as a Service
Dj Das and a technical team / Third Eye Consulting Services & Solutions LLC
Big Data technologies like Hadoop, HBase, Cassandra, MongoDB, and their related ecosystems have matured over the last couple of years and are now being considered for production use at various enterprises. But maintaining, operating, and supporting users on these technologies on a daily basis brings a whole new set of challenges to enterprises' already stretched IT budgets. The time and effort such work requires, the high cost of retaining professionals adequately trained in these technologies, and the industry-wide shortage of such talent are some of the reasons inhibiting enterprises from embracing Big Data technologies. Delivering Big Data technologies as a service, hosted in the cloud or on-premises, is the answer, as this demo will showcase.
Analyzing 30 Million Patient Lives Across a Million Variable Combinations
Arijit Sengupta, CEO / BeyondCore
McKinsey and BeyondCore conducted a hypothesis-free analysis of 30M+ commercial claims lives to predict which patients are most likely to experience extreme increases in their total annual medical expenses. From an analytical perspective, the large number of potential variable combinations makes it impossible to preconceive of and evaluate all of the potential hypotheses. Moreover, even when a statistically significant pattern is detected, care must be taken to adjust for the confounding effects of all other variables. The analytical effort needed to conduct such an analysis on a dataset spanning demographics, diagnoses, treatments, preexisting conditions, medication usage, etc. is not manually tractable. Consequently, to make the analysis tractable, analysts tend to cherry-pick variables relevant to specific hypotheses. This, however, prevents a comprehensive review of the overall context. Using the power of Big Data analytics, Objective Health, a McKinsey & Company solution for healthcare providers, reversed the traditional approach: it first used BeyondCore's automated solution to rapidly evaluate a million variable combinations and detect statistically significant patterns, and then used experts to form and test hypotheses that might explain the identified patterns. The complete results of this unique analysis will be presented on September 27 at Strata Rx. This demo will focus on the technologies used in the analysis.
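The core difficulty described above, scanning many variable combinations while guarding against spurious significance, can be sketched in miniature. This toy example is not BeyondCore's algorithm: the chi-square test, the Bonferroni correction, and every name in it (`scan_pairs`, the variable and outcome labels) are assumptions made for illustration.

```python
# Toy hypothesis-free scan: test every pair of binary variables against a
# binary outcome, tightening the significance threshold for the number of
# combinations examined (naive Bonferroni correction).
import math
from itertools import combinations

def chi_square_2x2(a, b, c, d):
    """Chi-square statistic for a 2x2 contingency table [[a, b], [c, d]]."""
    n = a + b + c + d
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / den if den else 0.0

def p_value_1dof(stat):
    """Upper-tail p-value for chi-square with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))

def scan_pairs(records, variables, outcome, alpha=0.05):
    """Return the variable pairs significantly associated with the outcome."""
    tests = list(combinations(variables, 2))
    threshold = alpha / len(tests)  # adjust for multiple comparisons
    hits = []
    for v1, v2 in tests:
        both = lambda r: r[v1] and r[v2]
        a = sum(1 for r in records if both(r) and r[outcome])
        b = sum(1 for r in records if both(r) and not r[outcome])
        c = sum(1 for r in records if not both(r) and r[outcome])
        d = sum(1 for r in records if not both(r) and not r[outcome])
        p = p_value_1dof(chi_square_2x2(a, b, c, d))
        if p < threshold:
            hits.append((v1, v2, p))
    return hits
```

At a million variable combinations the same correction-and-scan logic applies, but, as the abstract notes, it must run automatically rather than hypothesis by hypothesis.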
On Creating a Scalable In-Memory Database for Real-time Analytics
Eric Frenkiel, CEO and Nikita Shamgunov, CTO / MemSQL
We will demonstrate a real-time analytics use case in which 25+ TB of data are analyzed and updated in memory across 100 machines, using a map-reduce paradigm exposed through an ANSI SQL interface. In addition to the real-time analysis, the database will sustain a high insertion rate of more than 1 million transactions per second. The talk will also review how lock-free data structures, namely lock-free skip lists and hash tables, enable the database to cope with high concurrency and highly parallelized use cases.
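The map-reduce-through-SQL pattern mentioned above can be illustrated with a minimal sketch. This is not MemSQL's implementation; it simply shows the standard idea that a distributed `AVG ... GROUP BY` stays exact when each shard emits partial `SUM`/`COUNT` pairs that an aggregator then merges.

```python
# Sketch of distributed SQL aggregation in map-reduce style: each shard
# computes partial (sum, count) per group; a merge step combines them.
from collections import defaultdict

def map_partial(shard_rows):
    """Per-shard partial aggregate: GROUP BY key, carrying (sum, count)."""
    partial = defaultdict(lambda: [0, 0])
    for key, value in shard_rows:
        partial[key][0] += value
        partial[key][1] += 1
    return dict(partial)

def reduce_partials(partials):
    """Merge shard partials; per-group AVG = total sum / total count."""
    merged = defaultdict(lambda: [0, 0])
    for partial in partials:
        for key, (s, c) in partial.items():
            merged[key][0] += s
            merged[key][1] += c
    return {key: s / c for key, (s, c) in merged.items()}
```

Keeping the partial state decomposable (sum and count rather than a precomputed average) is what lets the merge step run on any subset of shards in any order.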