Accepted Lightning Talk Abstracts

TUE 1: RAW: High-performance queries over raw data
Miguel Branco / EPFL

Databases deliver impressive performance for large classes of workloads as the result of decades of research. High performance, however, is achieved at the cost of versatility: databases operate efficiently only over loaded data, i.e., data converted from its original raw format into the database's internal format. For many user communities, however, loading (petabytes of) data into a database is not an option: it is too hard and costly. In this talk we present RAW, a database that answers queries directly over user files, in any format, without data loading or database tuning. RAW is more than a database: it is a research platform where many radical ideas are explored, such as removing the storage layer of a database, dynamically generating efficient access paths to external data, and creating adaptive tuning algorithms. We give a quick overview of RAW, including recent developments such as an ongoing integration with SciDB.
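
The in-situ idea can be sketched in miniature: the query is evaluated directly over the raw file, with parsing happening during the scan rather than in a prior load step. The toy Python below is only an illustration of that principle, not RAW's generated access paths:

```python
import csv
import io

# A raw CSV "file" that has never been loaded into a database.
raw_file = io.StringIO(
    "sensor,reading\n"
    "a,10\n"
    "b,25\n"
    "a,40\n"
)

def query_raw(f, predicate):
    """Scan the file in its original format, parsing rows on the fly
    and applying the predicate -- no loading or tuning step."""
    reader = csv.DictReader(f)
    return [row for row in reader if predicate(row)]

# Query: which sensors reported a reading above 20?
rows = query_raw(raw_file, lambda r: int(r["reading"]) > 20)
```

RAW additionally generates efficient, format-specific access paths at query time; the sketch above pays the full parsing cost on every scan.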

TUE 2: Three Generations of the Vertica Query Optimizer

Nga Tran / Vertica

The Vertica SQL Query Optimizer was written from the ground up for the Vertica Analytic Database. Its design, and the tradeoffs we encountered during implementation, support the case that the full power of novel database systems can be realized only with a custom Query Optimizer, carefully crafted exclusively for the system in which it operates.

In this talk, I will briefly describe how our experience with earlier optimizers at Vertica informed the design of the current Query Optimizer. I will then describe the main ideas behind our enumeration algorithms and cost models. Finally, I will quickly show experimental results that evaluate the accuracy of our cost model, as well as the average planning time and planning memory required for actual customer queries.

TUE 3: 2012 Government Big Data Solutions Award: Analyzing Cancer-Genome Relationships
Tom Plunkett / Oracle

Oracle and the Frederick National Lab for Cancer Research won the 2012 Government Big Data Solutions Award for analyzing cancer-genome relationships with the Oracle Big Data Appliance, CDH, Hadoop, Hive, and Mahout. The project was selected as the best among over eighty competing big data projects, and was also an Excellence.Gov finalist and a Computerworld Honors Laureate.

TUE 4: Distributed R for Big Data
Indrajit Roy / HP

Data scientists use sophisticated algorithms to obtain insights. However, what usually takes tens of lines of MATLAB or R code is now being rewritten in Hadoop-like systems and applied at scale in industry. Instead of rewriting algorithms in a new model, can we stretch the limits of R and reuse it for analyzing Big Data? We will present our early experiences at HP Labs and Vertica as we attempt to answer this question. We have created a distributed execution engine for R that scales horizontally, efficiently performs machine learning on billions of rows of data, and can analyze graphs with billions of vertices and edges.
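
The partitioned-execution idea behind such an engine can be sketched as follows (in Python here purely for brevity; the actual engine extends R): split a large dataset into partitions, apply a function to each partition in parallel, and combine the partial results:

```python
from concurrent.futures import ThreadPoolExecutor

# A "large" array, split into four partitions as a distributed engine would.
data = list(range(1, 101))
partitions = [data[i:i + 25] for i in range(0, 100, 25)]

def partial_sum_sq(part):
    """Per-partition work: each worker computes a partial sum of squares."""
    return sum(x * x for x in part)

# Run the partition function in parallel and combine the partial results.
with ThreadPoolExecutor(max_workers=4) as ex:
    total = sum(ex.map(partial_sum_sq, partitions))
```

The same split/apply/combine pattern underlies the distributed machine-learning and graph algorithms described in the abstract, with partitions spread across machines rather than threads.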

TUE 5: Low Latency Security Log Analysis with Hadoop HBase
Dianhui Zhu, Peter Ng, Dennis Bourque / Chevron

Our Web servers and firewalls produce several petabytes of data over the course of a few months. Current analytical approaches are slow and proprietary. The Hadoop Distributed File System (HDFS) provides a scalable data platform and a shareable data repository on which all other big data solutions build. In this talk, we will describe our efforts to build a log analytics system with HDFS and HBase for low-latency interactive queries. We will describe how we built HBase tables for log entries and indexed them for low-latency, real-time queries, reducing search time on extremely large datasets from hours to milliseconds. We will also share our experience of the challenges we faced building the system.
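
Low-latency scans in HBase hinge on row-key design, since HBase only indexes the sorted row key. One common pattern (shown here as an assumed illustration, not Chevron's actual schema) is to compose keys as source plus reversed timestamp, so a cheap prefix scan returns the newest log entries first:

```python
import bisect

MAX_TS = 10**10  # cap used to reverse timestamps (illustrative key design)

def row_key(source, ts):
    """Compose a row key as <source>:<reversed timestamp> so that a
    prefix scan on `source` returns the newest entries first."""
    return f"{source}:{MAX_TS - ts:010d}"

# A sorted key space standing in for an HBase table.
table = sorted(row_key("fw1", ts) for ts in [1000, 2000, 3000])

def scan_prefix(keys, prefix):
    """Range scan over sorted keys, as HBase performs for a row-key prefix."""
    lo = bisect.bisect_left(keys, prefix)
    hi = bisect.bisect_right(keys, prefix + "\xff")
    return keys[lo:hi]

newest_first = scan_prefix(table, "fw1:")
```

Because the scan touches only the contiguous key range for one source, latency stays in milliseconds regardless of how many other sources share the table.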

TUE 6: Scaling the CDD Vault Technology for a Global Community of Drug Discovery Researchers' Private and Public Use Cases
Krishna Dole / Collaborative Drug Discovery (CDD)

We will describe the challenges of developing an unusually complex Ruby on Rails application on top of a MySQL database (the CDD Vault). Idiosyncrasies of handling IP-sensitive drug discovery data posed unique scaling challenges given the rigid security, partitioning, quality, and performance requirements. The approach has proven reliable, secure, and scalable (165,442 logins; 226,332,761 datapoints associated with 19,314,542 molecules; >99.98% uptime; >9 years hosted). Securely integrating private data with external data allows collaborative technologies like the CDD Vault to handle IP-sensitive data alongside public data. The data-partitioning model uniquely accommodates drug discovery data in batch mode while also allowing partitioning down to an individual molecule batch or experimental measurement.

Layering unique collaborative capabilities on top of requisite drug discovery database functionality can unlock latent synergy between biologists and chemists. A novel collaborative web-based hosted platform to archive, mine, and collaborate around biological and chemical data transcends the traditional limitations of private and public databases by bringing private and external data (from collaborators and public sources) into an integrated application. We will present applications of collaborative technologies to interrogate the potency, selectivity, and therapeutic windows of small-molecule enzyme, cell, and animal study data. The first collaborative example involves industry, academia, and government working broadly across CNS therapeutic indications for the NIH Neuroscience Blueprint (BPN) project. The second includes 7 Big Pharmas working with academic, government, and non-profit institutions as part of the BMGF TB Drug Accelerator (TBDA) consortium. The third is a 4-team collaboration between the USA and Africa to identify combinations that reverse chloroquine resistance. The fourth integrates bioinformatics and chemoinformatics with in vitro experimental validation to identify two leads against putative new tuberculosis targets. Finally, the CDD Vault Enterprise (with the CDD API and Extensions) was connected to public databases (e.g. ChEMBL, ChemSpider, PubMed, ZINC) and private databases to identify putative new targets for drug repositioning. These examples demonstrate a more effective collaborative model that uses a secure, web-based platform to capture complementary, specialized drug discovery experimental data for commercial and neglected-disease applications. Next-generation technologies show the potential for further value through broader integration with the semantic web.

TUE 7: SciDB for Terabytes of Scientific Data - Experience at Berkeley Lab
Yushu Yao / LBNL/NERSC

As a division of Lawrence Berkeley National Laboratory, the National Energy Research Scientific Computing Center (NERSC) is the primary scientific computing facility for the U.S. Department of Energy. With increasing data analytics needs, NERSC started the "extra-large database for science" project. As part of this project, we set up a large SciDB cluster and brought in a number of science teams to use it. We focus on two objectives:
1. A powerful back-end for science web services (or online databases). Most existing online science databases let users search a "catalog" and download the "raw data" for offline analysis. With the growth of data this is no longer practical. Instead, these web services should let web users ask "smart questions" and return "answers" rather than "raw data", which requires a high-performance back-end;
2. An easy-to-learn, fast, interactive analytics platform. Solutions such as MPI+HDF5 have proven to perform well even at large scale (e.g. writing 50TB+ of data using 100,000+ CPU cores in 1 hour). However, the learning curve is steep, performance tuning is hard, and the hardware requirements are high; smaller projects normally don't have access to such expertise or hardware. We investigate whether SciDB, with its query-like language and automatic parallelism, can increase the productivity of a science project dealing with <50TB of data on <1000 CPU cores.

Current active projects using SciDB include bio-imaging, genomics, astronomy, and climate science. The analytical algorithms involved include not only querying and aggregation but also more complicated algorithms such as clustering and dimensionality reduction. In this talk, I will present 1-2 representative use cases for each of the above objectives and share our experiences, good and bad, of how well SciDB suits these scientific use cases.

TUE 8: GenBase: A Complex Analytics DBMS Benchmark
Sam Madden, Michael Stonebraker, Manasi Vartak, Rebecca Taft / Computer Science and Artificial Intelligence Laboratory, M.I.T.

Previous DBMS benchmarks have typically focused on DBMS operations and/or simple SQL aggregates. In scientific applications, there is a need for both data management services and much more complex analytics. This is typical of genomics applications and many other science domains. Hence, we first describe a genomics-oriented benchmark that typifies applications in this field. It consists of 5 queries with a mix of data management (joins, filters) and complex analytics (biclustering, regression, etc.). Then we run this benchmark on a variety of platforms for various sizes of genomics data. These include Hadoop, Postgres, R, SciDB, SciDB with a Xeon Phi hardware assist, and a popular columnar RDBMS. Some of the results are expected (Hadoop does not perform well, R does analytics well until it runs out of memory, SciDB does well in general), some are a little surprising (R performs terribly on data management), and some are counterintuitive (Xeon Phi does not help until the problem size gets large, the cost of conversion to Scalapack format is a big performance issue). The talk concludes with a collection of observations we have learned about mixing data management and complex analytics.
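
The mixed workload the benchmark targets can be sketched in a few lines: a relational half (filter and join) feeding an analytics half (regression). The toy tables and schema below are hypothetical stand-ins, not GenBase's actual data:

```python
from statistics import mean

# Hypothetical stand-ins for the benchmark's tables:
# genes: (gene_id, function); expr: (gene_id, dosage, expression).
genes = [("g1", "kinase"), ("g2", "kinase"), ("g3", "ligase")]
expr = [("g1", 1.0, 2.1), ("g2", 2.0, 3.9), ("g3", 3.0, 6.0)]

# Data-management half: filter on gene function, then join to expression data.
kinases = {gid for gid, fn in genes if fn == "kinase"}
joined = [(dose, e) for gid, dose, e in expr if gid in kinases]

# Analytics half: least-squares regression of expression on dosage
# over the joined rows.
xs = [x for x, _ in joined]
mx, my = mean(xs), mean(y for _, y in joined)
slope = (sum((x - mx) * (y - my) for x, y in joined)
         / sum((x - mx) ** 2 for x in xs))
```

The benchmark's point is precisely this coupling: a platform must move data efficiently between the relational operators and the numerical code, which is where the surveyed systems differ most.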

TUE 9: Building a Network of E-Commerce Concepts
Sandip Gaikwad, Sanjay Ghatare, Nish Parikh, Rajendra Shinde / eBay

We present a method for developing a network of e-commerce concepts. We define concepts as collections of terms that represent product entities or commerce ideas that users are interested in. We start with large corpora (billions of records) of historical eBay buyer queries and seller item titles. We treat concept extraction from these corpora as a market-baskets problem, adapting statistical measures of support and confidence. The concept-centric metadata extraction pipeline is built over a map-reduce framework. We constrain concepts to be both popular and concise. Evaluation of our algorithm shows that high-precision concept sets can be mined automatically. The system mines the full spectrum of precise e-commerce concepts, ranging all the way from "ipod nano" to "I'm not a plastic bag" and from "wakizashi sword" to "mastodon skeleton". Once concepts are detected, they are linked into a network using different metrics of semantic similarity, yielding a rich network of e-commerce vocabulary. Such a network of concepts can be the basis of powerful applications like e-commerce search and discovery as well as automatic e-commerce taxonomy generation. We present details of the extraction platform and algorithms for segmenting short snippets of e-commerce text as well as detecting and linking concepts.
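
The market-basket adaptation can be sketched as follows, treating each query as a basket of terms. The exact measures eBay uses are not specified in the abstract; the definitions below are a plausible, assumed reading of support and confidence for candidate phrases:

```python
queries = [
    "ipod nano 16gb", "ipod nano case", "ipod touch",
    "wakizashi sword", "wakizashi sword antique",
]

def support(phrase, corpus):
    """Fraction of queries (baskets) containing the candidate phrase."""
    return sum(phrase in q for q in corpus) / len(corpus)

def confidence(phrase, head, corpus):
    """Of the queries containing `head`, the fraction containing the full
    phrase -- a market-basket-style association from head term to phrase."""
    with_head = [q for q in corpus if head in q]
    return sum(phrase in q for q in with_head) / len(with_head)

s = support("ipod nano", queries)
c = confidence("ipod nano", "ipod", queries)
```

Thresholding on both measures yields phrases that are popular (high support) and cohesive (high confidence), matching the "popular and concise" constraint in the abstract; at eBay's scale these counts are computed over map-reduce rather than in memory.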

WED 1: Separation of Hadoop Processing and Storage at Netflix
Kurt Brown, Director, Data Platform / Netflix

The standard setup for Hadoop is HDFS for storage, coupled with MapReduce for processing, on a large cluster of machines. At Netflix, nearly all of our infrastructure is in the (Amazon) cloud. We've found that while the standard setup works, we get more benefit by decoupling the storage and processing. Rather than using intra-cluster HDFS, we store all of our data in an independent, distributed object store (S3) and stream it into our Hadoop clusters as needed. We give up some data locality, but this is outweighed by the durability, availability, and accessibility benefits we realize. It lets us spin up multiple Hadoop (and non-Hadoop) clusters that can all leverage the same data seamlessly. Software upgrades and recovery from massive cluster failures are minimally impactful events for us. While not the right paradigm for every large-scale data processing system, freeing your data from a "single" processing engine is worth consideration. The intrinsically scalable and ephemeral nature of cloud computing makes this especially compelling.
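
One way such a decoupled setup is commonly wired in Hadoop is via the S3 filesystem connector, so jobs read s3a:// paths instead of intra-cluster HDFS. The fragment below uses the stock S3A connector's property names as an assumed illustration; Netflix's actual deployment used its own S3 integration:

```xml
<!-- Hypothetical core-site.xml fragment: point Hadoop at an external
     object store so any cluster can read the same shared data. -->
<configuration>
  <property>
    <name>fs.s3a.impl</name>
    <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
  </property>
  <property>
    <name>fs.s3a.access.key</name>
    <value>YOUR_ACCESS_KEY</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>YOUR_SECRET_KEY</value>
  </property>
</configuration>
```

With storage externalized this way, a cluster holds no authoritative state, which is what makes clusters disposable and upgrades low-risk.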

WED 2: The Importance of Late Binding for Big Data Analytics
Stephen Brobst / Teradata Corporation

The term "Big Data" is very often associated with high volume data.  However, the "bigness" of the data is the least interesting aspect of big data analytics.  The true value and innovation in big data analytics comes from the ability to extract value from non-traditional data that does not necessarily fit cleanly into a relational model.  This gives rise to a new approach for acquiring data into an analytic repository using "NoETL" techniques, whereby structure is added to the data not at load time but at query time, using a late-binding style of data access.
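
A minimal sketch of the late-binding idea: raw records are stored verbatim at load time, and structure is imposed only when a query runs. The JSON records and field names below are illustrative, not from the talk:

```python
import json

# "NoETL" ingest: raw records are stored verbatim, with no schema applied.
raw_store = [
    '{"user": "a", "clicks": 3}',
    '{"user": "b", "clicks": 7, "referrer": "ad"}',  # new field, no reload
]

def query(store, field, predicate):
    """Late binding: parse and apply structure at query time, not load time."""
    for line in store:
        rec = json.loads(line)
        if field in rec and predicate(rec[field]):
            yield rec["user"]

heavy_clickers = list(query(raw_store, "clicks", lambda c: c > 5))
```

The payoff is visible in the second record: a field that did not exist at ingest time is immediately queryable, with no schema migration or re-load.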

WED 3: Large-Scale Text Analytics - MapReduce or Parallel DBMSs?
Fei Chen, Meichun Hsu / HP Labs

Text analytics has become increasingly important with the rapid growth of text data. While MapReduce has been the natural choice for tackling the scalability challenge in text processing, there is emerging interest in pushing text analytics inside scaled-out DBMSs, where text analysis programs are encapsulated as user-defined functions (UDFs) and DBMS engines are leveraged to combine text analysis functions with relational operators for large-scale, multi-step, complex text analysis tasks. Scaled-out structured DBMSs parallelize UDFs across multiple independent machines, as MapReduce platforms do, while offering built-in relational operators for manipulating large quantities of raw or intermediate data in their parallel query pipelines. The question is whether a structured DBMS is a viable technology for processing large amounts of text data. In this lightning talk we present our experiments processing multi-step information extraction tasks both on a scaled-out column-store DBMS and on a MapReduce platform. Information extraction derives structured data from text using a range of methods, from models constructed by machine learning algorithms to large numbers of manually constructed rules. It has received significant attention in recent years due to the explosive growth of text data, and it is often computationally intensive. The results of our experiments answer the question of whether a structured DBMS is a viable choice for analyzing unstructured text data. More importantly, this work illustrates a framework for constructing benchmarks for complex multi-step analytic data flows that require a combination of special analysis functions and relational operators on large amounts of text data, and offers insights on the tradeoffs between MapReduce and parallel DBMS technologies for text analytics.
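
The UDF-plus-relational-operator pattern can be sketched as follows: an extraction function plays the role of the UDF, and the surrounding comprehension plays the role of the engine's project-and-filter pipeline. The documents and regex here are illustrative, not from the experiments:

```python
import re

# Toy document table: (doc_id, text).
docs = [
    (1, "Contact alice@example.com"),
    (2, "no email here"),
    (3, "reach bob@test.org asap"),
]

def extract_email(text):
    """The 'UDF': an information-extraction step returning the first
    email-like token in a text field, or None."""
    m = re.search(r"[\w.]+@[\w.]+", text)
    return m.group(0) if m else None

# The 'relational' side: project each row through the UDF, then filter
# out rows where extraction found nothing.
result = [(doc_id, email) for doc_id, text in docs
          if (email := extract_email(text)) is not None]
```

In a scaled-out DBMS this pipeline runs in parallel across machines, with the extracted tuples feeding directly into joins and aggregates; on MapReduce the same UDF becomes the map function and the relational steps must be hand-built.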

WED 4: Efficient k-mer indexing for metagenomics search in NVRAM
Sasha Ames / Lawrence Livermore National Laboratory

Metagenomic analysis, the study of microbial communities found in environmental samples, presents considerable challenges in quantity of data and computational cost. We present a novel metagenomic analysis pipeline that leverages emerging large address space compute nodes with NVRAM to hold a searchable, memory-mapped "k-mer" database of all known genomes and their taxonomic lineage. We describe the challenges of creating these many-hundred-gigabyte databases and describe database organization optimizations that enable our Livermore Metagenomic Analysis Toolkit (LMAT) software to efficiently query the k-mer key-value store, which resides in high-performance flash storage, as if it were fully in memory. To make database creation tractable, we have designed, implemented, and evaluated an optimized ingest pipeline. To optimize query performance, we present a two-level index scheme that yields speedups of 8.4x to 74x over a conventional hash table index.
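
A two-level index of this general shape can be sketched as follows: a small first-level table keyed on a k-mer prefix, with each bucket holding a sorted array of (suffix, value) pairs searched by binary search. This is an assumed illustration of the technique, not LMAT's actual on-flash layout:

```python
import bisect

PREFIX = 2  # first-level key length (illustrative split, not LMAT's)

def build_index(kmers):
    """First level: a table on the k-mer prefix. Second level: per-bucket
    sorted arrays of (suffix, taxonomy-id) pairs."""
    index = {}
    for kmer, tax in kmers:
        index.setdefault(kmer[:PREFIX], []).append((kmer[PREFIX:], tax))
    for bucket in index.values():
        bucket.sort()
    return index

def lookup(index, kmer):
    """Probe the first level by prefix, then binary-search the bucket."""
    bucket = index.get(kmer[:PREFIX], [])
    i = bisect.bisect_left(bucket, (kmer[PREFIX:],))
    if i < len(bucket) and bucket[i][0] == kmer[PREFIX:]:
        return bucket[i][1]
    return None

idx = build_index([("ACGT", 9606), ("ACGG", 562), ("TTAA", 4932)])
hit = lookup(idx, "ACGG")
```

On memory-mapped flash, sorted buckets also give the access locality that a conventional hash table lacks, which is one plausible source of the reported speedups.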

WED 5: Delivering Big Data Technologies as a Service
Dj Das / Third Eye Consulting Services & Solutions LLC

Big Data technologies like Hadoop, HBase, Cassandra, MongoDB, and their related ecosystems have matured over the last couple of years and are now being considered for production use at various enterprises. But maintaining, operating, and supporting users on these technologies on a daily basis brings a whole new set of challenges to the already stretched IT budgets of enterprises. The time and effort such work requires, the high cost of keeping professionals adequately trained in these technologies, and the lack of manpower in the industry are some of the reasons inhibiting enterprises from embracing these Big Data technologies. Delivering Big Data technologies as a service, hosted in the cloud or on-premise, is the answer, as this talk will showcase.

WED 6: Analyzing 30 million patient lives across a million variable combinations
Arijit Sengupta / BeyondCore, CEO

McKinsey and BeyondCore conducted a hypothesis-free analysis of 30M+ commercial claims lives to predict which patients are most likely to experience extreme increases in their total annual medical expenses. From an analytical perspective, due to the large number of potential variable combinations, it is impossible to preconceive of and evaluate all of the potential hypotheses. Moreover, even when a statistically-significant pattern is detected, care must be taken to adjust for the confounding effects of all other variables. The amount of analytical effort to conduct such an analysis for a dataset spanning demographics, diagnoses, treatments, preexisting conditions, medication usage, etc. is not manually tractable. Consequently, to make the analysis tractable, analysts tend to cherry-pick variables relevant for specific hypotheses. This however prevents a comprehensive review of the overall context. Using the power of Big Data analytics, the consultancy Objective Health, a McKinsey & Company solution for healthcare providers, reversed the traditional approach, first using BeyondCore's automated solution to rapidly evaluate a million variable combinations to detect statistically significant patterns and then using experts to form and test hypotheses that might explain the identified patterns. The complete results of this unique analysis will be presented on September 27 at Strata Rx. This session will focus primarily on the methodologies and technologies used in this analysis and the benefits of this approach.

WED 7: Distributed Query Processing with MonetDB
Mengmeng Chen, Masood Mortazavi / Futurewei Technologies and Fang Cao, Stephen Morgan, Aniket Adnaik, Yanchen Liu / Huawei

Column-oriented databases have been shown by many studies to excel at analytical workloads. For example, one of the column-store pioneers, MonetDB, has shown superior results on the TPC-H benchmark in recent years. On the other hand, although many distributed query processing engines exist for row-oriented databases, building one on a column-oriented database that still benefits from its superior single-node performance presents a series of new challenges, as well as opportunities, in both query execution and optimization. In this talk, we will discuss the challenges we face in building our distributed query processing engine on MonetDB and our solutions to them, such as trade-offs in exploiting execution parallelism and pipelining in a single MonetDB instance, distributed code generation without sabotaging MonetDB's existing optimizations, column materialization in distributed query execution, and distributed execution scheduling based on MonetDB's existing support. We believe these topics involve both aspects particular to MonetDB's unique implementation and problems general to any distributed query processing engine built on a column-oriented database.

WED 8: In-Situ Processing of Genomic Sequence Alignment Data
Florin Rusu, Yu Cheng / University of California, Merced

SAM/BAM files are the standard result of genomic sequence alignment. These files consist of a series of tuples -- known as reads -- encoding how fragments of the sequenced genome align relative to a reference genome. Processing of SAM/BAM files is done with specialized tools such as SAMtools and Picard, which provide a limited set of operations implemented as scans over the set of tuples. These tools provide neither data management support nor a declarative query language, and they are not extensible. They also perform poorly on large files due to a lack of parallelism. We introduce an in-situ processing system for genomic alignment data that provides the efficiency of a parallel database while operating directly over SAM/BAM files. Our main contribution is the design of a specialized storage manager for genomic data incorporating several novel ideas. First, the storage manager has a pipelined architecture in which reading and tokenizing/parsing are independent stages operating in parallel. Second, the storage manager supports multi-query processing by overlapping reading from the input file. And third, data read from files are transparently loaded into the system to improve subsequent processing. The storage manager reads data simultaneously from the database and SAM/BAM files while optimizing access to disk. When implemented inside the EXTASCID scientific data processing system, the proposed storage manager yields some unexpected insights into the implementation of SAM/BAM file processing tools. Parallel processing is useless for BAM files: although BAM files are compressed versions of SAM files, and thus considerably smaller, processing a SAM file is a factor of 2 to 3 faster than processing the corresponding BAM file. While SAM file processing is clearly I/O-bound, decompression is by far the limiting factor for BAM file processing.
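
The first idea, decoupling reading from tokenizing/parsing so the two stages overlap, can be sketched with two threads and a queue. The toy records and field layout below are a simplification of real SAM lines:

```python
import queue
import threading

# Toy SAM-like records: name, flag, reference, position (tab-separated).
raw_lines = ["r1\t0\tchr1\t100", "r2\t16\tchr1\t250"]

chunks = queue.Queue()
parsed = []

def reader():
    """Stage 1: stream raw lines from the file (simulated here)."""
    for line in raw_lines:
        chunks.put(line)
    chunks.put(None)  # end-of-input marker

def parser():
    """Stage 2: tokenize/parse, running concurrently with the reader."""
    while (line := chunks.get()) is not None:
        name, flag, ref, pos = line.split("\t")
        parsed.append((name, int(pos)))

t1 = threading.Thread(target=reader)
t2 = threading.Thread(target=parser)
t1.start(); t2.start()
t1.join(); t2.join()
```

In the real storage manager the stages overlap disk I/O with CPU-bound parsing; the sketch also makes the BAM finding intuitive, since a decompression stage inserted before parsing would dominate the pipeline.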

WED 9: I of T - Proxy Control
Justin Simonds / HP EG Americas Master Technologist

With the Internet of Things and event processing there is a growing need for machine-to-machine (M2M) integration. Most companies have non-integrated operational technology (OT), which can now be integrated with IT and made part of a company's Big Data architecture. Much of this data can be seen as sensor data that may require event processing depending on the alert. Likewise, control over these devices becomes possible, similar to reverse E911 service via broadcast or multicast. Companies (utility companies, for example) can provide enhanced big data services through the Internet of Things and M2M integration of smart meters. By combining its customer control base with external sources of information (e.g. weather), a utility company granted a proxy capability can make informed decisions for homeowners who are away (on vacation, in the hospital, on a business trip). This talk will discuss, by way of example, the ability of a homeowner to grant temporary 'proxy' control to a utility company, which can then monitor and control the home. When a storm passes through, the utility company can shut off automatic sprinklers in proxy-controlled homes, saving water, power, and money.