XLDB-2013 Abstracts


Big-Data Analytics Usability
Magdalena Balazinska, Associate Professor / University of Washington

Increasingly, users of Big Data analytics systems are experts in some domain area (e.g., the domain sciences) but are not necessarily experts in large-scale data processing. Today's Big Data analytics systems, however, remain difficult to use, even for experts. In the CQMS project, we address important barriers to making Big Data analytics more seamless. In this talk, we will present a quick overview of some of the key challenges and our recent findings in the area of Big Data analytics usability. In particular, we will focus on problems related to composing SQL queries and tuning their execution in a shared-nothing cluster. First, we will present SnipSuggest, an autocompletion tool for SQL queries that eases query formulation by providing on-the-go, context-aware assistance during query composition. Second, we will describe PerfXplain, a tool for explaining the performance of a MapReduce job running on a shared-nothing cluster. PerfXplain aims to help users understand the performance they are getting from their analytics engine.
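To make "context-aware assistance" concrete, the following is a minimal, hypothetical sketch of log-driven snippet suggestion: candidate clauses are ranked by how often they co-occurred, in a workload of past queries, with the clauses already present in the user's partial query. The class name, feature encoding, and scoring rule are illustrative assumptions, not SnipSuggest's actual implementation.

from collections import Counter
from itertools import combinations

class SnippetSuggester:
    """Hypothetical log-driven, context-aware SQL snippet suggester (illustrative only)."""

    def __init__(self):
        self.feature_counts = Counter()  # how often each clause appears in the log
        self.pair_counts = Counter()     # how often two clauses co-occur in one query

    def add_query(self, features):
        """Record one past query as a set of clause-level features."""
        features = set(features)
        self.feature_counts.update(features)
        for a, b in combinations(sorted(features), 2):
            self.pair_counts[(a, b)] += 1
            self.pair_counts[(b, a)] += 1

    def suggest(self, context, k=3):
        """Rank candidate clauses by P(candidate | context clause), summed over the context."""
        scores = Counter()
        for ctx in context:
            total = self.feature_counts[ctx]
            if not total:
                continue
            for (a, b), n in self.pair_counts.items():
                if a == ctx and b not in context:
                    scores[b] += n / total
        return [feature for feature, _ in scores.most_common(k)]

# Example: after the user has typed "SELECT ... FROM photoobj", suggest likely next clauses.
query_log = [
    {"FROM photoobj", "WHERE type = 6", "WHERE ra BETWEEN"},
    {"FROM photoobj", "WHERE type = 6"},
    {"FROM specobj", "WHERE z > 0.1"},
]
suggester = SnippetSuggester()
for q in query_log:
    suggester.add_query(q)
print(suggester.suggest({"FROM photoobj"}))  # ['WHERE type = 6', 'WHERE ra BETWEEN']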


Reusing & Building Without Rediscovering for LSST
Jacek Becla / SLAC National Accelerator Laboratory

The scale and complexity of LSST data analysis require state-of-the-art solutions that no off-the-shelf system offers. The talk will discuss the design choices behind the dozens-of-petabytes LSST database, explaining what is reused off the shelf, what is built custom, and how we decided.


Hardware and Its Effect on Software: Prepare for Massive Change
Per Brashers / Yttibrium Consulting

The industry is at a series of inflection points, and they are all converging at nearly the same time. Storage, networks, memory systems, and even the CPU will all be different in a few short years. On top of that, the cost of running these massive data centers is driving a whole new way of handling equipment and cooling it. This talk will focus on a few select areas that affect application designers now, and will set aside many of the trends that will only become a problem in years to come.


The Now and Later of Large Scale Computing at Chevron
Peter Breunig / Chevron

Finding, developing, and extracting oil and gas from the subsurface is, and has always been, a data-driven exercise. Throughout the continuum from the first seismic experiments in the early part of the 20th century, through the first nuclear/acoustic/electromagnetic well logs, core analysis, production testing, 4D seismic, and beyond, petro-technical professionals have been dealing with “relatively” large data sets. As hardware and software technologies, sensing technologies, and math-physics have advanced, the ratio of data to compute power has remained relatively constant, because the goal is always increasing the resolution of the subsurface. The talk will describe the path taken, where it might go in the future, and how it might connect to opportunities in the XLDB arena.


Big Data Tools:  Game Changer for Mainstream Enterprises
Aashish Chandra / MetaScale

Shopping expectations of technology-savvy consumers have changed, and retailers need to reinvent themselves continuously through technology and innovation. For mainstream enterprises, this means a transformation to become nimble and agile for competitive advantage. On its transformational journey to become an open-source, cloud-based, agile enterprise, Sears Holdings developed pioneering solutions and products to address its structured data as part of the overall enterprise data strategy, and launched an IT services company, MetaScale, to offer these and other patent-pending Big Data solutions to external companies.

In this presentation, Aashish Chandra, DVP of Sears Holdings & GM of MetaScale, will focus on Big Data tools and modern data architectures, sharing the journey and the challenges of driving business agility and generating additional revenue and profitability for Sears.


The MySQL Ecosystem at Scale
Jeremy Cole / Google, Inc.

I will start with a discussion of MySQL's roots, some common high-scale usage scenarios, its strengths and weaknesses at scale, and the future of the MySQL ecosystem. I will then present a brief overview of what it takes (and what doesn't work) to get to Twitter's scale with MySQL.


Evolution of Systems Infrastructure at Google
Jeffrey Dean / Google, Inc.

In this talk I'll discuss Google's hardware and software infrastructure, including systems such as GFS, MapReduce, BigTable, and Spanner.  I'll describe the circumstances around how some of these projects were started, including their motivation, initial uses within Google, and the working style within the teams that built the initial versions of these systems.


In Search of the Higgs Boson the "Home-Grown" Way
Dirk Duellmann / CERN

The scale and timeline of the LHC project have not always made it easy to construct a computing system from pre-existing components. In this talk, I will describe the challenges, failures (and successes) over two decades of preparation, and the reasoning that led to the system that has been used to find the Higgs boson. The talk will include some background on storage and analysis concepts used in high energy physics, which have more recently gained popularity in the wider Big Data community, and will describe ongoing R&D activities.


Small Data at Peta Scale
Harrison Fisk and Domas Mituzas / Facebook

Sharing and interaction on Facebook result in a massive social graph dataset with trillions of edges, all of which must be accessible at web latencies in multiple datacenter regions. This talk by Facebook database engineers will discuss how building a storage system for such a dataset has required work in various disciplines:

  • Caching, both inside and outside of database servers
  • Sharding, schema management and other data design choices
  • Replication for high availability and geographic distribution
  • Data access management in a multi-tenant environment
  • DBA automation and large-scale orchestration of database servers
  • Hardware choices and optimizations

The current implementation uses MySQL/InnoDB as the storage backend.
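As a very rough illustration of two of the building blocks listed above, sharding and caching outside the database, here is a minimal sketch of hash-based shard selection combined with a cache-aside read path and write-invalidate. The key format, shard count, and in-memory stand-ins for the cache tier and the MySQL/InnoDB instances are assumptions for illustration, not a description of Facebook's actual system.

import hashlib

NUM_SHARDS = 4
cache = {}                                    # stand-in for an external cache tier
shards = [dict() for _ in range(NUM_SHARDS)]  # stand-ins for MySQL/InnoDB instances

def shard_for(key: str) -> int:
    """Map a key deterministically to the shard that owns it."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def read(key: str):
    """Cache-aside read: try the cache, fall back to the owning shard, then populate the cache."""
    if key in cache:
        return cache[key]
    value = shards[shard_for(key)].get(key)
    if value is not None:
        cache[key] = value
    return value

def write(key: str, value):
    """Write to the owning shard and invalidate any cached copy."""
    shards[shard_for(key)][key] = value
    cache.pop(key, None)

# Example: store and read back one attribute of the social graph.
write("user:42:friend_count", 317)
print(read("user:42:friend_count"))  # first read comes from the shard and warms the cache
print(read("user:42:friend_count"))  # second read is served from the cache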


XRootD - A Decade Later
Andrew Hanushevsky / SLAC National Accelerator Laboratory

The XRootD data clustering service, widely used for High Energy Physics and Astrophysics data, is celebrating its tenth anniversary. It was originally developed for the BaBar experiment and deployed in 2003 to provide a unique, low-cost, scalable data access mechanism. Today, XRootD continues to provide the same service but is also used in novel ways, such as data federations, MySQL clustering, and fronting high-availability data servers. This talk looks at the technology in terms of reinvention: that is, which peculiar features of XRootD allowed it to survive technologically and remain relevant a decade later.


Transforming Retail with Multichannel Analytics
Anu Jain / Target

Consumers are constantly using new channels, new data sources, new media, and new applications to determine the best place, price, and time to shop. Retailing has evolved from a single-channel to a multichannel model, and multichannel analytics have extended the reach of retail analytics. The focus of this talk will be on how retail is being transformed by multichannel analytics, along with key strategies, enablers, and challenges.


Typical Data at the ExxonMobil Chemical Company
Steve Lacy / ExxonMobil Chemical

The ExxonMobil Chemical Company manufactures the building blocks for a wide range of products, from packaging materials and plastic bottles to automobile bumpers, synthetic rubber, solvents, and countless consumer goods. The road from a research idea to one of these commercial products is a long one. Things that work at small scale in the lab do not necessarily work at commercial scale. This talk will explore a polymer scale-up workflow from catalyst synthesis to large-scale pilot plant testing. The associated data spans heterogeneous scientific data (discrete measurements, time series graphs, spectra, images, and video) and business administration data (chemical inventory, work order management). The focus will be on scientific data: its volume, heterogeneity, and the associated informatics challenges.


Comparative Analysis in an Expanding Universe of Microbial Genome & Metagenome Datasets 
Victor Markowitz / Lawrence Berkeley National Laboratory

Microbial genome and microbial community metagenome analysis are expected to lead to advances in healthcare, environmental cleanup, agriculture, industrial processes, and alternative energy production. With the rapid growth in the number of microbial genome and microbial community metagenome sequence datasets, comparative data analysis plays a critical role in understanding the biology of newly sequenced organisms and communities. The effectiveness of comparative analysis depends on the availability of an integrated data context, powerful analytical tools, the quality of genome and metagenome data interpretation (annotation), and the level of detail in cellular reconstruction.

The IMG system (http://img.jgi.doe.gov) provides the data management environment required for studying microbial genomes and metagenomes in the context of an expanding universe of genome and metagenome datasets. At the core of the IMG system is a data warehouse that contains genome and metagenome datasets provided by scientific users integrated with bacterial, archaeal, eukaryotic, and viral genomes from public genome sequence archives and a large set of engineered, environmental and host-associated metagenome datasets. Genome and metagenome datasets are processed daily using IMG’s microbial genome and metagenome annotation pipelines and integrated weekly into the IMG data warehouse. IMG’s maintenance involves continuously adjusting the underlying data management infrastructure in order to cope with the rapid increase in the number and size of the genome and metagenome datasets and to accommodate new data types.


Data-Intensive Global Science - Zettabytes to Petabytes to Revelation
Richard Mount / ATLAS experiment at the CERN LHC

Science projects are increasingly challenged by the volume and complexity of their data. Particle physics still leads the way in pain and glory, but is being joined by other sciences in this uncomfortable leadership position. I will explore why data-intensive big science can be even more challenging than big commerce, and how zettabytes become petabytes and eventually revelation. Data reduction by throwing most of it away is easy, but data under analysis has a tendency to expand or explode in volume, especially in large collaborative projects. I will, of course, make limited attempts to predict the future of data-intensive big science.


Beyond Hadoop - Building the Analytics Infrastructure at Facebook
Ravi Murthy / Facebook

Data analysis is at the heart of product development at Facebook. Over the past few years, the analytics infrastructure has evolved rapidly to meet the demands of ever-increasing data scale and a variety of use cases. This talk will briefly present the key systems for batch and interactive analytics, real-time stream processing, and large-scale graph processing. Many of the components are derived from popular open source projects, with critical extensions to deal with Facebook's scale and to integrate with the Facebook ecosystem.

We will also talk about Facebook's participation in the open source community and our experiences working with different open source projects.


Big Data in Electronic Securities Trading and Its Implications for DBMS
Alex Nazaruk / Middle Lake Partners, LLC

Over the past decade, global securities markets have changed dramatically. The evolution of market structure, in combination with advances in computer technology, led to the emergence of electronic securities trading. Securities transactions that used to be conducted in person and over the phone are now predominantly executed by automated trading systems. This has resulted in a vast increase in securities trading volumes and an even greater increase in the number of trading orders.

In this talk we give an introduction to electronic securities trading. We present a typical automated securities trading system and the various types of data it uses. We also analyze the foundations of the quantitative trading model development process. We close with the implications that Big Data in electronic trading has for DBMS requirements in capital markets.


Funding Big Data DBMS Technology at NSF: Research, Development, and Deployment 
Frank Olken / National Science Foundation

Science is increasingly dominated by data-intensive research involving very large data sets. The National Science Foundation supports research, development, and deployment of Big Data DBMS and related technologies through several programs. Research in DBMS technology, information retrieval, data mining, and machine learning is supported primarily by the CISE IIS core program and the CISE Big Data program. These programs fund basic research and the creation of research prototype software. The MPS CDS&E program supports more applied data-intensive computing technologies. Development of DBMS and related technologies is supported by the CISE ACI DIBBS program, with an emphasis on software tools for scientific research. This program funds the development of novel software tools for data preservation, access, and interoperability that are useful across the scientific research community. Deployment of production DBMS systems in support of large instruments (telescopes, accelerators, etc.) or large data acquisition programs (such as NEON) is supported by the domain sciences divisions at NSF, such as Astronomy, Physics, and the Biology and Geosciences Directorates. These large projects typically prefer more mature DBMS technologies. The various DBMS-related programs have very different goals, proposal evaluation criteria, panel structures, funding patterns, and staffing. In this talk I survey the major NSF programs and their different approaches to DBMS technologies.


Make it Big by Working Fast and Small: A VC's View of Large-Scale Success
Greg Papadopoulos / New Enterprise Associates

Despite popular perception, monster start-up successes do not happen overnight. Creating truly enduring value is a long process of building a company from a core handful of people into organizations and products that scale to the market. One of the most successful patterns is a founding team that has a really big vision but works toward it in sharply focused, agile, constantly adapting ways. We'll explore some examples and lessons, and end with some observations about the opportunities in Extremely Big Data.


Funding Large-Scale Software Projects 
Mike Stonebraker / M.I.T.

In this talk we use SciDB as an example of a large-scale system software project. This project has now existed for more than four years and has required in excess of $5M in invested capital. We recount the way the project got started, the funding mechanisms we considered, how we explored these options, and the path we eventually took (venture capital financing) to provide the necessary resources.


A Vision and Research Program in "Big Data" 
Mike Stonebraker / M.I.T.

In 2011, Intel conducted a national search for its next ISTC (Intel Science and Technology Center). From more than 150 applicants, Intel selected a research program on “big data” directed by Sam Madden and Michael Stonebraker at M.I.T., with external participation from principal investigators at the University of Washington, Brown University, the University of Tennessee, Carnegie Mellon University, Stanford University, Portland State University, and the University of California at Santa Barbara.

This talk presents the current (2013) vision for the center, the projects we are working on, and results to date.


The SDSS SkyServer and Beyond: Why "The Not-So-Little Engine that Could" is Still Chugging Along
Ani Thakar / Johns Hopkins University

The Sloan Digital Sky Survey Catalog Archive Server and its Web portal, the SkyServer, have been in operation since 2001 and are still going strong. The SDSS was the first survey to usher in the big data era, increasing the amount of catalog data available to astronomers by a factor of a thousand over the first few years of its operation. The multi-terabyte SDSS catalog archive has been a huge success by any measure, but that does not mean that it did not have its share of growing pains, mid-life crises, and near-death experiences. This talk covers the good, e.g., the extremely fruitful collaboration with Jim Gray that resulted in a new paradigm of data-intensive science with databases, and the extensive reusable building blocks that we built along the way; the bad, e.g., the lack of a plan to distribute data to mirror sites, and the impact of the unprecedented data sizes on operational schedules and resources; and the ugly, e.g., having to ditch our original object-oriented DBMS platform after spending several man-years of development effort on it. The focus throughout is on what can be learned from both the successes and the failures of the SDSS experience, particularly in the context of the buy vs. build dilemma.