XLDB-2013 Tutorials

Monday, September 9, 2013
7:30 AM Continental Breakfast (registration starts, Knight Management Center)
8:30 AM Morning Session (coffee break 10:20-10:40 AM)
A Optimizing Your Big Data Ecosystem Steve Brobst / Teradata & Tom Fastner / eBay Room G101
B Using Hadoop for Batch and Real-Time Big Data Applications Sarah Sproehnle / Cloudera Room G102
C Hands on with Druid, An Open Source Real-time Data Store Fangjin Yang and Xavier Léauté / Metamarkets Room M104
D Semantic and Graph Technologies John Hax & Zhe Wu & Matthew Perry / Oracle Room M105
12:30 PM Lunch Break (take-away lunch served)
1:00 PM Afternoon Session (coffee break 3:20-3:40 PM)
A Designing Good Algorithms for MapReduce Foto N. Afrati & Anish Das Sarma / Google, Magdalena Balazinska & Bill Howe / UW, Semih Salihoglu & Jeffrey D. Ullman / Stanford Room G101
B MongoDB Text Search Paul Pedersen / 10gen Room M104
C Data Structures and Algorithms for Big Databases Michael A. Bender / SUNY Stony Brook & Tokutek, Bradley C. Kuszmaul / MIT & Tokutek Room G102
D Do the Math: Using SciDB and SciDB-R for scalable in-database complex analytics Paul Brown & Alex Poliakov / Paradigm4 Room M105
5:00 PM Adjourn

Optimizing Your Big Data Ecosystem
Stephen Brobst, Chief Technology Officer, Teradata
Tom Fastner, Senior Member of Technical Staff, eBay

Big Data exploitation has the potential to revolutionize the analytic value proposition for organizations that successfully harness these capabilities. However, the architectural components necessary for success in Big Data analytics differ from those used in traditional data warehousing. This tutorial will provide a framework for Big Data exploitation along with recommendations for architectural deployment of Big Data solutions. We will discuss tools and frameworks such as MapReduce, Hadoop, SQL++, and Sqoop, along with approaches to successfully deploying advanced technologies in this arena. The importance of the NoSQL and NoETL paradigms will be described, along with implementation techniques for realizing the advantages of these capabilities. We will explain which use cases are appropriate for which components of a Big Data ecosystem and how to architect interoperability among the various technology platforms used in a best-practices implementation. Our approach emphasizes the ability to quickly take advantage of emerging Big Data technologies while maximizing leverage of existing skill sets and assets within an enterprise.

  • Learn about the critical components of a Big Data ecosystem.
  • Learn about the use cases that are most appropriate for Big Data solutions.

  • Learn how to architect an optimized ecosystem for Big Data exploitation.

Using Hadoop for Batch and Real-Time Big Data Applications
Sarah Sproehnle / Cloudera

This tutorial will explain how to process very large data sets on industry-standard computers using Apache Hadoop. While Hadoop is well known for its batch processing capabilities using MapReduce, this session will also cover the real-time capabilities of complementary technologies such as Apache HBase and Impala.

Topics include:

  • The history of Hadoop
  • Writing a Java MapReduce program

  • Common algorithms applied to Hadoop, such as indexing, classification, joining data sets, and graph processing
  • Data analysis with Hive, Pig, and Impala
  • Overview of writing applications that use Apache HBase
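A Java MapReduce program implements a mapper and a reducer against the Hadoop API; as a language-neutral illustration of the pattern the tutorial teaches, the three phases of a word count can be mimicked in plain Python (this sketch is not the Hadoop API itself):

```python
from collections import defaultdict

# Illustrative sketch of the MapReduce word-count pattern in plain Python.
# A real Hadoop job would implement Mapper/Reducer classes in Java; this
# only mimics the map, shuffle, and reduce phases locally.

def map_phase(lines):
    """Mapper: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    """Shuffle: group intermediate values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reducer: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data big tables", "big queries"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts["big"])  # 3
```

In Hadoop the shuffle is performed by the framework between the map and reduce tasks; here a dictionary stands in for it.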

Hands on with Druid, An Open Source Real-time Data Store
Fangjin Yang & Xavier Léauté / Metamarkets

In this tutorial, we will explore Druid, an open source, real-time data store designed to work with high-volume, high-dimension data. Druid enables fast aggregations and arbitrary filters, supports both batch and streaming data ingestion, and connects seamlessly with popular storage systems, including S3, HDFS, and Cassandra. We will go over how to use Druid for OLAP workflows, how to get your data into Druid, and how to develop visual dashboards on top of Druid.

After completing this session, attendees will be able to:

  •  Understand the design decisions and common pitfalls of building a real-time data store
  •  Set up and use Druid to ingest and query events as they happen, in real time
  •  Use the query API and available query libraries (SQL, R, Ruby, Python, and JS)
  •  Build a simple data visualization dashboard using the query libraries
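Druid queries are JSON documents POSTed to a broker node. A minimal sketch of a timeseries query built in Python follows; the data source name, metric names, and interval below are hypothetical, not part of the tutorial materials:

```python
import json

# Sketch of a Druid timeseries query. The dataSource ("edits"), the
# "added" metric, and the interval are invented for illustration; in
# practice this JSON is POSTed to the Druid broker.
query = {
    "queryType": "timeseries",
    "dataSource": "edits",                 # hypothetical data source
    "granularity": "hour",
    "intervals": ["2013-09-01/2013-09-02"],
    "aggregations": [
        {"type": "count", "name": "rows"},
        {"type": "doubleSum", "name": "added", "fieldName": "added"},
    ],
}
print(json.dumps(query, indent=2))
```

The same query shape is what the SQL, R, Ruby, Python, and JS query libraries construct on the caller's behalf.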

Semantic and Graph Technologies
John Hax & Zhe Wu & Matthew Perry / Oracle

Semantic technologies are increasingly being utilized to discover and establish new relationships between disparate datasets, model complex real-world application domains, and obtain more complete information using logical inference and graph pattern queries. Industries as diverse as Life Sciences, Genetics, and Defense/Intelligence are increasingly relying on semantic technologies to satisfy their data analysis needs. This tutorial will serve as an introduction to basic semantic technologies: the Web Ontology Language (OWL), the Resource Description Framework (RDF), graph types, the semantic data model, the querying of ontological data using both SPARQL and SQL, semantic indexing of documents, graph visualization with the open source tool Cytoscape, and ontology editing using TopBraid Composer and Protégé. We will also review challenges for large graph implementations and use cases highlighting the capabilities of semantic technologies.
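To make graph pattern queries concrete, here is a toy triple store and a single-pattern match in plain Python. Real deployments would run SPARQL against an RDF store; the triples below are invented purely for illustration:

```python
# A toy RDF-style triple store: each fact is a (subject, predicate,
# object) triple. The gene/disease data is hypothetical.
triples = [
    ("gene:BRCA1", "rdf:type", "bio:Gene"),
    ("gene:TP53",  "rdf:type", "bio:Gene"),
    ("gene:BRCA1", "bio:associatedWith", "disease:BreastCancer"),
]

def match(pattern):
    """Return triples matching an (s, p, o) pattern; None is a variable."""
    s, p, o = pattern
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# Like the SPARQL pattern:  ?s rdf:type bio:Gene
genes = [t[0] for t in match((None, "rdf:type", "bio:Gene"))]
```

A SPARQL engine generalizes this single-pattern match to conjunctions of patterns with shared variables, plus inference over the ontology.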

Designing Good Algorithms for MapReduce
Foto N. Afrati / Google
Magdalena Balazinska / UW
Anish Das Sarma / Google
Bill Howe / UW
Semih Salihoglu / Stanford
Jeffrey D. Ullman / Stanford

This tutorial will introduce attendees to techniques for designing efficient MapReduce algorithms. We will learn concepts that are critical to those engaged in Hadoop programming and the big-data and cloud-computing community in general. The concepts covered include:

  • Tradeoff between communication and computation cost
  • One-round vs. many rounds for MapReduce
  • Problems and solutions with load imbalances
  • Iterative algorithms

To illustrate the concepts above, we will study MapReduce algorithms for Hamming distance, similarity joins, relational joins, triangle finding, and matrix multiplication.
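The communication/computation tradeoff is already visible in the Hamming-distance problem. One well-known one-round scheme (not necessarily the exact algorithm presented in the tutorial) replicates each length-b string b times, keying each copy on the string with one position wildcarded, so that every distance-1 pair meets at exactly one reducer: higher communication buys a single round. A plain-Python sketch, with the shuffle simulated by a dictionary:

```python
from collections import defaultdict
from itertools import combinations

def mapper(s):
    """Emit b copies of s, each keyed on s with one position wildcarded."""
    for i in range(len(s)):
        yield (s[:i] + "*" + s[i + 1:], s)

def find_distance_one_pairs(strings):
    buckets = defaultdict(list)
    for s in strings:                    # map: b keys per input string
        for key, value in mapper(s):
            buckets[key].append(value)   # shuffle: group by wildcard key
    pairs = set()
    for group in buckets.values():       # reduce: pair up co-located strings
        for a, b in combinations(sorted(group), 2):
            pairs.add((a, b))
    return pairs

pairs = find_distance_one_pairs(["0000", "0001", "0011", "1111"])
```

The communication cost here is b records per input string; a multi-round algorithm could replicate less at the price of extra rounds, which is precisely the tradeoff the tutorial examines.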

For more information, see the tutorial summary.

MongoDB Text Search
Paul Pedersen / 10gen

Text search is one of the most requested features for MongoDB, and 10gen has been actively working on an experimental text-search feature that was released in v2.4. This tutorial will introduce attendees to MongoDB and walk through the native implementation of text search. The tutorial will offer a short introduction to MongoDB and touch on schema design, query syntax, and indexing. Attendees will gain deep insights into the specific mechanics of MongoDB text search, including indexing, stemming, and stop words, as well as the current limitations of the feature, by working through hands-on tasks. By the end of the tutorial, attendees will have a better grasp of how to build a simple application with MongoDB using the text search features.
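The pieces a text index combines can be sketched in a few lines of plain Python: stop-word removal, stemming, and an inverted index. MongoDB's implementation does this internally with real stemming rules; the toy stemmer and the documents below are invented for illustration:

```python
# Toy illustration of what a text index does. MongoDB uses proper
# language-aware stemming; this crude suffix-stripper is a stand-in.
STOP_WORDS = {"the", "a", "and", "of"}

def stem(word):
    """Crude suffix-stripping stemmer (illustration only)."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def index_documents(docs):
    """Build an inverted index: stemmed term -> set of document ids."""
    inverted = {}
    for doc_id, text in docs.items():
        for word in text.lower().split():
            if word in STOP_WORDS:
                continue
            inverted.setdefault(stem(word), set()).add(doc_id)
    return inverted

docs = {1: "Indexing the collection", 2: "Text indexes and queries"}
inverted = index_documents(docs)
```

Because "indexing" and "indexes" stem to the same term, a search for either word finds both documents, while stop words never enter the index at all.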

Data Structures and Algorithms for Big Databases
Michael A. Bender / SUNY Stony Brook
Bradley C. Kuszmaul / MIT

This tutorial will explore data structures and algorithms for big databases. The topics include:

  • Data structures, including B-trees, log-structured merge trees, and streaming B-trees
  • Approximate membership query data structures, including Bloom filters and cascade filters
  • Index design, including covering indexes
  • Getting good performance in memory
  • Cache efficiency, including both cache-aware and cache-oblivious data structures and algorithms

These algorithms and data structures are used both in NoSQL implementations such as MongoDB and HBase, and in SQL-oriented implementations such as MySQL and TokuDB.
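As a taste of the approximate membership query material, here is a minimal Bloom filter sketch in Python (parameters m and k chosen arbitrarily for illustration; production filters size them from the expected item count and target false-positive rate):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash probes over an m-bit array.
    Lookups may return false positives but never false negatives."""

    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = bytearray(m)  # one byte per bit, for simplicity

    def _positions(self, item):
        # Derive k probe positions by salting a cryptographic hash.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
for key in ("alpha", "beta", "gamma"):
    bf.add(key)
```

Every inserted key is guaranteed to test positive; keys never inserted test negative except with a small, tunable false-positive probability, which is what makes Bloom filters useful for skipping disk reads in the storage engines above.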

Do the Math: Using SciDB and SciDB-R for scalable, in-database complex analytics
Paul Brown & Alex Poliakov / Paradigm4

SciDB is an open source, array-based analytical database system for use in scientific and commercial applications that involve large, diverse multi-dimensional data sets and scalable complex analytics. It runs on commodity hardware grids or in a cloud.

SciDB goes beyond other distributed parallel processing frameworks in that it also supports fast execution of non-embarrassingly parallel operations like large-scale sparse and dense linear algebra.
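To see why large-scale linear algebra is not embarrassingly parallel, consider blocked matrix multiplication: each output block needs input blocks from an entire row of A and an entire column of B, so workers must exchange blocks rather than compute independently. A plain-Python sketch with a hypothetical 2x2 blocking (not SciDB's actual execution plan):

```python
# Blocked matrix multiply over a 2x2 grid of blocks. Computing block
# C[i][j] needs A[i][k] and B[k][j] for every k, so no worker owns all
# its inputs: block exchange (communication) is unavoidable.

def mat_mul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_add(A, B):
    n = len(A)
    return [[A[i][j] + B[i][j] for j in range(n)] for i in range(n)]

def blocked_multiply(A_blocks, B_blocks):
    """A_blocks, B_blocks: 2x2 grids of equally sized square blocks."""
    return [[mat_add(mat_mul(A_blocks[i][0], B_blocks[0][j]),
                     mat_mul(A_blocks[i][1], B_blocks[1][j]))
             for j in range(2)] for i in range(2)]

# The 2x2 matrix [[1, 2], [3, 4]] split into 1x1 blocks, squared.
A_blocks = [[[[1]], [[2]]], [[[3]], [[4]]]]
C_blocks = blocked_multiply(A_blocks, A_blocks)
```

Contrast this with a per-cell filter or aggregate, where each chunk can be processed in isolation; that distinction is what separates embarrassingly parallel operators from the linear-algebra operators SciDB also supports.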

SciDB-R is an R front end that lets users develop natively in R while using SciDB both as a storage back end and as an engine for seamlessly offloading large computations.

This tutorial starts with an overview of the SciDB architecture, the array data model, the programming and query interfaces, math library, and data management capabilities. We also show how to access SciDB data from R and how to do large-scale math operations from SciDB-R.

The tutorial includes hands-on development exercises based on use cases from computational genomics, location-based services, and others, to show how to:

  • Load data from various standard file formats (CSV, HDFS, HDF5, FITS, FASTA, BAM, etc.)
  • Assess schema designs
  • Extend SciDB with user-defined types, aggregates, and operators
  • Integrate math functionality from packages on CRAN as well as R's graphing and visualization tools