Big Data exploitation has the potential to revolutionize the analytic value proposition for organizations that are able to successfully harness these capabilities. However, the architectural components necessary for success in Big Data analytics are different from those used in traditional data warehousing. This tutorial will provide a framework for Big Data exploitation along with recommendations for architectural deployment of Big Data solutions. We will discuss tools and frameworks such as MapReduce, Hadoop, SQL++, and Sqoop, along with approaches to successfully deploy advanced technologies in this arena. The importance of the NoSQL and NoETL paradigms will be described along with the implementation techniques for realizing the advantages of these capabilities. We will explain which use cases are appropriate for which components of a Big Data ecosystem and how to architect interoperability among the various technology platforms used in a best-practices implementation. Our approach emphasizes the ability to quickly take advantage of emerging Big Data technologies with maximum leverage of existing skill sets and assets within an enterprise.
- Learn about the critical components of a Big Data ecosystem.
- Learn about the use cases that are most appropriate for Big Data solutions.
- Learn how to architect an optimized ecosystem for Big Data exploitation.
This tutorial will explain how to process very large data sets on industry-standard computers using Apache Hadoop. While Hadoop is well known for its batch processing capabilities using MapReduce, this session will also cover the real-time capabilities of complementary technologies such as Apache HBase and Impala.
- The history of Hadoop
- Writing a Java MapReduce program
- Common algorithms applied to Hadoop, such as indexing, classification, joining data sets, and graph processing
- Data analysis with Hive, Pig and Impala
- Overview of writing applications that use Apache HBase
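Before diving into the Hadoop APIs, it helps to see the programming model in miniature. The sketch below simulates the map, shuffle, and reduce phases of MapReduce in plain Python for a word-count job; it is an illustration of the model, not the Hadoop API.

```python
from collections import defaultdict

def map_fn(_, line):
    # Map phase: emit (word, 1) for every word in the input line.
    for word in line.split():
        yield word.lower(), 1

def reduce_fn(word, counts):
    # Reduce phase: sum the counts collected for each word.
    yield word, sum(counts)

def run_mapreduce(records, map_fn, reduce_fn):
    # Shuffle phase: group all mapper output by key.
    groups = defaultdict(list)
    for key, value in records:
        for k, v in map_fn(key, value):
            groups[k].append(v)
    result = {}
    for k, vs in sorted(groups.items()):
        for out_k, out_v in reduce_fn(k, vs):
            result[out_k] = out_v
    return result

lines = [(0, "big data big ideas"), (1, "data pipelines")]
counts = run_mapreduce(lines, map_fn, reduce_fn)
print(counts)
```

A real Hadoop job expresses the same two functions as a Mapper and Reducer class in Java, with the framework handling the shuffle across the cluster.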
In this tutorial, we will explore Druid, an open source, real-time data store designed to work with high volume, high dimension data. Druid enables fast aggregations and arbitrary filters, supports both batch and streaming data ingestion, and seamlessly connects with popular storage systems, including S3, HDFS, Cassandra, and more. We will go over how to use Druid for OLAP workflows, how to get your data into Druid, and how to develop visual dashboards on top of Druid.
After completing this session, attendees will be able to:
- Understand the design decisions and common pitfalls of building a real-time data store
- Set up and use Druid to ingest and query events as they happen in real time
- Use the query API and available query libraries (SQL, R, Ruby, Python, and JS)
- Build a simple data visualization dashboard using the query libraries
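As a taste of the query API, the snippet below assembles a Druid native timeseries query as JSON in Python. The data source name, dimension, and metric (`events`, `country`, `clicks`) are hypothetical placeholders, not part of any real deployment.

```python
import json

# Hypothetical data source "events" with a "clicks" metric
# and a "country" dimension.
query = {
    "queryType": "timeseries",
    "dataSource": "events",
    "granularity": "hour",
    "filter": {"type": "selector", "dimension": "country", "value": "US"},
    "aggregations": [
        {"type": "longSum", "name": "total_clicks", "fieldName": "clicks"}
    ],
    "intervals": ["2013-01-01T00:00:00/2013-01-02T00:00:00"],
}
body = json.dumps(query)
# In a live cluster this JSON body is POSTed to a Druid broker node
# with Content-Type: application/json.
print(body)
```

The SQL, R, Ruby, Python, and JS query libraries covered in the session wrap this same native JSON interface.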
Semantic technologies are increasingly being utilized to discover and establish new relationships between disparate datasets, model complex real world application domains, and obtain more complete information using logical inference and graph pattern queries. Industries as diverse as Life Sciences, Genetics, and Defense/Intelligence are increasingly relying on Semantic Technologies to satisfy their data analysis needs. This tutorial will serve as an introduction to basic semantic technologies such as the Web Ontology Language (OWL), Resource Description Framework (RDF), graph types, the semantic data model, the querying of ontological data using both SPARQL and SQL, semantic indexing of documents, graph visualization with the open source tool Cytoscape, and ontology editing using TopBraid Composer and Protege. We will also conduct a review of challenges for large graph implementations and use cases highlighting the capabilities of semantic technologies.
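To make the idea of a graph pattern query concrete, here is a minimal pure-Python analogue of matching a SPARQL basic graph pattern against RDF-style triples. The gene/disease triples and prefixes are invented for illustration and do not come from any real ontology.

```python
# Toy RDF-style triple store; names are illustrative only.
triples = [
    ("gene:BRCA1", "rdf:type", "bio:Gene"),
    ("gene:TP53", "rdf:type", "bio:Gene"),
    ("gene:BRCA1", "bio:associatedWith", "disease:BreastCancer"),
    ("gene:TP53", "bio:associatedWith", "disease:LiFraumeni"),
]

def match(pattern, triples):
    # A '?'-prefixed term is a variable; anything else must match exactly.
    results = []
    for triple in triples:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                binding[term] = value
            elif term != value:
                break
        else:
            results.append(binding)
    return results

# Analogue of: SELECT ?g ?d WHERE { ?g bio:associatedWith ?d }
hits = match(("?g", "bio:associatedWith", "?d"), triples)
print(hits)
```

A real SPARQL engine additionally joins multiple patterns, applies inference over the OWL ontology, and optimizes the query plan; the matching idea, however, is the same.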
This tutorial will introduce attendees to techniques for designing efficient MapReduce algorithms. We will learn concepts that are critical to those engaged in Hadoop programming and the big-data and cloud-computing community in general. The concepts covered include:
- Tradeoff between communication and computation cost
- One-round vs. many rounds for MapReduce
- Problems and solutions with load imbalances
- Iterative algorithms
To illustrate the concepts above, we shall study MapReduce algorithms for Hamming distance, similarity joins, relational joins, triangle finding, and matrix multiplication.
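The Hamming-distance problem above can be solved in a single MapReduce round: the mapper emits each string once per position with that position masked out, so two strings at distance exactly 1 land in the same reduce bucket. Below is a plain-Python simulation of that one-round algorithm, not Hadoop code.

```python
from collections import defaultdict
from itertools import combinations

def mapper(s):
    # Emit one key per position, with that position masked by '*'.
    for i in range(len(s)):
        yield (i, s[:i] + "*" + s[i + 1:]), s

def hamming_distance_one_pairs(strings):
    # Shuffle: group strings by masked key.
    groups = defaultdict(list)
    for s in strings:
        for key, value in mapper(s):
            groups[key].append(value)
    # Reduce: strings sharing a masked key differ only at the masked
    # position, so every distinct pair in a bucket is at distance 1.
    pairs = set()
    for bucket in groups.values():
        for a, b in combinations(bucket, 2):
            if a != b:
                pairs.add(tuple(sorted((a, b))))
    return pairs

pairs = hamming_distance_one_pairs(["0000", "0001", "0011", "1111"])
print(pairs)
```

Note the communication/computation tradeoff the tutorial discusses: each input string of length n is replicated n times during the shuffle in exchange for trivially cheap reducers.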
Text search is one of the most requested features for MongoDB, and 10gen has released an experimental text-search feature in v2.4. This tutorial will introduce attendees to MongoDB and go through the native implementation of text search. The tutorial will offer a short introduction to MongoDB and touch on schema design, query syntax, and indexing. Attendees will gain deep insight into the specific mechanics of MongoDB text search, including indexing, stemming, and stop words, as well as the current limitations of the text-search feature, by working through hands-on tasks. By the end of the tutorial, attendees will have a better grasp of how to build a simple application with MongoDB utilizing the text search features.
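The mechanics the tutorial covers (stop-word removal, stemming, and an inverted index) can be sketched in a few lines of Python. The suffix-stripping stemmer below is a deliberately crude stand-in; it illustrates the idea, not MongoDB's actual implementation.

```python
STOP_WORDS = {"the", "a", "is", "of", "and"}

def stem(word):
    # Crude suffix stripping; real engines use Snowball-style stemmers.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def build_index(docs):
    # Inverted index: stemmed term -> set of document ids.
    index = {}
    for doc_id, text in docs.items():
        for word in text.lower().split():
            if word in STOP_WORDS:
                continue
            index.setdefault(stem(word), set()).add(doc_id)
    return index

docs = {1: "searching the index", 2: "the index searches documents"}
index = build_index(docs)
print(index)
```

Because "searching" and "searches" stem to the same term, a query for either word finds both documents, which is exactly why stemming happens at index time rather than query time.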
This tutorial will explore data structures and algorithms for big databases. The topics include:
- Data structures including B-trees, Log Structured Merge Trees, and Streaming B-trees.
- Approximate Query Membership data structures including Bloom filters and cascade filters.
- Index design, including covering indexes.
- Getting good performance in memory.
- Cache efficiency, including both cache-aware and cache-oblivious data structures and algorithms.
These algorithms and data structures are used in NoSQL implementations such as MongoDB and HBase, as well as in SQL-oriented implementations such as MySQL and TokuDB.
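As an example of the approximate-membership structures listed above, here is a minimal Bloom filter sketch. The size and hash-count parameters are arbitrary; a production filter derives them from the expected item count and target false-positive rate.

```python
import hashlib

class BloomFilter:
    # Approximate membership: no false negatives,
    # tunable false-positive rate.
    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("hbase")
```

Storage engines use such filters to skip disk reads for keys that are certainly absent, which is why they pair naturally with the write-optimized trees covered earlier in the list.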
Do the Math: Using SciDB and SciDB-R for scalable, in-database complex analytics
Paul Brown & Alex Poliakov / Paradigm4
SciDB is an open source, array-based analytical database system for use in scientific and commercial applications that involve large, diverse multi-dimensional data sets and scalable complex analytics. It runs on commodity hardware grids or in a cloud.
SciDB goes beyond other distributed parallel processing frameworks in that it also supports fast execution of non-embarrassingly parallel operations like large-scale sparse and dense linear algebra.
SciDB-R is an R front end that lets users develop natively in R while using SciDB both as a storage backend and as an engine for seamlessly offloading large computations.
This tutorial starts with an overview of the SciDB architecture, the array data model, the programming and query interfaces, math library, and data management capabilities. We also show how to access SciDB data from R and how to do large-scale math operations from SciDB-R.
The tutorial includes hands-on development exercises based on use cases from computational genomics, location-based services, and others, to show how to:
- Load data from various standard file formats (CSV, HDFS, HDF5, FITS, FASTA, BAM, etc.)
- Assess schema designs
- Extend SciDB with user-defined types, user-defined aggregates and operators
- Integrate math functionality from packages on CRAN as well as R’s graphing and visualization tools
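To make the "non-embarrassingly parallel" point concrete, the sketch below multiplies two sparse matrices stored as coordinate-to-value maps, the kind of operation SciDB's math library distributes across a cluster. This is a plain-Python illustration of the computation's structure, not SciDB's implementation.

```python
from collections import defaultdict

def sparse_matmul(a, b):
    # a, b: sparse matrices as {(row, col): value} dicts.
    # Each output cell combines a full row of a with a column of b, so
    # partial products must be regrouped (shuffled) by output coordinate;
    # the work cannot be split into fully independent per-cell tasks.
    b_by_row = defaultdict(list)
    for (k, j), v in b.items():
        b_by_row[k].append((j, v))
    result = defaultdict(float)
    for (i, k), av in a.items():
        for j, bv in b_by_row.get(k, []):
            result[(i, j)] += av * bv
    return dict(result)

a = {(0, 0): 2.0, (0, 2): 1.0, (1, 1): 3.0}
b = {(0, 0): 1.0, (2, 0): 4.0, (1, 1): 5.0}
c = sparse_matmul(a, b)
print(c)
```

In SciDB-R this regrouping happens inside the database, so only the small result ever crosses the wire back to the R session.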