A Guide to Massive-Scale Data Processing in Practice. This unique hands-on guide shows you how to solve this and many other problems in large-scale data processing with simple, fun, and elegant tools that leverage Apache Hadoop. Part I explains how Hadoop and MapReduce work, while. Code for the book "Big Data for Chimps" from O'Reilly - Big Data for Chimps. Finding patterns in massive event streams can be difficult, but learning how to find them doesn't have to be. This unique hands-on guide shows you how to solve.

Big Data For Chimps Pdf

Language:English, French, Dutch
Published (Last):03.02.2016
ePub File Size:18.73 MB
PDF File Size:14.28 MB
Distribution:Free* [*Register to download]
Uploaded by: CLETA

7OIN7PSFTS // Big Data for Chimps # PDF. Big Data for Chimps. By Philip Kromer, Russell Jurney. To download Big Data for Chimps eBook, make sure you . Book Details. Book Name. Big Data for Chimps. Edition. 1st Edition. Category. Programming & IT. Type. [PDF|EPBU|AZW3|MOBI. ] PDF. ISBN. Big Data for Chimps - Ebook download as PDF File .pdf), Text File .txt) or read book online. A Seriously Fun Guide to Large-Scale Data Analytics in Practice.

The computation cannot be split into independent chunksyou may have to compute correlation between stocks in different chunks, if the chunks carry different stocks. If the data is split along the time line, you would still need to compute correlation between stock prices at different points of time, which may be in different chunks.

For iterative computations, Hadoop MR is not well-suited for two reasons. One is the overhead of fetching data from HDFS for each iteration which can be amortized by a distributed caching layer , and the other is the lack of long-lived MR jobs in Hadoop.

Typically, there is a termination condition check that must be executed outside of the MR job, so as to determine whether the computation is complete. This implies that new MR jobs need to be initialized for each iteration in Hadoopthe overhead of initialization could overwhelm computation for the iteration and could cause significant performance hits.

The other perspective of Hadoop suitability can be understood by looking at the characterization of the computation paradigms required for analytics on massive data sets, from the National Academies Press NRC They term the seven categories as seven giants in contrast with 15 the dwarf terminology that was used to characterize fundamental computational tasks in the super-computing literature Asanovic et al.

These are the seven giants: 1. Basic statistics: This category involves basic statistical operations such as computing the mean, median, and variance, as well as things like order statistics and counting. The operations are typically O N for N points and are typically embarrassingly parallel, so perfect for Hadoop. Linear algebraic computations: These computations involve linear systems, eigenvalue problems, inverses from problems such as linear regression, and Principal Component Analysis PCA.

Moreover, a formulation of multivariate statistics in matrix form is difficult to realize over Hadoop. Examples of this type include kernel PCA and kernel regression.

Generalized N-body problems: These are problems that involve distances, kernels, or other kinds of similarity between points or sets of points tuples.

Computational complexity is typically O N2 or even O N3. The typical problems include range searches, nearest neighbor search problems, and nonlinear dimension reduction methods.

The simpler solutions of N-body problems such as k-means clustering are solvable over Hadoop, but not the complex ones such as kernel PCA, kernel Support Vector Machines SVM , and kernel discriminant analysis. Graph theoretic computations: Problems that involve graph as the data or that can be modeled graphically fall into this category.

The computations on graph data include centrality, commute distances, and ranking. When the statistical model is a graph, graph search is important, as are computing probabilities which are operations known as inference.

Some graph theoretic computations that can be posed as linear algebra problems can be solved over Hadoop, within the limitations specified under giant 2.

Euclidean graph problems are hard to realize over Hadoop as they become generalized N-body problems. Moreover, major computational challenges arise when you are dealing with large sparse graphs; partitioning them across a cluster is hard. Optimizations: Optimization problems involve minimizing convex or maximizing concave a function that can be referred to as an objective, a loss, a cost, or an energy function.

These problems can be solved in various ways. Stochastic approaches are amenable to be implemented in Hadoop. Mahout has an implementation of stochastic gradient descent.

Linear or quadratic programming approaches are harder to realize over Hadoop, because they involve complex iterations and operations on large matrices, especially at high dimensions. One approach to solve optimization problems has been shown to be solvable on Hadoop, but by realizing a construct known as All-Reduce Agarwal et al.

However, this approach might not be fault-tolerant and might not be generalizable. Conjugate gradient descent CGD , due to its iterative nature, is also hard to realize over Hadoop. The work of Stephen Boyd and his colleagues from Stanford has precisely addressed this giant. Their paper Boyd et al. Integrations: The mathematical operation of integration of functions is important in big data analytics. They arise in Bayesian inference as well as in random effects models.

Quadrature approaches that are sufficient for low-dimensional integrals might be realizable on Hadoop, but not those for high-dimensional integration which arise in Bayesian inference approach for big data analytical problems. Most recent applications of big data deal with high-dimensional datathis is corroborated among others by Boyd et al.

MCMC is iterative in nature because the chain must converge to a stationary distribution, which might happen after several iterations only. Alignment problems: The alignment problems are those that involve matching between data objects or sets of objects. They occur in various domainsimage de-duplication, matching catalogs from different instruments in astronomy, multiple sequence alignments used in computational biology, and so on.

The simpler approaches in which the alignment problem can be posed as a linear algebra problem can be realized over Hadoop. The catalog cross-matching problem can be posed as a generalized N-body problem, and the discussion outlined earlier in point 3 applies.

The limitations of Hadoop and its lack of suitability for certain classes of applications have motivated some researchers to come up with alternatives. Researchers at the University of Berkeley have proposed Spark as one such alternativein other words, Spark could be seen as the next-generation data processing alternative to Hadoop in the big data space. In the previous seven giants categorization, Spark would be efficient for Complex linear algebraic problems giant 2 Generalized N-body problems giant 3 , such as kernel SVMs and kernel PCA Certain optimization problems giant 4 , for example, approaches involving CGD An effort has been made to apply Spark for another giant, namely, graph theoretic computations in GraphX Xin et al.

It would be an interesting area of further research to estimate the 17 efficiency of Spark for other classes of problems or other giants such as integrations and alignment problems.


Initial performance studies have shown that Spark can be times faster than Hadoop for certain applications. This book explores Spark as well as the other components of the Berkeley Data Analytics Stack BDAS , a data processing alternative to Hadoop, especially in the realm of big data analytics that involves realizing machine learning ML algorithms.

When using the term big data analytics, I refer to the capability to ask questions on large data sets and answer them appropriately, possibly by using ML techniques as the foundation. I will also discuss the alternatives to Spark in this spacesystems such as HaLoop and Twister. The other dimension for which the beyond-Hadoop thinking is required is for real-time analytics.

It can be inferred that Hadoop is basically a batch processing system and is not well suited for real-time computations. Consequently, if analytical algorithms are required to be run in real time or near real time, Storm from Twitter has emerged as an interesting alternative in this space, although there are other promising contenders, including S4 from Yahoo and Akka from Typesafe.

Storm has matured faster and has more production use cases than the others. Thus, I will discuss Storm in more detail in the later chapters of this bookthough I will also attempt a comparison with the other alternatives for real-time analytics.

The third dimension where beyond-Hadoop thinking is required is when there are specific complex data structures that need specialized processinga graph is one such example.

Twitter, Facebook, and LinkedIn, as well as a host of other social networking sites, have such graphs. They need to perform operations on the graphs, for example, searching for people you might know on LinkedIn or a graph search in Facebook Perry There have been some efforts to use Hadoop for graph processing, such as Intels GraphBuilder.

However, as outlined in the GraphBuilder paper Jain et al. GraphLab Low et al. By processing, I mean running page ranking or other ML algorithms on the graph. GraphBuilder can be used for constructing the graph, which can then be fed into GraphLab for processing. GraphLab is focused on giant 4, graph theoretic computations. The use of GraphLab for any of the other giants is an interesting topic of further research. The emerging focus of big data analytics is to make traditional techniques, such as market basket analysis, scale, and work on large data sets.

This is reflected in the approach of SAS and other traditional vendors to build Hadoop connectors. The other emerging approach for analytics focuses on new algorithms or techniques from ML and data mining for solving complex analytical problems, including those in video and real-time analytics.

Big Data for Chimps

My perspective is that Hadoop is just one such paradigm, with a whole new set of others that are emerging, 18 including Bulk Synchronous Parallel BSP -based paradigms and graph processing paradigms, which are more suited to realize iterative ML algorithms. The following discussion should help clarify the big data analytics spectrum, especially from an ML realization perspective.

This should help put in perspective some of the key aspects of the book and establish the beyond-Hadoop thinking along the three dimensions of real-time analytics, graph computations, and batch analytics that involve complex problems giants 2 through 7. Big Data Analytics: Evolution of Machine Learning Realizations I will explain the different paradigms available for implementing ML algorithms, both from the literature and from the open source community.

First of all, heres a view of the three generations of ML tools available today: 1. These allow deep analysis on smaller data setsdata sets that can fit the memory of the node on which the tool runs. These allow what I call a shallow analysis of big data.

These facilitate deeper analysis of big data. Recent efforts by traditional vendors such as SAS in-memory analytics also fall into this category.

However, not all of them can work on large data setslike terabytes or petabytes of datadue to scalability limitations limited by the nondistributed nature of the tool. In other words, they are vertically scalable you can increase the processing power of the node on which the tool runs , but not horizontally scalable not all of them can run on a cluster.

The first-generation tool vendors are addressing those limitations by building Hadoop connectors as well as providing clustering optionsmeaning that the vendors have made efforts to reengineer the tools such as R and SAS to scale horizontally. These tools are maturing fast and are open source especially Mahout. Mahout has a set of algorithms for clustering and classification, as well as a very good recommendation algorithm Konstan and Riedl Mahout can thus be said to work on big data, with a number of production use cases, mainly for the recommendation system.

I have also used Mahout in a production system for realizing recommendation algorithms in financial domain and found it to be scalable, though not without issues. I had to tweak the source significantly.

One observation about Mahout is that it implements only a smaller subset of ML algorithms over Hadooponly 25 algorithms are of production quality, with only 8 or 9 usable over Hadoop, meaning scalable over large data sets. These include the linear regression, linear SVM, the K-means clustering, and so forth. It does provide a fast sequential implementation of the logistic regression, with parallelized training. However, as several others have also noted see Quora. Overall, this book is not intended for Mahout bashing.

However, my point is that it is quite hard to implement certain ML algorithms including the kernel SVM and CGD note that Mahout has an implementation of stochastic gradient descent over Hadoop.

This has been pointed out by several others as wellfor instance, see the paper by Professor Srirama Srirama et al. What do I mean by iterative?

A set of entities that perform a certain computation, wait for results from neighbors or other entities, and start the next iteration.

I will explain these three primitives: daxpy is an operation that takes a vector x, multiplies it by a constant k, and adds another vector y to it; ddot computes the dot product of two vectors x and y; matmul multiplies a matrix by a vector and produces a vector output. In essence, the setup cost per iteration which includes reading from HDFS into memory overwhelms the computation for that iteration, leading to performance degradation in Hadoop MR.

In contrast, Twister distinguishes between static and variable data, allowing data to be in memory across MR iterations, as well as a combine phase for collecting all reduce phase outputs and, hence, performs significantly better.

The other second-generation tools are the traditional tools that have been scaled to work over Hadoop. The choices in this space include the work done by Revolution Analytics, among others, to scale R over Hadoop and the work to implement a scalable runtime over Hadoop for R programs Venkataraman et al.

The efforts in the third generation have been to look beyond Hadoop for analytics along different dimensions. I discuss the approaches along the three dimensions, namely, iterative ML algorithms, real-time analytics, and graph processing.

Researchers at the University of Berkeley have proposed Spark Zaharia et al. The main motivation for Spark was that the commonly used MR paradigm, while being suitable for some applications that can be expressed as acyclic data flows, was not suitable for other applications, such as those that need to reuse working sets across iterations. So they proposed a new paradigm for cluster computing that can provide similar guarantees or fault tolerance FT as MR but would also be suitable for iterative and interactive applications.

The Berkeley researchers have proposed BDAS as a collection of technologies that help in running data analytics tasks across a cluster of nodes.

The lowest-level component of the BDAS is Mesos, the cluster manager that helps in task allocation and resource management tasks of the cluster. The second component is the Tachyon file system built on top of Mesos. Tachyon provides a distributed file system abstraction and provides interfaces for file operations across the cluster. Spark, the computation paradigm, is realized over Tachyon and Mesos in a specific embodiment, although it could be realized without Tachyon and even without Mesos for clustering.

Zacharia et al. The HaLoop work Bu et al. Real-time Analytics The second dimension for beyond-Hadoop thinking comes from real-time analytics. Twitter from Storm has emerged as the best contender in this space. Storm is a scalable Complex Event Processing CEP engine that enables complex computations on event streams in real time. The components of a Storm cluster are Spouts that read data from various sources.

Bolts that process the data.

They run the computations on the streams. ML algorithms on the streams typically run here. When all design, develop, test and deploy work is done then you can finally start analyzing the data. All rights reserved. Datameer is a trademark of Datameer, Inc. Hadoop and the Hadoop elephant logo are trademarks of the Apache Software Foundation.

Other names may be trademarks of their respective owners. Ease of use Does the business need IT to Help? Can a business analyst use the tool to do analysis?

Big Data for Chimps

Data Integration Does system support native connectors to unstructured and semi-structured data sources e. Does it support flexible partitioning of the data so that it is easy to work with large amounts of data? Other names may be trademarks of their respective owners. Ease of use Does the business need IT to Help?

Can a business analyst use the tool to do analysis? Data Integration Does system support native connectors to unstructured and semi-structured data sources e. Does it support flexible partitioning of the data so that it is easy to work with large amounts of data? Does the solution support streaming of data so that users will have the most current data?

Does the solution provide data quality functions so that the data can be quickly normalized and transformed? Analytics Does the solution provide an intuitive environment e. Does the solution include pre-built analytic functions?You will see in a moment why we chose those letters. Both Kafka and Flume have evolved into general purpose solutions from their origins in high-scale server log transport but there are other use-case specific technologies.

It is not really random in the sense of nondeterministic. Computational complexity is typically O N2 or even O N3. AS gives the location and schema of your source data It ensures that your data is always available for use. Both of them are widely-adopted open source projects.