Getting Started with Apache Crunch

The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines. Its goal is to make pipelines that are composed of many user-defined functions simple to write, easy to test, and efficient to run. Running on top of Hadoop MapReduce, the Apache Crunch library is a simple Java API for tasks like joining and data aggregation that are tedious to implement on plain MapReduce. The APIs are especially useful when processing data that does not fit naturally into relational model, such as time series, serialized object formats like protocol buffers or Avro records, and HBase rows and columns. For more information, you can visit Apache Crunch Homepage.

In this blog post, I'll show you how to write a word counting programming that you might familiar with if you know Hadoop, it's Hello World in Hadoop world. Firstly, let's look at the basic concepts in Crunch, including its type system and pipelined architecture.

  • Data Pipelines:
The crunch pipeline is represented with the Pipeline interface and MRPipeline implementation class, as you can see in below class outline:
Getting Started with Apache Crunch
 As you can see, the pipeline class contains methods to read and write collections. These collection classes have methods to perform operations on the contents of collections to produce a new result collection. Therefore, a pipeline consists of the definition of one or more input collections, a number of operations on these intermediary collections, and the writing of the collections to data sinks. The execution of all the actual pipeline operations is delayed until the run or done methods are called, at which point Crunch translates the pipeline into one or more MapReduce jobs and starts their execution.
The Pipeline interface defines a readTextFile method that takes in a String and returns a PCollection of Strings. In addition to text files, the library supports reading data from SequenceFiles and Avro container files, via the SequenceFileSource and AvroFileSource classes defined in the org.apache.crunch.io package. Note that each PCollection is a reference to a source of data, no data is actually loaded into a PCollection on the client machine.
  • Collections:
In Crunch the collection interface represent a distributed set of elements. A collection can be created in one of two ways: as a result of a read method invocation on the Pipeline class, or as a result of an operation on another collection. There are three types of collections in Crunch, as below figure shown:Getting Started with Apache Crunch
 Collection classes contains a number of methods, which operate on the contents of the collections, these operations are executed in either the map or reduce phase. Among them, the PGroupedTable is a special collection that's a result of calling groupByKey method on the PTable, this results in a reduce phase being executed to perform the grouping.

  • Data Functions:
Functions can be applied to the collections that you just saw using parallelDo method in the collection interface. All the parallelDo methods take a DoFn inplementation, which perform the actual operation on the collection in MapReduce, you can see the DoFn class in below figure:
Getting Started with Apache Crunch
 As you can see, all DoFn instances are required to be java.io.Serializable. This is a key aspect of the library's design: once a particular DoFn is assigned to the Map or Reduce stage of a MapReduce job, all of the state of that DoFn is serialized so that it may be distributed to all of the nodes in the Hadoop cluster that will be running that task. There are two important implications of this for developers: 

相关推荐