Thursday, July 02, 2015

Apache Pig

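Apache Pig is a platform for analyzing large data sets. Programs are written in a high-level language called Pig Latin, which Pig runs either on a single machine or on a Hadoop cluster.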

Step 1
  Task: Download Apache Pig 0.15.0
  Command: Point browser to

Step 2
  Task: Move the archive to $HADOOP_HOME
  Command: mv ~/Downloads/pig-0.15.0.tar.gz $HADOOP_HOME

Step 3
  Task: Extract pig-0.15.0.tar.gz
  Command: tar -xzvf pig-0.15.0.tar.gz (run from $HADOOP_HOME)
  Result: Should see the pig-0.15.0 directory

Step 4
  Task: Add PIG_HOME and update PATH
  Command: Add the following lines to ~/.bash_profile:

    export PIG_HOME=$HADOOP_HOME/pig-0.15.0
    export PATH=$PATH:$PIG_HOME/bin

  Then reload the profile:

    source ~/.bash_profile

Step 5
  Task: Make sure Pig is set up properly
  Command: pig -version
  Result: Should see Apache Pig version 0.15.0 on the console

Step 6
  Task: Now you can play with Pig

The Pig execution environment has two modes:

  • Local mode: All scripts are run on a single machine. Hadoop MapReduce and HDFS are not required.

  • Hadoop mode: Also called MapReduce mode. All scripts are run on a given Hadoop cluster.
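
As a rough sketch, the mode is chosen with the -x flag when Pig starts. The helper name pig_cmd and the script name myscript.pig are hypothetical; the function only prints the command line, so it can be tried on a machine where Pig is not installed.

```shell
#!/bin/sh
# Sketch: the -x flag selects the execution mode when Pig starts.
# pig_cmd and myscript.pig are hypothetical names used for illustration.

pig_cmd() {
  mode="$1"
  script="$2"
  case "$mode" in
    local)     echo "pig -x local $script" ;;      # single machine, no HDFS needed
    mapreduce) echo "pig -x mapreduce $script" ;;  # runs on the Hadoop cluster
    *)         echo "unknown mode: $mode" >&2; return 1 ;;
  esac
}

# Print the command line for each mode.
pig_cmd local myscript.pig
pig_cmd mapreduce myscript.pig
```

When no -x flag is given, Pig uses MapReduce mode.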

Pig programs can be run in three different ways, all of them compatible with local and Hadoop mode:

  • Script: Simply a file containing Pig Latin commands, identified by the .pig suffix (for example, file.pig or myscript.pig). The commands are interpreted by Pig and executed in sequential order.

  • Grunt: Grunt is a command interpreter. You can type Pig Latin on the grunt command line and Grunt will execute the command on your behalf. This is very useful for prototyping and “what if” scenarios.

  • Embedded: Pig programs can be executed as part of a Java program.

Now that we have installed Apache Pig, we can play with it. I used it to filter large log files and get the information that I needed; in this case, I was interested only in the errors. You can create a file with the .pig extension containing the following script:

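-- Load the file named by the 'input' parameter (fields are tab-separated by default).
messages = LOAD '$input';
-- Keep the records whose first field contains "error"; MATCHES must match the whole field.
out = FILTER messages BY $0 MATCHES '.*error.*';
-- Write the matching records to the location named by the 'output' parameter.
STORE out INTO '$output';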
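
The $input and $output placeholders are filled in when the script is launched. The following is a minimal sketch of one way to do that with -param, using the pattern simplified to '.*error.*'. The names filter_errors.pig, app.log and errors_out are hypothetical, and the pig command is skipped when Pig is not on the PATH.

```shell
#!/bin/sh
# Sketch: save the filter script to a file and run it with parameter substitution.
# filter_errors.pig, app.log and errors_out are hypothetical names.

workdir=$(mktemp -d)

# The quoted 'EOF' keeps the shell from expanding $input, $output and $0,
# so Pig receives them untouched.
cat > "$workdir/filter_errors.pig" <<'EOF'
messages = LOAD '$input';
out = FILTER messages BY $0 MATCHES '.*error.*';
STORE out INTO '$output';
EOF

# A tiny sample log; the lines contain no tabs, so each whole line is field $0.
printf 'startup complete\ndisk error on /dev/sda\nrequest served\n' > "$workdir/app.log"

# Each -param supplies the value for the matching $name in the script.
if command -v pig >/dev/null 2>&1; then
  pig -x local \
      -param input="$workdir/app.log" \
      -param output="$workdir/errors_out" \
      "$workdir/filter_errors.pig"
else
  echo "pig not found on PATH; script written to $workdir/filter_errors.pig"
fi
```

If the run succeeds, Pig writes the matching lines to part files inside the errors_out directory.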

Here is a short list of operators that you can use in Pig:

FILTER    Select a set of tuples from a relation based on a condition.
FOREACH   Iterate over the tuples of a relation, generating a data transformation.
GROUP     Group the data in one or more relations.
JOIN      Join two or more relations (inner or outer join).
LOAD      Load data from the file system.
ORDER     Sort a relation based on one or more fields.
SPLIT     Partition a relation into two or more relations.
STORE     Store data in the file system.
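
To show how several of these operators fit together, here is a small sketch that counts log records per level. The names count_levels.pig, levels.log and level_counts are hypothetical, and the tab-separated "level, message" layout is an assumed log format, not one taken from my logs. The pig command is skipped when Pig is not on the PATH.

```shell
#!/bin/sh
# Sketch: count log records per level with LOAD, GROUP, FOREACH, ORDER and STORE.
# count_levels.pig, levels.log and level_counts are hypothetical names, and the
# tab-separated "level<TAB>message" layout is an assumed log format.

workdir=$(mktemp -d)

# The quoted 'EOF' keeps the shell from expanding $input and $output.
cat > "$workdir/count_levels.pig" <<'EOF'
logs     = LOAD '$input' AS (level:chararray, message:chararray);
by_level = GROUP logs BY level;
counts   = FOREACH by_level GENERATE group AS level, COUNT(logs) AS total;
sorted   = ORDER counts BY total DESC;
STORE sorted INTO '$output';
EOF

# Sample data: two ERROR records and one INFO record.
printf 'ERROR\tdisk failure\nINFO\tstartup complete\nERROR\ttimeout\n' > "$workdir/levels.log"

if command -v pig >/dev/null 2>&1; then
  pig -x local \
      -param input="$workdir/levels.log" \
      -param output="$workdir/level_counts" \
      "$workdir/count_levels.pig"
else
  echo "pig not found on PATH; script written to $workdir/count_levels.pig"
fi
```

For the sample file above, the result should list ERROR with a count of 2 ahead of INFO with a count of 1.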
