Wednesday, December 10, 2014

Sentiment Analysis on Twitter with R


In the previous post we explained how to install R on Ubuntu. R offers a wide variety of options for doing interesting and fun things, and this post shows you precisely how to do some of them.

1. How to get data from Twitter?

The first thing to do is to get some data from Twitter.

There are two primary ways to obtain it. In order of complexity, these are:

a) Using the R package "twitteR"
b) Using the R package "XML"


2. Using the R package "twitteR"

You don't have to download it from a website; you can install it directly from within R:


> install.packages('twitteR', dependencies=T)

You then have to select the CRAN mirror you want to download from and click OK.
R will now download and install the package. If you see some errors, maybe this article can help you.


Then we have to load it into our current session with:

> library(twitteR)
Loading required package: ROAuth
Loading required package: RCurl
Loading required package: bitops
Loading required package: rjson

> library(plyr)
Error in library(plyr) : there is no package called ‘plyr’

Try setting your repo to a different mirror like this:

> options(repos="http://streaming.stat.iastate.edu/CRAN")

or use any other mirror of your choice.

Then try loading plyr:

> install.packages("plyr")
> library("plyr") 


> options(repos="http://cran.rstudio.com/bin/linux/ubuntu precise/")


3. Twitter authentication

First we need to create an app at Twitter.


Go to https://apps.twitter.com and log in with your Twitter Account.

Once you have created your application...

Go back to R and type in the following lines:

> reqURL <- "https://api.twitter.com/oauth/request_token"
> accessURL <- "https://api.twitter.com/oauth/access_token"
> authURL <- "https://api.twitter.com/oauth/authorize"
> consumerKey <- "yourconsumerkey"
> consumerSecret <- "yourconsumersecret"

Replace yourconsumerkey and yourconsumersecret with the values shown on your app page on Twitter, which should still be open in your web browser.

> twitCred <- OAuthFactory$new(consumerKey=consumerKey, consumerSecret=consumerSecret, requestURL=reqURL, accessURL=accessURL, authURL=authURL)
> download.file(url="http://curl.haxx.se/ca/cacert.pem", destfile="cacert.pem")
> twitCred$handshake(cainfo="cacert.pem")

You should see something like this:

To enable the connection, please direct your web browser to:
https://api.twitter.com/oauth/authorize?oauth_token=xxxxxxxxxxxxxxxxxxxxxx
When complete, record the PIN given to you and provide it here:

After authorizing the app in your browser and entering the PIN back into R, register the credentials for the session:

> registerTwitterOAuth(twitCred)

4. Processing tweets data via twitteR

Let's collect some tweets containing the term "C.I.A torture"

# collect tweets in english containing 'C.I.A torture'
> tweets = searchTwitter("C.I.A torture", n=200, cainfo="cacert.pem")

To be able to analyze our tweets, we have to extract their text and save it into the variable tweets_content by typing:

> tweets_content = laply(tweets,function(t)t$getText())

We also need lists of positive and negative words. You can find them here.

After downloading the word lists, load them into variables by typing:

> neg= scan('/path/negative-words.txt', what='character', comment.char=';')
> pos= scan('/path/positive-words.txt', what='character', comment.char=';')

> install.packages("stringr")

Now we have to define a small scoring function written by Jeffrey Breen that analyzes our words.

Just copy-paste the following lines and hit enter:

#function to calculate number of words in each category within a sentence
> score.sentiment = function(sentences, pos.words, neg.words, .progress='none')
{
    require(plyr)
    require(stringr)
     
    # we got a vector of sentences. plyr will handle a list
    # or a vector as an "l" for us
    # we want a simple array ("a") of scores back, so we use 
    # "l" + "a" + "ply" = "laply":
    scores = laply(sentences, function(sentence, pos.words, neg.words) {
         
        # clean up sentences with R's regex-driven global substitute, gsub():
        sentence = gsub('[[:punct:]]', '', sentence)
        sentence = gsub('[[:cntrl:]]', '', sentence)
        sentence = gsub('\\d+', '', sentence)
        # and convert to lower case:
        sentence = tolower(sentence)

        # split into words. str_split is in the stringr package
        word.list = str_split(sentence, '\\s+')
        # sometimes a list() is one level of hierarchy too much
        words = unlist(word.list)

        # compare our words to the dictionaries of positive & negative terms
        pos.matches = match(words, pos.words)
        neg.matches = match(words, neg.words)
     
        # match() returns the position of the matched term or NA
        # we just want a TRUE/FALSE:
        pos.matches = !is.na(pos.matches)
        neg.matches = !is.na(neg.matches)

        # and conveniently enough, TRUE/FALSE will be treated as 1/0 by sum():
        score = sum(pos.matches) - sum(neg.matches)

        return(score)
    }, pos.words, neg.words, .progress=.progress )

    scores.df = data.frame(score=scores, text=sentences)
    return(scores.df)
}


> analysis = score.sentiment(tweets_content, pos, neg)

Each tweet now gets an integer score: the number of positive words it contains minus the number of negative ones. A rough interpretation of the scores:

Very Negative (score -5 or -4)
Negative (score -3, -2, or -1)
Positive (score 1, 2, or 3)
Very Positive (score 4 or 5)
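
One way to map the raw scores onto these categories, including a "neutral" bucket for a score of 0, is with cut(). This is an illustrative sketch, not part of the original script:

> analysis$sentiment = cut(analysis$score,
+                          breaks=c(-Inf, -4, -1, 0, 3, Inf),
+                          labels=c("very negative", "negative", "neutral", "positive", "very positive"))
> table(analysis$sentiment)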


You can get a table by typing:

>  table(analysis$score)

Or the mean by typing:

>  mean(analysis$score)

Or get a histogram with:

>  hist(analysis$score)
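
It can also be handy to eyeball the most extreme tweets. A quick sketch using only base R and the data frame produced above:

> analysis = analysis[order(analysis$score), ]
> head(analysis)   # the most negative tweets
> tail(analysis)   # the most positive tweets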




R Statistical Computing: Installing in Ubuntu



R is a free software environment for statistical computation and graphics.  It consists of a language plus a run-time environment with graphics, a debugger, access to certain system functions, and the ability to run programs stored in script files.

Steps to install R on Ubuntu:

1. Uninstall Previous R-base installation

$ sudo apt-get remove r-base-core

2. Update the sources.list file

$ sudo gedit /etc/apt/sources.list

Add the following line: deb http://cran.rstudio.com/bin/linux/ubuntu precise/

3. Add the public keys

$ sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E084DAB9
$ sudo add-apt-repository ppa:marutter/rdev

4. Install R-base

$ sudo apt-get update
$ sudo apt-get upgrade

$ sudo apt-get install r-base

5. Launch R-base

$ R

You should see something like this:

R version 3.1.2 (2014-10-31) -- "Pumpkin Helmet"
Copyright (C) 2014 The R Foundation for Statistical Computing
Platform: i686-pc-linux-gnu (32-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

>

If you like this post, you could be interested in this other post: Sentiment Analysis on Twitter with R

Tuesday, December 09, 2014

Configuring a single-node Storm cluster


Apache Storm is a distributed real-time computation system for processing large, fast streams of data, adding real-time data processing to Apache Hadoop.
Storm was developed by BackType, a social analytics company that was later acquired by Twitter. You can read more about it in this tutorial.

Storm installation can be separated into three parts as follows.


1. ZOOKEEPER CLUSTER INSTALLATION 

ZooKeeper is the coordinator for the Storm cluster; the interaction between the nimbus and the worker nodes happens through ZooKeeper.

Download the ZooKeeper setup:

$ wget http://www.eng.lsu.edu/mirrors/apache/zookeeper/stable/zookeeper-3.4.6.tar.gz
$ tar -xvf zookeeper-3.4.6.tar.gz
$ mv zookeeper-3.4.6 zookeeper

Optionally :
    a)  Add ZOOKEEPER_HOME under .bashrc
    b)  Add ZOOKEEPER_HOME/bin to the PATH

export ZOOKEEPER_HOME=/home/hduser/zookeeper
export PATH=$ZOOKEEPER_HOME/bin:$PATH


Create the data folder and update conf/zoo.cfg to point to it. By default dataDir is set to a folder under /tmp, which is cleared on every boot. The rest of the default settings are good enough.

$ mkdir zookeeper-data/   
$ cd zookeeper/conf
$ sudo mv zoo_sample.cfg zoo.cfg
$ sudo nano zoo.cfg


The relevant settings in zoo.cfg are:

tickTime: the basic time unit in milliseconds used by ZooKeeper. It is used for heartbeats, and the minimum session timeout is twice the tickTime.

dataDir: the location to store the in-memory database snapshots and, unless specified otherwise, the transaction log of updates to the database.

clientPort: the port to listen on for client connections.

Point dataDir at the data folder created above:

dataDir=/home/hduser/zookeeper-data
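
For reference, a minimal single-node zoo.cfg is just the defaults from zoo_sample.cfg with dataDir changed (the values below are those defaults, so double-check them against your own zoo_sample.cfg):

tickTime=2000
initLimit=10
syncLimit=5
dataDir=/home/hduser/zookeeper-data
clientPort=2181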


Now your ZooKeeper cluster is ready to start. Verify that you are able to start the ZooKeeper server:

$ cd ..
$ bin/zkServer.sh start

JMX enabled by default
Using config: /home/hduser/zookeeper/bin/../conf/zoo.cfg
Starting zookeeper ... ./zkServer.sh: line 109: ./zookeeper.out: Permission denied
STARTED

I resolved this error by telling zookeeper where I wanted the log file to be placed.

$ sudo nano bin/zkEnv.sh

Add this assignment at the top of the file:

ZOO_LOG_DIR=/var/log/zookeeper

Then create that directory:

$ sudo mkdir /var/log/zookeeper
$ sudo chown zookeeper /var/log/zookeeper

$ bin/zkServer.sh start
JMX enabled by default
Using config: /home/hduser/zookeeper/bin/../conf/zoo.cfg
Starting zookeeper ... STARTED

The problem is resolved!!!

$ jps
4117 Jps
3982 QuorumPeerMain



2. INSTALL NATIVE DEPENDENCIES

Storm internally uses ZeroMQ. Download the ZeroMQ code, then compile and install it. Be careful with the versions.

$ wget http://download.zeromq.org/zeromq-4.0.5.tar.gz
$ tar -xzf zeromq-4.0.5.tar.gz
$ mv zeromq-4.0.5 zeromq
$ cd zeromq
$ ./configure
$ make
$ sudo make install

Install the git and libtool packages from the terminal. These are the prerequisites for the next step.

$ sudo apt-get install libtool git

Download the code for jzmq. These are the Java bindings for zeromq. Compile and install it.

$ sudo git clone https://github.com/nathanmarz/jzmq.git
$ cd jzmq

$ sed -i 's/classdist_noinst.stamp/classnoinst.stamp/g' src/Makefile.am
$ ./autogen.sh
$ ./configure
$ make
$ sudo make install


3. STORM INSTALLATION


Now we are all set with the installation of Storm. Download the latest Storm and extract it. 


$ wget 'http://people.apache.org/~ptgoetz/apache-storm-0.9.3-rc1/apache-storm-0.9.3-rc1.tar.gz'
$ tar -xzvf apache-storm-0.9.3-rc1.tar.gz

$ mv apache-storm-0.9.3-rc1 storm

Now you need to configure Storm, so edit the Storm configuration file called ‘storm.yaml’, which is present in the ‘conf’ folder of the extracted Storm root folder.

$ nano conf/storm.yaml

storm.zookeeper.servers:
    - "localhost"

storm.local.dir: "/home/hduser/storm/data"

nimbus.host: "localhost"

supervisor.slots.ports:
    - 6700
    - 6701


4. RUN STORM TOPOLOGY 

Start the cluster


A) Start the Zookeeper cluster

To start the ZooKeeper server, go to the ‘bin’ directory of the ZooKeeper installation and execute the following command:

$ sudo sh zkServer.sh start



B) Start the Storm daemons

The nimbus service is similar to the JobTracker and the supervisor service is similar to the TaskTracker in Hadoop. More details about the Storm terminology are specified here.

$ bin/storm nimbus

$ bin/storm supervisor



C) Start the Storm UI

$ bin/storm ui

Use the Web UI to check the logs for any exceptions. Go to the Storm UI at http://localhost:8080.



5. UPLOAD TOPOLOGY

To upload a topology to the Storm cluster, go to the ‘bin’ directory of the Storm installation and execute the following command:

$ storm jar <path-to-topology-jar> <class-with-the-main> <arg1> <arg2> <argN>

where:

<path-to-topology-jar>: the complete path to the compiled jar containing your topology code and all its libraries.

<class-with-the-main>: the class in the jar file whose main method runs the StormSubmitter.

<arg1> <arg2> <argN>: the remaining arguments are the parameters passed to your main method.


With Storm running on a single node, it's time to execute the sample code. Get the sample code from Git:

$ git clone https://github.com/nathanmarz/storm-starter.git

$ cd storm-starter

Package the code. After a successful build, storm-starter-*.jar will be created in the target folder.

$ mvn -f m2-pom.xml package

Execute the WordCountTopology example. The job is submitted to Storm and control returns immediately. The last parameter, WordCount, is the topology name, which can be observed in the Storm UI. Check the logs for any exceptions.

$ bin/storm jar /home/hduser/storm-starter/target/storm-starter-0.0.1-SNAPSHOT-jar-with-dependencies.jar storm.starter.WordCountTopology WordCount

You can use the following command to check the src code of word counting:

~/storm-starter/src/jvm/storm/starter$ less WordCountTopology.java

Friday, December 05, 2014

Analyse Tweets using Flume, Hadoop and Hive



In this post we will get Tweets using Flume and save them into HDFS for later analysis. Twitter exposes an API to get the Tweets; the service is free, but it requires the user to register. We will quickly summarize how to get data into HDFS using Flume and then start doing some analytics using Hive.



1. Twitter API

You need to create a Twitter app to have the consumer key, consumer secret, access token, and access token secret.

2.  Configure Flume

Assuming that Hadoop, Hive and Flume have already been installed and configured (see previous posts), download the flume-sources-1.0-SNAPSHOT.jar.

From command line (assume flume-sources-1.0-SNAPSHOT.jar is in your ~):

$ sudo cp ~/flume-sources-1.0-SNAPSHOT.jar /usr/lib/flume

Add it to the flume class path as shown below in the conf/flume-env.sh file:

FLUME_CLASSPATH="/usr/lib/flume/flume-sources-1.0-SNAPSHOT.jar"

The jar contains the java classes to pull the Tweets and save them into HDFS.

3. Configure Agents

The conf/flume.conf file should have the agent's components (source, channel and sink) defined as below:


TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS

TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = <consumerKey>
TwitterAgent.sources.Twitter.consumerSecret = <consumerSecret>
TwitterAgent.sources.Twitter.accessToken = <accessToken>
TwitterAgent.sources.Twitter.accessTokenSecret = <accessTokenSecret>

TwitterAgent.sources.Twitter.keywords = hadoop, big data, analytics, bigdata, cloudera, data science, data scientiest, business intelligence, mapreduce, data warehouse, data warehousing, mahout, hbase, nosql, newsql, businessintelligence, cloudcomputing

TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:54310/user/flume/tweets/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000

TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 100

The consumerKey, consumerSecret, accessToken and accessTokenSecret have to be replaced with those obtained from here. And TwitterAgent.sinks.HDFS.hdfs.path should point to the NameNode and the location in HDFS where the tweets will be stored.

4. Start Flume with the following command

$ bin/flume-ng agent --conf ./conf/ -f conf/flume.conf -Dflume.root.logger=DEBUG,console -n TwitterAgent

Maybe you are going to see an error similar to this one:

AM ERROR org.apache.flume.lifecycle.LifecycleSupervisor
Unable to start EventDrivenSourceRunner: { source:com.cloudera.flume.source.TwitterSource{name:Twitter,state:IDLE} } - Exception follows.
java.lang.NoSuchMethodError: twitter4j.FilterQuery.setIncludeEntities(Z)Ltwitter4j/FilterQuery;
at com.cloudera.flume.source.TwitterSource.start(TwitterSource.java:139)
at org.apache.flume.source.EventDrivenSourceRunner.start(EventDrivenSourceRunner.java:44)
at org.apache.flume.lifecycle.LifecycleSupervisor$MonitorRunnable.run(LifecycleSupervisor.java:251)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:204)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
1:08:39.826 AM WARN org.apache.flume.lifecycle.LifecycleSupervisor
Component EventDrivenSourceRunner: { source:com.cloudera.flume.source.TwitterSource{name:Twitter,state:STOP} } stopped, since it could not besuccessfully started due to missing dependencies


If that is the case, then you must do the following:

You need to recompile flume-sources-1.0-SNAPSHOT.jar from https://github.com/cloudera/cdh-twitter-example

Install Maven, then clone the cdh-twitter-example repository.

$ cd flume-sources

$ mvn package

$ cd ..

Copy the new .jar into /usr/lib/flume.

This problem appeared when twitter4j was updated from version 2.2.6 to 3.x: the method setIncludeEntities was removed, and the prebuilt JAR is not up to date.

By default, the NameNode web interface (HDFS layer) is available at http://localhost:50070/. There you can see the tweets, in this case in the folder /user/flume/tweets.


5.  Configure Hive


$ cd /home/hduser/hive/

Modify the conf/hive-site.xml to include the locations of the NameNode and the JobTracker as below

<configuration>
     <property>
         <name>fs.default.name</name>
         <value>hdfs://localhost:54310</value>
     </property>
     <property>
         <name>mapred.job.tracker</name>
         <value>localhost:54311</value>
     </property>
</configuration>

Download hive-serdes-1.0-SNAPSHOT.jar to the lib directory in Hive. Twitter returns Tweets in the JSON format, and this library will help Hive understand JSON.

Start the Hive shell using the hive command and register the hive-serdes-1.0-SNAPSHOT.jar file downloaded earlier.

Edit the file hive-env.sh and add:

export HIVE_AUX_JARS_PATH="/home/hduser/hive/lib/hive-serdes-1.0-SNAPSHOT.jar"

Or you can add the JAR directly from the Hive shell:

hive> ADD JAR /home/hduser/hive/lib/hive-serdes-1.0-SNAPSHOT.jar;

6. Now, create the tweets table in Hive

CREATE EXTERNAL TABLE tweets (
   id BIGINT,
   created_at STRING,
   source STRING,
   favorited BOOLEAN,
   retweet_count INT,
   retweeted_status STRUCT<
      text:STRING,
      user:STRUCT<screen_name:STRING,name:STRING>>,
   entities STRUCT<
      urls:ARRAY<STRUCT<expanded_url:STRING>>,
      user_mentions:ARRAY<STRUCT<screen_name:STRING,name:STRING>>,
      hashtags:ARRAY<STRUCT<text:STRING>>>,
   text STRING,
   user STRUCT<
      screen_name:STRING,
      name:STRING,
      friends_count:INT,
      followers_count:INT,
      statuses_count:INT,
      verified:BOOLEAN,
      utc_offset:INT,
      time_zone:STRING>,
   in_reply_to_screen_name STRING
)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
LOCATION '/user/flume/tweets';

7. Playing with Hive.

Now that we have the data in HDFS and the table created in Hive, let's run some queries in Hive.

One way to determine who is the most influential person in a particular field is to figure out whose tweets are retweeted the most.

$ hive
hive>

Give enough time for Flume to collect Tweets from Twitter to HDFS and then run the below query in Hive to determine the most influential person.

hive> SELECT t.retweeted_screen_name, sum(retweets) AS total_retweets, count(*) AS tweet_count
    > FROM (SELECT retweeted_status.user.screen_name AS retweeted_screen_name,
    >              retweeted_status.text,
    >              max(retweet_count) AS retweets
    >       FROM tweets
    >       GROUP BY retweeted_status.user.screen_name, retweeted_status.text) t
    > GROUP BY t.retweeted_screen_name
    > ORDER BY total_retweets DESC
    > LIMIT 10;


Similarly, to find which users have the most followers, the query below helps.

hive> SELECT user.screen_name, user.followers_count c FROM tweets ORDER BY c DESC LIMIT 10;

If you have read this post, maybe you are also interested in this article.

Thursday, December 04, 2014

Apache Flume


Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store. It has a simple and flexible architecture based on streaming data flows.




Flume is configured by defining endpoints in a data flow called sources and sinks. The source produces events (e.g., from the Twitter Streaming API) and the sink writes the events out to a location. Between the source and the sink there is a channel; the source sends data to the sink through the channel, as in the minimal example below.
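
As a minimal illustration (the agent, source, channel and sink names below are made up, and the netcat source and logger sink are just stand-ins for real endpoints), an agent configuration wires the three pieces together like this:

# example.conf: one agent with a netcat source, a memory channel and a logger sink
agent1.sources = src1
agent1.channels = ch1
agent1.sinks = sink1

# the source listens on a local port and turns each line of input into an event
agent1.sources.src1.type = netcat
agent1.sources.src1.bind = localhost
agent1.sources.src1.port = 44444
agent1.sources.src1.channels = ch1

# the channel buffers events in memory between source and sink
agent1.channels.ch1.type = memory
agent1.channels.ch1.capacity = 1000

# the sink simply logs each event
agent1.sinks.sink1.type = logger
agent1.sinks.sink1.channel = ch1

Such an agent would be started with something like: bin/flume-ng agent --conf conf --conf-file example.conf --name agent1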

Installation

1. Download last stable release of Apache Flume

$ sudo wget http://www.apache.org/dyn/closer.cgi/flume/1.5.2/apache-flume-1.5.2-bin.tar.gz

2. Create the Flume directory hierarchy:

$ tar -xzf apache-flume-1.5.2-bin.tar.gz
$ mv apache-flume-1.5.2-bin flume
$ sudo mv flume/ /usr/lib/
$ sudo chmod -R 777 /usr/lib/flume
$ cd /usr/lib/flume

3. Configuration 

$ nano ~/.bashrc

Add these lines to .bashrc:

#BEGIN CONFIGURATION FLUME
export FLUME_HOME=/usr/lib/flume
export FLUME_CONF_DIR=$FLUME_HOME/conf
export FLUME_CLASSPATH=$FLUME_CONF_DIR
export PATH=$FLUME_HOME/bin:$PATH
#END CONFIGURATION FLUME

$ source ~/.bashrc
$ cd /usr/lib/flume/conf
$ mv flume-env.sh.template flume-env.sh

In file flume-env.sh add:

JAVA_HOME=/usr/lib/jvm/jdk1.7.0_71





And that's all. Verify the installation:

$ /usr/lib/flume/bin/flume-ng version

You should see something like this:

Flume 1.5.2
Source code repository: https://git-wip-us.apache.org/repos/asf/flume.git
Revision: 229442aa6835ee0faa17e3034bcab42754c460f5
Compiled by hshreedharan on Wed Nov 12 12:51:22 PST 2014
From source with checksum 837f81bd1e304a65fcaf8e5f692b3f18

Maybe you could be interested in this other post.

Installing Apache Hive


The Apache Hive data warehouse software facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.

Installing Apache Hive

In the previous post we installed Hadoop 1.2.1. Switch to the hduser user:

$ su hduser

1. Prerequisites 

$ java -version
$ hadoop version
$ jps


2. Download Apache Hive

$ sudo wget http://apache.mirrors.hoobly.com/hive/stable/apache-hive-0.14.0-bin.tar.gz

3. Create the Hive directory hierarchy:

$ cd  /usr/local/hadoop/bin
$ hadoop fs -mkdir /tmp
$ hadoop fs -mkdir /user/hive/warehouse
$ hadoop fs -chmod g+w /tmp
$ hadoop fs -chmod g+w /user/hive/warehouse
$ hadoop fs -chmod 777 /tmp/hive

4. Configuration

$ sudo tar -xzvf apache-hive-0.14.0-bin.tar.gz
$ mv apache-hive-0.14.0-bin hive
$ cd hive
$ pwd
$ export HIVE_HOME=/home/hduser/hive
$ export PATH=$HIVE_HOME/bin:$PATH
hduser@ubuntu:~/hive$ hive

You should see something like this:

Logging initialized using configuration in jar:file:/home/hduser/hive/lib/hive-common-0.14.0.jar!/hive-log4j.properties
hive>

If you have problems running apache-hive-0.14.0, maybe this link can help you.

hive> show tables;
OK
Time taken: 3.511 seconds


Wednesday, December 03, 2014

Running Hadoop on Ubuntu Linux (Single-Node Cluster)


In this post I will describe the required steps for setting up a single-node Hadoop cluster backed by the Hadoop Distributed File System, running on Ubuntu.

The main goal is to get a simple Hadoop installation up and running so that you can play around with the software and learn more about it.

It has been tested with the following software versions:
  • Ubuntu 12.04
  • Hadoop 1.2.1

Hadoop

1. Make sure you have the Java JDK

Hadoop requires a working Java 1.5+ (aka Java 5) installation.

$ java -version
java version "1.6.0_33"
OpenJDK Runtime Environment (IcedTea6 1.13.5) (6b33-1.13.5-1ubuntu0.12.04)

OpenJDK Client VM (build 23.25-b01, mixed mode, sharing)

2. Adding a dedicated Hadoop system user

$ sudo addgroup hadoop
$ sudo adduser --ingroup hadoop hduser
$ su - hduser

3. Configuring SSH

Hadoop requires SSH access to manage its nodes. In this case we need to configure SSH access to localhost for the hduser user we created in step 2.

$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hduser/.ssh/id_rsa): 
Created directory '/home/hduser/.ssh'.
Your identification has been saved in /home/hduser/.ssh/id_rsa.
Your public key has been saved in /home/hduser/.ssh/id_rsa.pub.
The key fingerprint is:
22:2d:b1:fa:07:62:b2:b9:a9:9d:fc:3a:67:1e:48:b6 hduser@ubuntu
The key's randomart image is:
+--[ RSA 2048]----+
|                 |
|                 |
|    .            |
|     +           |
|  o + o S        |
|.oo+.o .         |
| =E...           |
|o+.oo..          |
|=.=B+.           |
+-----------------+

Generally, using an empty password is not recommended, but in this case it is needed to unlock the key without your interaction (you don’t want to enter the passphrase every time Hadoop interacts with its nodes).

Second, you have to enable SSH access to your local machine with this newly created key.

$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

The final step is to test the SSH setup by connecting to your local machine with the hduser user.

$ ssh localhost

If it fails, maybe this article helps you.


5. Download a Hadoop version

I downloaded:  1.2.1. 

$ tar xfz hadoop-1.2.1.tar.gz
$ sudo mkdir -p /usr/local/hadoop
$ sudo mv hadoop-1.2.1 /usr/local/hadoop/
$ sudo chown -R hduser:hadoop /usr/local/hadoop/hadoop-1.2.1

6. Add the following lines to the end of ~/.bashrc

# Some convenient aliases and functions for running Hadoop-related commands
unalias fs &> /dev/null
alias fs="hadoop fs"
unalias hls &> /dev/null
alias hls="fs -ls"


#HADOOP VARIABLES START
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-i386
export HADOOP_INSTALL=/usr/local/hadoop/hadoop-1.2.1
export PATH=$PATH:$HADOOP_INSTALL/bin
#HADOOP VARIABLES END

7. Configuration

Our goal in this tutorial is a single-node setup of Hadoop:

  • hadoop-env.sh
Open conf/hadoop-env.sh in the editor of your choice (if you used the installation path in this tutorial, the full path is /usr/local/hadoop/hadoop-1.2.1/conf/hadoop-env.sh) and set the JAVA_HOME environment variable:

export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-i386

  • conf/*-site.xml
We will configure the directory where Hadoop will store its data files, the network ports it listens to, etc. Our setup will use Hadoop’s Distributed File System, HDFS, even though our little “cluster” only contains our single local machine.

You can leave the settings below “as is” with the exception of the hadoop.tmp.dir parameter. This parameter you must change to a directory of your choice. We will use the directory /app/hadoop/tmp in this tutorial. Hadoop’s default configurations use hadoop.tmp.dir as the base temporary directory both for the local file system and HDFS, so don’t be surprised if you see Hadoop creating the specified directory automatically on HDFS at some later point.

$ sudo mkdir -p /app/hadoop/tmp
$ sudo chmod 750 /app/hadoop/tmp

Add the following snippets between the <configuration> ... </configuration> tags in the respective configuration XML file.

In file conf/core-site.xml:

<property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
  <description>A base for other temporary directories.</description>
</property>

<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default file system.  A URI whose
  scheme and authority determine the FileSystem implementation.  The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class.  The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
</property>

In file conf/mapred-site.xml:


<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <description>The host and port that the MapReduce job tracker runs
  at.  If "local", then jobs are run in-process as a single map
  and reduce task.
  </description>
</property>


In file conf/hdfs-site.xml:

<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.
  </description>
</property>

8. Formatting the HDFS filesystem via the NameNode

The first step to starting up your Hadoop installation is formatting the Hadoop filesystem which is implemented on top of the local filesystem of your “cluster” (which includes only your local machine if you followed this post). You need to do this the first time you set up a Hadoop cluster.

Do not format a running Hadoop filesystem as you will lose all the data currently in the cluster (in HDFS)! 

To format the filesystem (which simply initializes the directory specified by the dfs.name.dir variable), run the command:

$ sudo /usr/local/hadoop/hadoop-1.2.1/bin/hadoop namenode -format

You should see something like:

14/12/03 03:40:17 INFO namenode.NameNode: STARTUP_MSG: 
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = ubuntu/127.0.1.1
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 1.2.1
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.2 -r 1503152; compiled by 'mattf' on Mon Jul 22 15:23:09 PDT 2013
STARTUP_MSG:   java = 1.6.0_33
************************************************************/
14/12/03 03:40:17 INFO util.GSet: Computing capacity for map BlocksMap
14/12/03 03:40:17 INFO util.GSet: VM type       = 32-bit
14/12/03 03:40:17 INFO util.GSet: 2.0% max memory = 1013645312
14/12/03 03:40:17 INFO util.GSet: capacity      = 2^22 = 4194304 entries
14/12/03 03:40:17 INFO util.GSet: recommended=4194304, actual=4194304
14/12/03 03:40:18 INFO namenode.FSNamesystem: fsOwner=root
14/12/03 03:40:18 INFO namenode.FSNamesystem: supergroup=supergroup
14/12/03 03:40:18 INFO namenode.FSNamesystem: isPermissionEnabled=true
14/12/03 03:40:18 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=100
14/12/03 03:40:18 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
14/12/03 03:40:18 INFO namenode.FSEditLog: dfs.namenode.edits.toleration.length = 0
14/12/03 03:40:18 INFO namenode.NameNode: Caching file names occuring more than 10 times 
14/12/03 03:40:18 INFO common.Storage: Image file /app/hadoop/tmp/dfs/name/current/fsimage of size 110 bytes saved in 0 seconds.
14/12/03 03:40:18 INFO namenode.FSEditLog: closing edit log: position=4, editlog=/app/hadoop/tmp/dfs/name/current/edits
14/12/03 03:40:18 INFO namenode.FSEditLog: close success: truncate to 4, editlog=/app/hadoop/tmp/dfs/name/current/edits
14/12/03 03:40:18 INFO common.Storage: Storage directory /app/hadoop/tmp/dfs/name has been successfully formatted.
14/12/03 03:40:18 INFO namenode.NameNode: SHUTDOWN_MSG: 
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at ubuntu/127.0.1.1
************************************************************/

9. Starting your single-node cluster


hduser@ubuntu:~$ /usr/local/hadoop/hadoop-1.2.1/bin/start-all.sh

This will start up a NameNode, DataNode, JobTracker and TaskTracker on your machine.

 You should see something like:

starting namenode, logging to /usr/local/hadoop/hadoop-1.2.1/libexec/../logs/hadoop-hduser-namenode-ubuntu.out
localhost: starting datanode, logging to /usr/local/hadoop/hadoop-1.2.1/libexec/../logs/hadoop-hduser-datanode-ubuntu.out
localhost: starting secondarynamenode, logging to /usr/local/hadoop/hadoop-1.2.1/libexec/../logs/hadoop-hduser-secondarynamenode-ubuntu.out
starting jobtracker, logging to /usr/local/hadoop/hadoop-1.2.1/libexec/../logs/hadoop-hduser-jobtracker-ubuntu.out
localhost: starting tasktracker, logging to /usr/local/hadoop/hadoop-1.2.1/libexec/../logs/hadoop-hduser-tasktracker-ubuntu.out

 A tool for checking whether the expected Hadoop processes are running is jps

hduser@ubuntu:/usr/local/hadoop/hadoop-1.2.1$ jps



You can also check with netstat if Hadoop is listening on the configured ports.

hduser@ubuntu:~$ sudo netstat -plten | grep java



If you want to stop your cluster, enter:

hduser@ubuntu:~$ /usr/local/hadoop/hadoop-1.2.1/bin/stop-all.sh

You should see something like this:

stopping jobtracker
localhost: stopping tasktracker
no namenode to stop
localhost: no datanode to stop
localhost: stopping secondarynamenode


Maybe you could be interested in this other post about Sentiment Analysis.

Monday, December 01, 2014

Apache Mahout: Scalable machine learning library

Apache Mahout is a project of the Apache Software Foundation that builds intelligent algorithms that learn from data input (machine learning). Mahout offers algorithms in three major areas: clustering, categorization and recommender systems.

  • Taste
Taste is the recommender-system part of Mahout, and it provides a consistent and flexible collaborative filtering engine. It offers a rich set of components from which you can construct a customized recommender system from a selection of algorithms. The package defines the following interfaces:
  1. DataModel
  2. UserSimilarity
  3. ItemSimilarity
  4. UserNeighborhood
  5. Recommender




This diagram shows the relationship between various Mahout components in a user-based recommender.
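
To make the list of interfaces above concrete, here is a small sketch (not from the original post) of how they fit together in a user-based recommender using the Taste Java API; it assumes a ratings file in the userID,itemID,rating format used later in this post:

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class TasteExample {
    public static void main(String[] args) throws Exception {
        // DataModel: the ratings, read from a CSV file (userID,itemID,rating)
        DataModel model = new FileDataModel(new File("mydata.dat"));
        // UserSimilarity: how alike two users' rating behaviour is
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        // UserNeighborhood: the 2 users most similar to the target user
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(2, similarity, model);
        // Recommender: combines the pieces above to produce recommendations
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
        // ask for 2 recommendations for user 1
        List<RecommendedItem> recommendations = recommender.recommend(1, 2);
        for (RecommendedItem item : recommendations) {
            System.out.println(item);
        }
    }
}

ItemSimilarity is the analogous interface used by item-based recommenders, such as the command-line job run in step 7 below.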


  • Installation (ubuntu)

Here follows a step-by-step guide to install and test the Mahout recommender system.

1. Make sure you have the Java JDK. 


$ java -version
java version "1.6.0_33"
OpenJDK Runtime Environment (IcedTea6 1.13.5) (6b33-1.13.5-1ubuntu0.12.04)
OpenJDK Client VM (build 23.25-b01, mixed mode, sharing)

2. Install the project manager Maven

$  mvn -version


Apache Maven 3.0.4
Maven home: /usr/share/maven
Java version: 1.6.0_33, vendor: Sun Microsystems Inc.
Java home: /usr/lib/jvm/java-6-openjdk-i386/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "linux", version: "3.8.0-44-generic", arch: "i386", family: "unix"

3. Download a Hadoop version

I downloaded 1.2.1. Be careful with this: with Hadoop 2 you can run into problems with this version of Mahout.


$ tar xfz hadoop-1.2.1.tar.gz
$ sudo mv hadoop-1.2.1 /usr/local/hadoop


4. Download the Mahout package 

I downloaded the version 0.9: mahout-distribution-0.9-src.tar.gz

5. Unpack mahout-distribution-0.9-src.tar.gz

$ cd /opt/
$ tar -xvzf mahout-distribution-0.9-src.tar.gz
$ cd mahout-distribution-0.9
$ sudo mvn install

With this, you will have compiled Mahout's code and run the unit tests that come with it, to make sure everything is OK with the component.

If the build was successful you should see something like:

[INFO] BUILD SUCCESSFUL
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 55 minutes 23 seconds
[INFO] Finished at: Tue Dec 01 10:15:02 BRT 2014
[INFO] Final Memory: 60M/275M
[INFO] ------------------------------------------------------------------------


6. Now use gedit (or your favorite editor) to edit ~/.bashrc using the following command:

$ gedit ~/.bashrc

This will open the .bashrc file in a text editor. Go to the end of the file and paste/type the following content in it:

#HADOOP VARIABLES START
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-i386
export HADOOP_INSTALL=/usr/local/hadoop/hadoop-1.2.1
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="$HADOOP_OPTS -Djava.library.path=/usr/local/hadoop/lib/native"
#HADOOP VARIABLES END

7. Executing the recommender

Get your data in the following format:

userid,itemid,rating

$ cd /opt/mahout-distribution-0.9

For example, copy the following data into a file named mydata.dat in the directory where you have installed Mahout:

1,101,5.0
1,102,3.0
1,103,2.5
2,101,2.0
2,102,2.5
2,103,5.0
2,104,2.0
3,101,2.5
3,104,4.0
3,105,4.5
3,107,5.0
4,101,5.0
4,103,3.0
4,104,4.5
4,106,4.0
5,101,4.0
5,102,3.0
5,103,2.0
5,104,4.0
5,105,3.5
5,106,4.0 

Now you need to create the file users.dat in the same folder. 
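
For example (matching the sample ratings above), users.dat can simply list the IDs of the users you want recommendations for, one per line:

1
2
3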

$ chmod -R 777 /opt/mahout-distribution-0.9

Now, run:

$ bin/mahout recommenditembased --input mydata.dat --usersFile users.dat --numRecommendations 2 --output output/ --similarityClassname SIMILARITY_PEARSON_CORRELATION

The usersFile is where you list the users you want recommendations for. You can change numRecommendations to the number of recommendations you desire.







Friday, January 03, 2014

New features Rails 4




  • Deprecated dynamic finders: Rails 4 deprecates all dynamic finder methods (with the exception of find_by and find_by_…). Instead, you’ll use where:


find_all_by_... => where(...)
scoped_by_... => where(...)
find_last_by_... => where(...).last
find_or_initialize_by... => where(...).first_or_initialize
find_or_create_by... =>  can be rewritten using find_or_create_by(...) or where(...).first_or_create
find_or_create_by_...! => can be rewritten using find_or_create_by!(...) or where(...).first_or_create!


  • Renamed Callbacks: Action callbacks in controllers are now renamed from *_filter to *_action

Example: 
before_filter >> before_action
after_filter  >> after_action

  • Routing Concerns: In Rails 4, routing concerns have been added to the router. The basic idea is to define common sub-resources (like comments) as concerns and include them in other resources/routes.

Rails 3 code:

resources :posts do
  resources :comments
end

resources :articles do
  resources :comments
  resources :remarks
end

Rails 4 code:

concern :commentable do
  resources :comments
end

concern :remarkable do
  resources :remarks
end

resources :posts, :concerns => :commentable
resources :articles, :concerns => [:commentable, :remarkable]



  •  Datatypes: Here are all the Rails 4 (ActiveRecord migration) datatypes:

:binary
:boolean
:date
:datetime
:decimal
:float
:integer
:primary_key
:references
:string
:text
:time
:timestamp
If you use PostgreSQL, you can also take advantage of these:
:hstore
:array
:cidr_address
:ip_address
:mac_address
They are stored as strings if you run your app with a non-PostgreSQL database.
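
As a quick illustration (the table and column names here are invented, not from the original post), a Rails 4 migration using several of these types looks like this:

class CreateArticles < ActiveRecord::Migration
  def change
    create_table :articles do |t|
      t.string   :title
      t.text     :body
      t.decimal  :price, precision: 8, scale: 2
      t.boolean  :published, default: false
      t.datetime :published_at
      t.references :author
      t.timestamps
    end
  end
end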



  •  Queuing system: Rails 4 added support for a queue to run background jobs. The queuing API is very simple: the ActiveSupport::Queue class comes with a push method that accepts any object, as long as that object defines a run method.

Example:

class TestJob
def run
                puts "I am running!"
end
end

You can queue a job to print “I am running!” in the background by pushing an instance of that class to Rails.queue:
Rails.queue.push(TestJob.new)
=> "I am running!"

  •  Strong Parameters: In Rails 4, a new pattern has been introduced to secure your models from mass assignment. You can filter the parameters passed to your model in the controller instead of ‘whitelisting’ the attributes in your model using “attr_accessible”.

class PostController < ApplicationController
def create
@post = Post.create(params[:user])
...
end
end

In Rails 3, you could protect against unexpected input with declarations in the model using “attr_accessible”.

attr_accessible :title, :description

In Rails 4, you don’t need to worry about unexpected input attributes in your model anymore: the Strong Parameters gem moves the handling of user input into the controller.

class PostController < ApplicationController

def create
@post = Post.create(post_params)
...
end

private
def post_params
params.require(:post).permit(:title, :description)
end
end

The “require” method ensures that the specified key is available in the “params” hash, and raises an ActionController::ParameterMissing exception if the key doesn’t exist.
The “permit” method protects you from unexpected mass assignment.


  • Custom Flash Types: You can register your own flash types to use in redirect_to calls and in templates. For example:

      # app/controllers/application_controller.rb
      class ApplicationController
            add_flash_types :error, :catastrophe
      end

      # app/controllers/things_controller.rb
      class ThingsController < ApplicationController
            def create
                  # ... create a thing
            rescue Error => e
                  redirect_to some_path, :error => e.message
            rescue Catastrophe => e
                  redirect_to another_path, :catastrophe => e.message
            end
      end

      # app/views/layouts/application.html.erb
      <div class="error"><%= error %></div>
      <div class="catastrophe"><%= catastrophe %></div>