Wednesday, December 10, 2014

Sentiment Analysis on Twitter with R

In the previous post we explained how to install R on Ubuntu. R offers a wide variety of options for doing lots of interesting and fun things, and this post shows you precisely how to do some of them.

1. How to get data from twitter?

So the first thing to do is get some data from twitter.

There are two primary ways to obtain data. In order of complexity these are:

a) Using the R package "twitteR"
b) Using the R package "XML"

2. Using the R package "twitteR"

You don't have to download it from a website; you can install it directly from within R.

You can do it with:

> install.packages('twitteR', dependencies=T)

You then have to select a CRAN mirror from which to download it, and click OK.
R will now download and install the package. If you see some errors, maybe this article can help you.

Then we have to activate it for our current session with:

> library(twitteR)
Loading required package: ROAuth
Loading required package: RCurl
Loading required package: bitops
Loading required package: rjson

> library(plyr)
Error in library(plyr) : there is no package called ‘plyr’

Try setting your repo to a different mirror like this:

> options(repos="")

or use any other mirror of your choice.

Then try loading plyr:

> install.packages("plyr")
> library("plyr") 


3. Twitter authentication

First we need to create an app at Twitter.

Go to and log in with your Twitter Account.

Once you have created your application...

Continue to R and type in the following lines:

> reqURL <- ""
> accessURL <- ""
> authURL <- ""
> consumerKey <- "yourconsumerkey"
> consumerSecret <- "yourconsumersecret"

You have to replace yourconsumerkey and yourconsumersecret with the values shown on your app page on Twitter, which should still be open in your web browser.

> twitCred <- OAuthFactory$new(consumerKey=consumerKey, consumerSecret=consumerSecret, requestURL=reqURL, accessURL=accessURL, authURL=authURL)
> download.file(url="", destfile="cacert.pem")

You should see something like this:

To enable the connection, please direct your web browser to:
When complete, record the PIN given to you and provide it here:


4. Processing tweets data via twitteR

Let's collect some tweets containing the term "C.I.A torture":

# collect tweets in english containing 'C.I.A torture'
> tweets = searchTwitter("C.I.A torture", n=200, cainfo="cacert.pem")

To be able to analyze our tweets, we have to extract their text and save it into the variable tweets_content by typing:

> tweets_content = laply(tweets,function(t)t$getText())

What we also need are our lists with the positive and the negative words. We can find them here

After downloading the word lists, load them into variables:

> neg= scan('/path/negative-words.txt', what='character', comment.char=';')
> pos= scan('/path/positive-words.txt', what='character', comment.char=';')

> install.packages("stringr")
> library(stringr)

Now we have to use a small algorithm written by Jeffrey Breen to score our words.

Just copy-paste the following lines and hit enter:

#function to calculate number of words in each category within a sentence
> score.sentiment = function(sentences, pos.words, neg.words, .progress='none')
{
    # we got a vector of sentences. plyr will handle a list
    # or a vector as an "l" for us
    # we want a simple array ("a") of scores back, so we use
    # "l" + "a" + "ply" = "laply":
    scores = laply(sentences, function(sentence, pos.words, neg.words) {
        # clean up sentences with R's regex-driven global substitute, gsub():
        sentence = gsub('[[:punct:]]', '', sentence)
        sentence = gsub('[[:cntrl:]]', '', sentence)
        sentence = gsub('\\d+', '', sentence)
        # and convert to lower case:
        sentence = tolower(sentence)

        # split into words. str_split is in the stringr package
        word.list = str_split(sentence, '\\s+')
        # sometimes a list() is one level of hierarchy too much
        words = unlist(word.list)

        # compare our words to the dictionaries of positive & negative terms
        pos.matches = match(words, pos.words)
        neg.matches = match(words, neg.words)
        # match() returns the position of the matched term or NA
        # we just want a TRUE/FALSE:
        pos.matches = !
        neg.matches = !

        # and conveniently enough, TRUE/FALSE will be treated as 1/0 by sum():
        score = sum(pos.matches) - sum(neg.matches)

        return(score)
    }, pos.words, neg.words, .progress=.progress )

    scores.df = data.frame(score=scores, text=sentences)
    return(scores.df)
}

> analysis = score.sentiment(tweets_content , pos, neg)

The resulting score can be interpreted roughly as:

Very Negative (score -5 or -4)
Negative (score -3, -2, or -1)
Positive (score 1, 2, or 3)
Very Positive (score 4 or 5)
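The core of Breen's scorer is language-agnostic: clean the text, split it into words, and count dictionary hits. Here is a minimal Python sketch of the same logic (the tiny word sets are made up for illustration, not the full opinion lexicon):

```python
import re

def score_sentiment(sentence, pos_words, neg_words):
    # mirror the R gsub() calls: strip punctuation and digits, then lowercase
    cleaned = re.sub(r"[^\w\s]|\d", "", sentence).lower()
    words = cleaned.split()
    # score = positive hits minus negative hits,
    # like sum(pos.matches) - sum(neg.matches) in the R version
    return sum(w in pos_words for w in words) - sum(w in neg_words for w in words)

print(score_sentiment("What a great, great day!", {"great", "good"}, {"bad", "torture"}))  # 2
```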

You can get a table by typing:

>  table(analysis$score)

Or the mean by typing:

>  mean(analysis$score)

Or get a histogram with:

>  hist(analysis$score)
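If you want to reproduce the table/mean summary outside R, the equivalent bookkeeping in Python is straightforward (the scores list below is hypothetical):

```python
from collections import Counter
from statistics import mean

scores = [-2, -1, -1, 0, 0, 0, 1, 2]  # hypothetical sentiment scores

print(Counter(scores))  # like table(analysis$score): frequency of each score
print(mean(scores))     # like mean(analysis$score)
```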

R Statistical Computing: Installing in Ubuntu

R is a free software environment for statistical computation and graphics.  It consists of a language plus a run-time environment with graphics, a debugger, access to certain system functions, and the ability to run programs stored in script files.

Steps to install it on Ubuntu:

1. Uninstall Previous R-base installation

$ sudo apt-get remove r-base-core

2. Update Sources.List File

$ sudo gedit /etc/apt/sources.list

Add the following line: deb precise/

3. Add the public keys

$ sudo apt-key adv --keyserver --recv-keys E084DAB9
$ sudo add-apt-repository ppa:marutter/rdev

4. Install R-base

$ sudo apt-get update
$ sudo apt-get upgrade

$ sudo apt-get install r-base

5. Launch R-base

$ R

You should see something like this:

R version 3.1.2 (2014-10-31) -- "Pumpkin Helmet"
Copyright (C) 2014 The R Foundation for Statistical Computing
Platform: i686-pc-linux-gnu (32-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.


If you like this post, you could be interested in this other post: Sentiment Analysis on Twitter with R

Tuesday, December 09, 2014

Configuring a Single-Node Storm Cluster

Apache Storm is a distributed real-time computation system for processing fast, large streams of data, adding real-time data processing to Apache Hadoop.
Storm was developed by BackType, a social analytics company later acquired by Twitter. You can read more about it in this tutorial.

Storm installation can be separated into three parts: installing Zookeeper, installing ZeroMQ with its Java bindings (jzmq), and installing Storm itself.


Zookeeper is the coordinator for Storm cluster. The interaction between nimbus and worker nodes is done through the Zookeeper.

Download the Zookeeper setup:

$ wget
$ tar -xvf zookeeper-3.4.6.tar.gz
$ mv zookeeper-3.4.6 zookeeper

Optionally :
    a)  Add ZOOKEEPER_HOME under .bashrc
    b)  Add ZOOKEEPER_HOME/bin to the PATH

export ZOOKEEPER_HOME=/home/hduser/zookeeper

Create the data folder and update conf/zoo.cfg to point to it. By default it is set to the /tmp folder, which is cleaned on every boot. The rest of the default settings are good enough.

$ mkdir zookeeper-data/   
$ cd zookeeper/conf
$ sudo mv zoo_sample.cfg zoo.cfg
$ sudo nano zoo.cfg


tickTime: the basic time unit in milliseconds used by Zookeeper. It is used for heartbeats, and the minimum session timeout will be twice the tickTime.

dataDir: the location to store the in-memory database snapshots and, unless specified otherwise, the transaction log of updates to the database.

clientPort: the port to listen on for client connections.
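Putting these settings together, a minimal single-node zoo.cfg (pointing dataDir at the folder created above; tickTime and clientPort are the usual defaults) might look like:

```
tickTime=2000
dataDir=/home/hduser/zookeeper-data
clientPort=2181
```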

Now your Zookeeper cluster is ready to start.


Verify that you are able to start the Zookeeper server:

$ cd ..
$ bin/ start

JMX enabled by default
Using config: /home/hduser/zookeeper/bin/../conf/zoo.cfg
Starting zookeeper ... ./ line 109: ./zookeeper.out: Permission denied

I resolved this error by telling zookeeper where I wanted the log file to be placed.

$ sudo nano bin/

Add this assignment at the top of the file:

ZOO_LOG_DIR="/var/log/zookeeper"
Then create that directory:

$ sudo mkdir /var/log/zookeeper
$ sudo chown zookeeper /var/log/zookeeper

$ bin/ start
JMX enabled by default
Using config: /home/hduser/zookeeper/bin/../conf/zoo.cfg
Starting zookeeper ... STARTED

The problem is resolved!!!

$ jps
4117 Jps
3982 QuorumPeerMain


Storm internally uses ZeroMQ. Download the zeromq code, then compile and install it. Be careful with the versions.

$ wget
$ tar -xzf zeromq-4.0.5.tar.gz
$ mv zeromq-4.0.5 zeromq
$ cd zeromq
$ ./configure
$ make
$ sudo make install

Install the git and libtool packages from the terminal. These are the prerequisites for the next step.

$ sudo apt-get install libtool git

Download the code for jzmq. These are the Java bindings for zeromq. Compile and install it.

$ sudo git clone
$ cd jzmq

$ sed -i 's/classdist_noinst.stamp/classnoinst.stamp/g' src/
$ ./ 
$ make
$ sudo make install


Now we are all set to install Storm itself. Download the latest Storm release and extract it.

$ wget ''
$ tar -xzvf apache-storm-0.9.3-rc1.tar.gz

$ mv apache-storm-0.9.3-rc1 storm

Now you need to configure Storm. Create the Storm configuration file 'storm.yaml', which lives in the 'conf' folder of the extracted Storm root directory.

$ nano conf/storm.yaml


storm.zookeeper.servers:
    - "localhost"

storm.local.dir: "/home/hduser/storm/data" "localhost"

supervisor.slots.ports:
    - 6700
    - 6701


Start the cluster

A) Start the Zookeeper cluster

To start the Zookeeper server, go to the 'bin' directory of the Zookeeper installation and execute the following command.

$ sudo sh start

B) Start the Storm daemons

The nimbus service is similar to JobTracker and the supervisor service is similar to TaskTracker in Hadoop. More details about the Storm terminology are specified here.

$ bin/storm nimbus

$ bin/storm supervisor

C) Start the Storm UI

$ bin/storm ui

Use the Web UI to check the logs for any exceptions. Go to the Storm UI at http://localhost:8080.


To upload a topology to the Storm cluster, go to the 'bin' directory of the Storm installation and execute the following command:

$ storm jar <path-to-topology-jar> <class-with-the-main> <arg1> <arg2> <argN>


<path-to-topology-jar>: the complete path to the compiled jar containing your topology code and all its libraries.

<class-with-the-main>: the class in the jar file whose main method executes the StormSubmitter.

<arg1> <arg2> <argN>: the remaining arguments are the parameters passed to that main method.
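For concreteness, a submit command for a hypothetical topology (every name and path below is made up for illustration) can be assembled like this:

```python
jar = "/home/hduser/mytopology/target/mytopology-jar-with-dependencies.jar"  # hypothetical path
main_class = "com.example.MyTopology"  # hypothetical class whose main() calls StormSubmitter
args = ["MyTopologyName"]              # topology name, visible later in the Storm UI

cmd = ["storm", "jar", jar, main_class] + args
print(" ".join(cmd))
```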

With Storm running on a single node, now it's time to execute the sample code. Get the sample code from Git.

$ git clone

$ cd storm-starter

Package the code. After a successful build, storm-starter-*.jar will be created in the target folder.

$ mvn -f m2-pom.xml package

Execute the WordCountTopology example. The job is submitted to Storm and control returns immediately. The last parameter, WordCount, is the topology name, which can be observed in the Storm UI. Check the logs for any exceptions.

$ bin/storm jar /home/hduser/storm-starter/target/storm-starter-0.0.1-SNAPSHOT-jar-with-dependencies.jar storm.starter.WordCountTopology WordCount

You can inspect the source code of the word-count example with:

~/storm-starter/src/jvm/storm/starter$ less

Friday, December 05, 2014

Analyse Tweets using Flume, Hadoop and Hive

In this post we will get tweets using Flume and save them into HDFS for later analysis. Twitter exposes an API to get the tweets. The service is free, but requires the user to register. We will quickly summarize how to get data into HDFS using Flume and start doing some analytics using Hive.

1. Twitter API

You need to create a Twitter app to have the consumer key, consumer secret, access token, and access token secret.

2.  Configure Flume

Assuming that Hadoop, Hive and Flume have already been installed and configured (see previous posts), download the flume-sources-1.0-SNAPSHOT.jar.

From command line (assume flume-sources-1.0-SNAPSHOT.jar is in your ~):

$ sudo cp ~/flume-sources-1.0-SNAPSHOT.jar /usr/lib/flume

Add it to the Flume classpath as shown below in the conf/ file:


The jar contains the java classes to pull the Tweets and save them into HDFS.

3. Configure Agents

The conf/flume.conf should have all the agents (flume, channel and hdfs) defined as below:

TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS

TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = <consumerKey>
TwitterAgent.sources.Twitter.consumerSecret = <consumerSecret>
TwitterAgent.sources.Twitter.accessToken = <accessToken>
TwitterAgent.sources.Twitter.accessTokenSecret = <accessTokenSecret>

TwitterAgent.sources.Twitter.keywords = hadoop, big data, analytics, bigdata, cloudera, data science, data scientiest, business intelligence, mapreduce, data warehouse, data warehousing, mahout, hbase, nosql, newsql, businessintelligence, cloudcomputing = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:54310/user/flume/tweets/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000

TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 100

The consumerKey, consumerSecret, accessToken and accessTokenSecret have to be replaced with those obtained from your Twitter app page. TwitterAgent.sinks.HDFS.hdfs.path should point to the NameNode and the location in HDFS where the tweets will go.

4. Start Flume by entering the following command

$ bin/flume-ng agent --conf ./conf/ -f conf/flume.conf -Dflume.root.logger=DEBUG,console -n TwitterAgent

Maybe you are going to see an error similar to this one:

AM ERROR org.apache.flume.lifecycle.LifecycleSupervisor
Unable to start EventDrivenSourceRunner: { source:com.cloudera.flume.source.TwitterSource{name:Twitter,state:IDLE} } - Exception follows.
java.lang.NoSuchMethodError: twitter4j.FilterQuery.setIncludeEntities(Z)Ltwitter4j/FilterQuery;
at com.cloudera.flume.source.TwitterSource.start(
at org.apache.flume.source.EventDrivenSourceRunner.start(
at org.apache.flume.lifecycle.LifecycleSupervisor$
at java.util.concurrent.Executors$
at java.util.concurrent.FutureTask$Sync.innerRunAndReset(
at java.util.concurrent.FutureTask.runAndReset(
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(
at java.util.concurrent.ScheduledThreadPoolExecutor$
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(
at java.util.concurrent.ThreadPoolExecutor$
1:08:39.826 AM WARN org.apache.flume.lifecycle.LifecycleSupervisor
Component EventDrivenSourceRunner: { source:com.cloudera.flume.source.TwitterSource{name:Twitter,state:STOP} } stopped, since it could not be successfully started due to missing dependencies

If that is your case, then you must do the following:

You need to recompile flume-sources-1.0-SNAPSHOT.jar from the source.

Install Maven, then download the repository of cdh-twitter-example.

$ cd flume-sources

$ mvn package

$ cd ..

Copy the new .jar in  /usr/lib/flume.

This problem happened because twitter4j removed the setIncludeEntities method when it went from version 2.2.6 to 3.x, and the prebuilt JAR is not up to date.

By default, the NameNode web interface (HDFS layer) is available at http://localhost:50070/. There you can see the tweets, in this case in the folder /user/flume/tweets.

5.  Configure Hive

$ cd /home/hduser/hive/

Modify the conf/hive-site.xml to include the locations of the NameNode and the JobTracker as below


Download hive-serdes-1.0-SNAPSHOT.jar to the lib directory in Hive. Twitter returns tweets in JSON format, and this library helps Hive understand it.

Start the Hive shell using the hive command and register the hive-serdes-1.0-SNAPSHOT.jar file downloaded earlier.

Edit the file and add:

export HIVE_AUX_JARS_PATH="/home/hduser/hive/lib/hive-serdes-1.0-SNAPSHOT.jar"

Or you can add it directly in the Hive session:

hive> ADD JAR /home/hduser/hive/lib/hive-serdes-1.0-SNAPSHOT.jar;

6. Now, create the tweets table in Hive

hive> CREATE EXTERNAL TABLE tweets (
   id BIGINT,
   created_at STRING,
   source STRING,
   favorited BOOLEAN,
   retweet_count INT,
   retweeted_status STRUCT<
      text:STRING,
      user:STRUCT<screen_name:STRING,name:STRING>>,
   entities STRUCT<
      urls:ARRAY<STRUCT<expanded_url:STRING>>,
      user_mentions:ARRAY<STRUCT<screen_name:STRING,name:STRING>>,
      hashtags:ARRAY<STRUCT<text:STRING>>>,
   text STRING,
   user STRUCT<
      screen_name:STRING,
      name:STRING,
      friends_count:INT,
      followers_count:INT,
      statuses_count:INT>,
   in_reply_to_screen_name STRING)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
LOCATION '/user/flume/tweets';

7. Playing with Hive.

Now that we have the data in HDFS and the table created in Hive, let's run some queries in Hive.

One way to determine who is the most influential person in a particular field is to figure out whose tweets are retweeted the most.

$ hive

Give enough time for Flume to collect Tweets from Twitter to HDFS and then run the below query in Hive to determine the most influential person.

hive> SELECT t.retweeted_screen_name, sum(retweets) AS total_retweets, count(*) AS tweet_count FROM (SELECT retweeted_status.user.screen_name as retweeted_screen_name, retweeted_status.text, max(retweet_count) as retweets FROM tweets GROUP BY retweeted_status.user.screen_name, retweeted_status.text) t GROUP BY t.retweeted_screen_name ORDER BY total_retweets DESC LIMIT 10;
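The nested query can be confusing at first: the inner SELECT takes the maximum retweet_count per (user, tweet) pair, and the outer one sums those maxima per user. The same logic on a few hypothetical rows, sketched in Python:

```python
from collections import defaultdict

# hypothetical (retweeted_screen_name, text, retweet_count) rows
rows = [
    ("alice", "big data!", 10),
    ("alice", "big data!", 7),   # same tweet seen twice; max() keeps 10
    ("alice", "hadoop", 3),
    ("bob", "hive", 5),
]

# inner query: MAX(retweet_count) grouped by (screen_name, text)
per_tweet = defaultdict(int)
for user, text, rc in rows:
    per_tweet[(user, text)] = max(per_tweet[(user, text)], rc)

# outer query: SUM of those maxima grouped by screen_name, ordered descending
totals = defaultdict(int)
for (user, _), rc in per_tweet.items():
    totals[user] += rc

print(sorted(totals.items(), key=lambda kv: -kv[1]))  # [('alice', 13), ('bob', 5)]
```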

Similarly, to find which user has the most followers, the query below helps.

hive> select user.screen_name, user.followers_count c from tweets order by c desc;

If you have read this post, maybe you are interested in this article.

Thursday, December 04, 2014

Apache Flume

Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store. It has a simple and flexible architecture based on streaming data flows.

Flume is configured by defining endpoints in a data flow called sources and sinks. The source produces events (e.g., the Twitter Streaming API) and the sink writes the events out to a location. Between the source and the sink there is a channel; the source sends data to the sink through the channel.
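As a toy illustration of that source → channel → sink flow (this is not Flume's actual API, just the idea in Python):

```python
from queue import Queue

channel = Queue()  # the channel buffers events between source and sink

def source(events):
    # the source produces events (e.g. tweets from a streaming API)
    for e in events:
        channel.put(e)

def sink(store):
    # the sink drains the channel and writes events to a destination (e.g. HDFS)
    while not channel.empty():
        store.append(channel.get())

hdfs = []  # stand-in for the real destination
source(["tweet 1", "tweet 2"])
sink(hdfs)
print(hdfs)  # ['tweet 1', 'tweet 2']
```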


1. Download the latest stable release of Apache Flume

$ sudo wget

2. Create the Flume directory hierarchy:

$ tar -xzf apache-flume-1.5.2-bin.tar.gz
$ mv apache-flume-1.5.2-bin flume
$ sudo mv flume/ /usr/lib/
$ sudo chmod -R 777 /usr/lib/flume
$ cd /usr/lib/flume

3. Configuration 

$ nano ~/.bashrc

Add this line to .bashrc:

export FLUME_HOME=/usr/lib/flume

$ source ~/.bashrc
$ cd /usr/lib/flume/conf
$ mv

In file add:


And that's all.

$ /usr/lib/flume/bin/flume-ng version

You should see something like this:

Flume 1.5.2
Source code repository:
Revision: 229442aa6835ee0faa17e3034bcab42754c460f5
Compiled by hshreedharan on Wed Nov 12 12:51:22 PST 2014
From source with checksum 837f81bd1e304a65fcaf8e5f692b3f18

Maybe you could be interested in this other post.

Installing Apache Hive

The Apache Hive data warehouse software facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.

Installing Apache Hive

In the previous post we installed Hadoop 1.2.1.
$ su hduser

1. Prerequisites 

$ java -version
$ hadoop version
$ jps

2. Download Apache Hive

$ sudo wget

3. Create the Hive directory hierarchy:

$ cd  /usr/local/hadoop/bin
$ hadoop fs -mkdir /tmp
$ hadoop fs -mkdir /user/hive/warehouse
$ hadoop fs -chmod g+w /tmp
$ hadoop fs -chmod g+w /user/hive/warehouse
$ hadoop fs -chmod 777 /tmp/hive

4. Configuration

$ sudo tar -xzvf apache-hive-0.14.0-bin.tar.gz
$ mv apache-hive-0.14.0-bin hive
$ cd hive
$ pwd
$ export HIVE_HOME=/home/hduser/hive
$ export PATH=$HIVE_HOME/bin:$PATH
hduser@ubuntu:~/hive$ hive

You should see something like this:

Logging initialized using configuration in jar:file:/home/hduser/hive/lib/hive-common-0.14.0.jar!/

If you have problems running apache-hive-0.14.0, maybe this link can help you.

hive> show tables;
Time taken: 3.511 seconds

Wednesday, December 03, 2014

Running Hadoop on Ubuntu Linux (Single-Node Cluster)

In this post I will describe the required steps for setting up a single-node Hadoop cluster backed by the Hadoop Distributed File System, running on Ubuntu.

The main goal is to get a simple Hadoop installation up and running so that you can play around with the software and learn more about it.

It has been tested with the following software versions:
  • Ubuntu 12.04
  • Hadoop 1.2.1


1. Make sure you have the Java JDK

Hadoop requires a working Java 1.5+ (aka Java 5) installation

$ java -version
java version "1.6.0_33"
OpenJDK Runtime Environment (IcedTea6 1.13.5) (6b33-1.13.5-1ubuntu0.12.04)

OpenJDK Client VM (build 23.25-b01, mixed mode, sharing)

2. Adding a dedicated Hadoop system user

$ sudo addgroup hadoop
$ sudo adduser --ingroup hadoop hduser
$ su - hduser

3. Configuring SSH

Hadoop requires SSH access to manage its nodes. In this case we need to configure SSH access to localhost for the hduser user we created in step 2.

$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hduser/.ssh/id_rsa): 
Created directory '/home/hduser/.ssh'.
Your identification has been saved in /home/hduser/.ssh/id_rsa.
Your public key has been saved in /home/hduser/.ssh/
The key fingerprint is:
22:2d:b1:fa:07:62:b2:b9:a9:9d:fc:3a:67:1e:48:b6 hduser@ubuntu
The key's randomart image is:
+--[ RSA 2048]----+
|                 |
|                 |
|    .            |
|     +           |
|  o + o S        |
|.oo+.o .         |
| =E...           |
|o+.oo..          |
|=.=B+.           |
+-----------------+

Generally, using an empty password is not recommended, but in this case it is needed to unlock the key without your interaction (you don’t want to enter the passphrase every time Hadoop interacts with its nodes).

Second, you have to enable SSH access to your local machine with this newly created key.

$ cat $HOME/.ssh/ >> $HOME/.ssh/authorized_keys

The final step is to test the SSH setup by connecting to your local machine with the hduser user.

$ ssh localhost

If it fails, maybe this article helps you.

5. Download a Hadoop version

I downloaded version 1.2.1.

$ tar xfz hadoop-1.2.1.tar.gz
$ sudo mv hadoop-1.2.1 /usr/local/hadoop
$ chown hduser:hadoop -R /usr/local/hadoop/hadoop-1.2.1

6. Add the following lines to the end of the  ~/.bashrc

# Some convenient aliases and functions for running Hadoop-related commands
unalias fs &> /dev/null
alias fs="hadoop fs"
unalias hls &> /dev/null
alias hls="fs -ls"

export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-i386
export HADOOP_INSTALL=/usr/local/hadoop/hadoop-1.2.1

7. Configuration

Our goal in this tutorial is a single-node setup of Hadoop:

Open conf/ in the editor of your choice (if you used the installation path in this tutorial, the full path is /usr/local/hadoop/hadoop-1.2.1/conf/ and set the JAVA_HOME environment variable:

export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-i386

  • conf/*-site.xml
We will configure the directory where Hadoop will store its data files, the network ports it listens to, etc. Our setup will use Hadoop’s Distributed File System, HDFS, even though our little “cluster” only contains our single local machine.

You can leave the settings below “as is” with the exception of the hadoop.tmp.dir parameter. This parameter you must change to a directory of your choice. We will use the directory /app/hadoop/tmp in this tutorial. Hadoop’s default configurations use hadoop.tmp.dir as the base temporary directory both for the local file system and HDFS, so don’t be surprised if you see Hadoop creating the specified directory automatically on HDFS at some later point.

$ sudo mkdir -p /app/hadoop/tmp
$ sudo chmod 750 /app/hadoop/tmp

Add the following snippets between the <configuration> ... </configuration> tags in the respective configuration XML file.

In file conf/core-site.xml:

<property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
  <description>A base for other temporary directories.</description>
</property>

<property>
  <name></name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default file system.  A URI whose
  scheme and authority determine the FileSystem implementation.  The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class.  The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
</property>

In file conf/mapred-site.xml:

<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <description>The host and port that the MapReduce job tracker runs
  at.  If "local", then jobs are run in-process as a single map
  and reduce task.
  </description>
</property>

In file conf/hdfs-site.xml:

<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.
  </description>
</property>

8. Formatting the HDFS filesystem via the NameNode

The first step to starting up your Hadoop installation is formatting the Hadoop filesystem which is implemented on top of the local filesystem of your “cluster” (which includes only your local machine if you followed this post). You need to do this the first time you set up a Hadoop cluster.

Do not format a running Hadoop filesystem as you will lose all the data currently in the cluster (in HDFS)! 

To format the filesystem (which simply initializes the directory specified by the variable), run the command:

$ sudo /usr/local/hadoop/hadoop-1.2.1/bin/hadoop namenode -format

You should see something like:

14/12/03 03:40:17 INFO namenode.NameNode: STARTUP_MSG: 
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = ubuntu/
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 1.2.1
STARTUP_MSG:   build = -r 1503152; compiled by 'mattf' on Mon Jul 22 15:23:09 PDT 2013
STARTUP_MSG:   java = 1.6.0_33
14/12/03 03:40:17 INFO util.GSet: Computing capacity for map BlocksMap
14/12/03 03:40:17 INFO util.GSet: VM type       = 32-bit
14/12/03 03:40:17 INFO util.GSet: 2.0% max memory = 1013645312
14/12/03 03:40:17 INFO util.GSet: capacity      = 2^22 = 4194304 entries
14/12/03 03:40:17 INFO util.GSet: recommended=4194304, actual=4194304
14/12/03 03:40:18 INFO namenode.FSNamesystem: fsOwner=root
14/12/03 03:40:18 INFO namenode.FSNamesystem: supergroup=supergroup
14/12/03 03:40:18 INFO namenode.FSNamesystem: isPermissionEnabled=true
14/12/03 03:40:18 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=100
14/12/03 03:40:18 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
14/12/03 03:40:18 INFO namenode.FSEditLog: dfs.namenode.edits.toleration.length = 0
14/12/03 03:40:18 INFO namenode.NameNode: Caching file names occuring more than 10 times 
14/12/03 03:40:18 INFO common.Storage: Image file /app/hadoop/tmp/dfs/name/current/fsimage of size 110 bytes saved in 0 seconds.
14/12/03 03:40:18 INFO namenode.FSEditLog: closing edit log: position=4, editlog=/app/hadoop/tmp/dfs/name/current/edits
14/12/03 03:40:18 INFO namenode.FSEditLog: close success: truncate to 4, editlog=/app/hadoop/tmp/dfs/name/current/edits
14/12/03 03:40:18 INFO common.Storage: Storage directory /app/hadoop/tmp/dfs/name has been successfully formatted.
14/12/03 03:40:18 INFO namenode.NameNode: SHUTDOWN_MSG: 
SHUTDOWN_MSG: Shutting down NameNode at ubuntu/

9. Starting your single-node cluster

hduser@ubuntu:~$ /usr/local/hadoop/hadoop-1.2.1/bin/

This will startup a Namenode, Datanode, Jobtracker and a Tasktracker on your machine.

 You should see something like:

starting namenode, logging to /usr/local/hadoop/hadoop-1.2.1/libexec/../logs/hadoop-hduser-namenode-ubuntu.out
localhost: starting datanode, logging to /usr/local/hadoop/hadoop-1.2.1/libexec/../logs/hadoop-hduser-datanode-ubuntu.out
localhost: starting secondarynamenode, logging to /usr/local/hadoop/hadoop-1.2.1/libexec/../logs/hadoop-hduser-secondarynamenode-ubuntu.out
starting jobtracker, logging to /usr/local/hadoop/hadoop-1.2.1/libexec/../logs/hadoop-hduser-jobtracker-ubuntu.out
localhost: starting tasktracker, logging to /usr/local/hadoop/hadoop-1.2.1/libexec/../logs/hadoop-hduser-tasktracker-ubuntu.out

 A tool for checking whether the expected Hadoop processes are running is jps

hduser@ubuntu:/usr/local/hadoop/hadoop-1.2.1$ jps

You can also check with netstat if Hadoop is listening on the configured ports.

hduser@ubuntu:~$ sudo netstat -plten | grep java

If you want to stop your cluster, enter:

hduser@ubuntu:~$ /usr/local/hadoop/hadoop-1.2.1/bin/

You should see something like this:

stopping jobtracker
localhost: stopping tasktracker
no namenode to stop
localhost: no datanode to stop
localhost: stopping secondarynamenode

Maybe you could be interested in this other post about Sentiment Analysis.

Monday, December 01, 2014

Apache Mahout: Scalable machine learning library

Apache Mahout is a project of the Apache Software Foundation that builds intelligent algorithms that learn from some data input (machine learning). Mahout offers algorithms in three major areas: clustering, categorization, and recommender systems.

  • Taste
Taste is the Recommender System part of Mahout and it provides a very consistent and flexible collaborative filtering engine. Mahout provides a rich set of components from which you can construct a customized recommender system from a selection of algorithms. The package defines the following interfaces:
  1. DataModel
  2. UserSimilarity
  3. ItemSimilarity
  4. UserNeighborhood
  5. Recommender

This diagram shows the relationship between various Mahout components in a user-based recommender.

  • Installation (ubuntu)

Here follows a step-by-step guide to installing and testing the Mahout recommender system.

1. Make sure you have the Java JDK. 

$ java -version
java version "1.6.0_33"
OpenJDK Runtime Environment (IcedTea6 1.13.5) (6b33-1.13.5-1ubuntu0.12.04)
OpenJDK Client VM (build 23.25-b01, mixed mode, sharing)

2. Install the project manager Maven

$  mvn -version

Apache Maven 3.0.4
Maven home: /usr/share/maven
Java version: 1.6.0_33, vendor: Sun Microsystems Inc.
Java home: /usr/lib/jvm/java-6-openjdk-i386/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "linux", version: "3.8.0-44-generic", arch: "i386", family: "unix"

3. Download a Hadoop version

I downloaded version 1.2.1. Be careful with this: with Hadoop 2 you can run into compatibility problems with this version of Mahout.

$ tar xfz hadoop-1.2.1.tar.gz
$ sudo mv hadoop-1.2.1 /usr/local/hadoop

4. Download the Mahout package 

I downloaded the version 0.9: mahout-distribution-0.9-src.tar.gz

5. Unpack mahout-distribution-0.9-src.tar.gz

$ cd /opt/
$ tar -xvzf mahout-distribution-0.9-src.tar.gz
$ cd mahout-distribution-0.9
$ sudo mvn install

This compiles Mahout's code and runs the unit tests that come with it, to make sure everything is OK with the component.

If the build was successful you should see something like:

[INFO] ------------------------------------------------------------------------
[INFO] Total time: 55 minutes 23 seconds
[INFO] Finished at: Tue Dec 01 10:15:02 BRT 2014
[INFO] Final Memory: 60M/275M
[INFO] ------------------------------------------------------------------------

6. Now use gedit (or your favorite editor) to edit ~/.bashrc using the following command:

$ gedit ~/.bashrc

This will open the .bashrc file in a text editor. Go to the end of the file and paste/type the following content in it:

export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-i386
export HADOOP_INSTALL=/usr/local/hadoop/hadoop-1.2.1
export HADOOP_OPTS="$HADOOP_OPTS -Djava.library.path=/usr/local/hadoop/lib/native"

7. Executing Recomender 

Get your data in the following format:

userid, itemid, rating
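For instance, a tiny hypothetical mydata.dat in that format (the user, item and rating values are invented for illustration) could be generated with:

```python
# hypothetical (userid, itemid, rating) preferences
ratings = [
    (1, 101, 5.0), (1, 102, 3.0),
    (2, 101, 2.0), (2, 103, 5.0),
    (3, 102, 4.5),
]

# write one "userid,itemid,rating" line per preference
with open("mydata.dat", "w") as f:
    for user, item, rating in ratings:
        f.write(f"{user},{item},{rating}\n")

print(open("mydata.dat").read().splitlines()[0])  # 1,101,5.0
```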

$ cd /opt/mahout-distribution-0.9

For example, copy the following data and name it as mydata.dat  where you have installed Mahout:


Now you need to create the file users.dat in the same folder. 

$ chmod 777 -R /opt/mahout-distribution-0.9

Now, run:

$ bin/mahout recommenditembased --input mydata.dat --usersFile users.dat --numRecommendations 2 --output output/ --similarityClassname SIMILARITY_PEARSON_CORRELATION

The usersFile is where you list the users you want to generate recommendations for. You can change numRecommendations to the number of recommendations you desire.