Wednesday, December 03, 2014

Running Hadoop on Ubuntu Linux (Single-Node Cluster)


In this post I will describe the required steps for setting up a single-node Hadoop cluster backed by the Hadoop Distributed File System, running on Ubuntu.

The main goal is to get a simple Hadoop installation up and running so that you can play around with the software and learn more about it.

It has been tested with the following software versions:
  • Ubuntu 12.04
  • Hadoop 1.2.1

Hadoop

1. Make sure you have the Java JDK

Hadoop requires a working Java 1.5+ (a.k.a. Java 5) installation. Check which version is installed:

$ java -version
java version "1.6.0_33"
OpenJDK Runtime Environment (IcedTea6 1.13.5) (6b33-1.13.5-1ubuntu0.12.04)

OpenJDK Client VM (build 23.25-b01, mixed mode, sharing)
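
If Java is not installed yet, one option is to install the OpenJDK 6 packages from the Ubuntu repositories (the Oracle/Sun JDK 6 works as well):

$ sudo apt-get update
$ sudo apt-get install openjdk-6-jdk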

2. Adding a dedicated Hadoop system user

$ sudo addgroup hadoop
$ sudo adduser --ingroup hadoop hduser
$ su - hduser
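
You can check that the new account exists and belongs to the hadoop group:

$ id hduser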

3. Configuring SSH

Hadoop requires SSH access to manage its nodes. In this case we need to configure SSH access to localhost for the hduser user we created in step 2.

$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hduser/.ssh/id_rsa): 
Created directory '/home/hduser/.ssh'.
Your identification has been saved in /home/hduser/.ssh/id_rsa.
Your public key has been saved in /home/hduser/.ssh/id_rsa.pub.
The key fingerprint is:
22:2d:b1:fa:07:62:b2:b9:a9:9d:fc:3a:67:1e:48:b6 hduser@ubuntu
The key's randomart image is:
+--[ RSA 2048]----+
|                 |
|                 |
|    .            |
|     +           |
|  o + o S        |
|.oo+.o .         |
| =E...           |
|o+.oo..          |
|=.=B+.           |
+-----------------+

Generally, using an empty passphrase is not recommended, but in this case it is needed so that the key can be used without your interaction (you don’t want to enter a passphrase every time Hadoop interacts with its nodes).

Second, you have to enable SSH access to your local machine with this newly created key.

$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
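
Depending on your sshd configuration you may also need to tighten the permissions of the key files, since sshd ignores an authorized_keys file that is group- or world-writable:

$ chmod 700 $HOME/.ssh
$ chmod 600 $HOME/.ssh/authorized_keys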

The final step is to test the SSH setup by connecting to your local machine with the hduser user.

$ ssh localhost

If it fails, this article may help you.
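
A common cause of failure is that no SSH server is installed or running; on Ubuntu you can install and start OpenSSH with:

$ sudo apt-get install openssh-server
$ sudo service ssh start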


4. Download a Hadoop version

I downloaded Hadoop 1.2.1.
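
At the time of writing, the 1.2.1 tarball can be fetched from the Apache release archive (if the URL has moved, any mirror listed on the Hadoop releases page will do):

$ wget https://archive.apache.org/dist/hadoop/common/hadoop-1.2.1/hadoop-1.2.1.tar.gz

Then extract the archive and move it into place: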

$ tar xzf hadoop-1.2.1.tar.gz
$ sudo mkdir -p /usr/local/hadoop
$ sudo mv hadoop-1.2.1 /usr/local/hadoop/
$ sudo chown -R hduser:hadoop /usr/local/hadoop/hadoop-1.2.1

5. Add the following lines to the end of hduser's ~/.bashrc:

# Some convenient aliases and functions for running Hadoop-related commands
unalias fs &> /dev/null
alias fs="hadoop fs"
unalias hls &> /dev/null
alias hls="fs -ls"


#HADOOP VARIABLES START
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-i386
export HADOOP_INSTALL=/usr/local/hadoop/hadoop-1.2.1
export PATH=$PATH:$HADOOP_INSTALL/bin
#HADOOP VARIABLES END
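
Reload ~/.bashrc in your current shell and do a quick sanity check (this assumes the installation path used above):

$ source ~/.bashrc
$ hadoop version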

6. Configuration

Our goal in this tutorial is a single-node setup of Hadoop:

  • hadoop-env.sh
Open conf/hadoop-env.sh in the editor of your choice (if you used the installation path in this tutorial, the full path is /usr/local/hadoop/hadoop-1.2.1/conf/hadoop-env.sh) and set the JAVA_HOME environment variable:

export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-i386
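
If you are not sure which JDK directory to use, listing the installed JVMs shows the available paths (the directory name depends on your Java version and architecture):

$ ls /usr/lib/jvm/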

  • conf/*-site.xml
We will configure the directory where Hadoop will store its data files, the network ports it listens to, etc. Our setup will use Hadoop’s Distributed File System, HDFS, even though our little “cluster” only contains our single local machine.

You can leave the settings below “as is”, with the exception of the hadoop.tmp.dir parameter, which you must change to a directory of your choice. We will use the directory /app/hadoop/tmp in this tutorial. Hadoop’s default configuration uses hadoop.tmp.dir as the base temporary directory both for the local file system and for HDFS, so don’t be surprised if you see Hadoop creating the specified directory automatically on HDFS at some later point.

$ sudo mkdir -p /app/hadoop/tmp
$ sudo chown hduser:hadoop /app/hadoop/tmp
$ sudo chmod 750 /app/hadoop/tmp

Add the following snippets between the <configuration> ... </configuration> tags in the respective configuration XML file.

In file conf/core-site.xml:

<property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
  <description>A base for other temporary directories.</description>
</property>

<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default file system.  A URI whose
  scheme and authority determine the FileSystem implementation.  The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class.  The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
</property>

In file conf/mapred-site.xml:


<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <description>The host and port that the MapReduce job tracker runs
  at.  If "local", then jobs are run in-process as a single map
  and reduce task.
  </description>
</property>


In file conf/hdfs-site.xml:

<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.
  </description>
</property>

7. Formatting the HDFS filesystem via the NameNode

The first step to starting up your Hadoop installation is formatting the Hadoop filesystem which is implemented on top of the local filesystem of your “cluster” (which includes only your local machine if you followed this post). You need to do this the first time you set up a Hadoop cluster.

Do not format a running Hadoop filesystem as you will lose all the data currently in the cluster (in HDFS)! 

To format the filesystem (which simply initializes the directory specified by the dfs.name.dir variable), run the following command as the hduser user (do not run it with sudo, otherwise the HDFS data directories end up owned by root):

$ /usr/local/hadoop/hadoop-1.2.1/bin/hadoop namenode -format

You should see something like:

14/12/03 03:40:17 INFO namenode.NameNode: STARTUP_MSG: 
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = ubuntu/127.0.1.1
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 1.2.1
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.2 -r 1503152; compiled by 'mattf' on Mon Jul 22 15:23:09 PDT 2013
STARTUP_MSG:   java = 1.6.0_33
************************************************************/
14/12/03 03:40:17 INFO util.GSet: Computing capacity for map BlocksMap
14/12/03 03:40:17 INFO util.GSet: VM type       = 32-bit
14/12/03 03:40:17 INFO util.GSet: 2.0% max memory = 1013645312
14/12/03 03:40:17 INFO util.GSet: capacity      = 2^22 = 4194304 entries
14/12/03 03:40:17 INFO util.GSet: recommended=4194304, actual=4194304
14/12/03 03:40:18 INFO namenode.FSNamesystem: fsOwner=root
14/12/03 03:40:18 INFO namenode.FSNamesystem: supergroup=supergroup
14/12/03 03:40:18 INFO namenode.FSNamesystem: isPermissionEnabled=true
14/12/03 03:40:18 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=100
14/12/03 03:40:18 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
14/12/03 03:40:18 INFO namenode.FSEditLog: dfs.namenode.edits.toleration.length = 0
14/12/03 03:40:18 INFO namenode.NameNode: Caching file names occuring more than 10 times 
14/12/03 03:40:18 INFO common.Storage: Image file /app/hadoop/tmp/dfs/name/current/fsimage of size 110 bytes saved in 0 seconds.
14/12/03 03:40:18 INFO namenode.FSEditLog: closing edit log: position=4, editlog=/app/hadoop/tmp/dfs/name/current/edits
14/12/03 03:40:18 INFO namenode.FSEditLog: close success: truncate to 4, editlog=/app/hadoop/tmp/dfs/name/current/edits
14/12/03 03:40:18 INFO common.Storage: Storage directory /app/hadoop/tmp/dfs/name has been successfully formatted.
14/12/03 03:40:18 INFO namenode.NameNode: SHUTDOWN_MSG: 
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at ubuntu/127.0.1.1
************************************************************/
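
If formatting succeeded, the NameNode metadata now lives below the hadoop.tmp.dir you configured (the exact file names may differ between versions):

hduser@ubuntu:~$ ls -l /app/hadoop/tmp/dfs/name/current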

8. Starting your single-node cluster


hduser@ubuntu:~$ /usr/local/hadoop/hadoop-1.2.1/bin/start-all.sh

This will start up a NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker on your machine.

 You should see something like:

starting namenode, logging to /usr/local/hadoop/hadoop-1.2.1/libexec/../logs/hadoop-hduser-namenode-ubuntu.out
localhost: starting datanode, logging to /usr/local/hadoop/hadoop-1.2.1/libexec/../logs/hadoop-hduser-datanode-ubuntu.out
localhost: starting secondarynamenode, logging to /usr/local/hadoop/hadoop-1.2.1/libexec/../logs/hadoop-hduser-secondarynamenode-ubuntu.out
starting jobtracker, logging to /usr/local/hadoop/hadoop-1.2.1/libexec/../logs/hadoop-hduser-jobtracker-ubuntu.out
localhost: starting tasktracker, logging to /usr/local/hadoop/hadoop-1.2.1/libexec/../logs/hadoop-hduser-tasktracker-ubuntu.out

A convenient tool for checking whether the expected Hadoop processes are running is jps:

hduser@ubuntu:/usr/local/hadoop/hadoop-1.2.1$ jps

If everything started correctly, the listing should include NameNode, DataNode, SecondaryNameNode, JobTracker, TaskTracker and Jps itself.

You can also check with netstat whether Hadoop is listening on the configured ports, and open the built-in web interfaces in a browser (by default the NameNode UI runs on http://localhost:50070 and the JobTracker UI on http://localhost:50030).

hduser@ubuntu:~$ sudo netstat -plten | grep java
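
With the cluster running, a quick end-to-end test is to list the HDFS root directory and run one of the example jobs bundled with the release (this assumes the examples jar is named hadoop-examples-1.2.1.jar in the installation root, as in the stock 1.2.1 tarball):

hduser@ubuntu:~$ hadoop fs -ls /
hduser@ubuntu:~$ hadoop jar /usr/local/hadoop/hadoop-1.2.1/hadoop-examples-1.2.1.jar pi 2 10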



If you want to stop your cluster, run:

hduser@ubuntu:~$ /usr/local/hadoop/hadoop-1.2.1/bin/stop-all.sh

You should see something like this (lines such as “no namenode to stop” mean that the daemon in question was not actually running; in that case check the log files under the logs directory of your Hadoop installation):

stopping jobtracker
localhost: stopping tasktracker
no namenode to stop
localhost: no datanode to stop
localhost: stopping secondarynamenode


You may also be interested in this other post about Sentiment Analysis.
