
Friday, July 30, 2010

Howto: Build a Hadoop cluster in five minutes

One of the key benefits of Copper is that it allows very different applications to easily and securely share physical resources. We have standard packages that make grid queueing and MPI integration dead simple, but I figured it might be valuable to show off a newer paradigm, like Hadoop. This continues what Kannan started, demonstrating how Copper makes deploying real-world services simple.

Hadoop is quickly gaining traction as a powerful tool for data analysis. A key benefit of Hadoop is that it integrates data management and a map-reduce engine, so that analysis and data processing jobs can be scheduled and performed in a data-aware fashion (i.e. computation goes to where the data is). This post will show a relatively simple setup, without integrating Copper's local storage containers (I'll reserve the right to do that step in a future post! :).

Requirements

A GridCentric Copper deployment (free trial available here, software downloadable here).

I'll assume that like me, you have a virtual cluster named 'hadoop'. If you don't know how to create a virtual cluster, you can see the tutorials. I just used the cluster-in-a-box script to fire up a new Ubuntu cluster.

Install Cloudera Hadoop Packages

The first step is to jump into our virtual cluster (using gc-vm console or via ssh) and install the Cloudera Hadoop distribution. I'm using Ubuntu Jaunty, so I'll be following the Cloudera instructions for Debian-based distributions (it's a small delta for the RPM-based ones). First, we'll edit /etc/apt/sources.list:
  1. Make sure that the Ubuntu repositories include universe and multiverse.
  2. Add the Cloudera repositories:
deb http://archive.cloudera.com/debian jaunty-cdh3 contrib
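
If you would rather script the sources.list edit than open an editor, here is a minimal sketch. The helper name add_repo_line is my own invention, not part of apt or Copper; it only appends the line when an identical one is not already in the file, so running it twice is harmless.

```shell
#!/bin/sh
# add_repo_line FILE LINE: append LINE to FILE unless an identical line is already there.
# (add_repo_line is a hypothetical helper name, not part of apt or Copper.)
add_repo_line() {
    file=$1
    line=$2
    # -x matches whole lines, -F treats the text literally, -q suppresses output.
    if ! grep -qxF -- "$line" "$file" 2>/dev/null; then
        printf '%s\n' "$line" >> "$file"
    fi
}

# Example (run as root):
#   add_repo_line /etc/apt/sources.list "deb http://archive.cloudera.com/debian jaunty-cdh3 contrib"
```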

Next, we'll install all the Hadoop packages (you may have to type 'Y' to accept unsigned packages, or you can follow the Cloudera instructions and install the key for their repository).
sudo apt-get update
sudo apt-get install hadoop-0.20-conf-pseudo
That was easy! Hadoop is installed and ready to run in pseudo-distributed mode. Don't run it just yet though. Pseudo-distributed mode runs all the services on a single node. Although that's a good start, I really want to run it across dozens of machines in the next couple of minutes.

Tweak Configuration Files

Before we edit files to create a fully distributed Hadoop, we need to know a few things. Each Hadoop cluster requires exactly one namenode and one jobtracker. These services co-ordinate the activities of the datanodes and tasktrackers, which host storage and co-ordinate computation respectively. In our setup, the virtual cluster master will host the namenode and jobtracker, and the clones will each run a datanode and a tasktracker. We will automate this so that whenever we create new clones, they will just start the appropriate services and join the Hadoop cluster!

First I'm going to edit /etc/hadoop/conf/mapred-site.xml to ensure that the tasktracker that runs on the clones always talks to the jobtracker on our master. In our configuration, our master always gets the address 10.0.0.1 on its virtual cluster private interface. So, in mapred-site.xml, I simply changed localhost:8021 to 10.0.0.1:8021. It now reads:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
<property>
<name>mapred.job.tracker</name>
<value>10.0.0.1:8021</value>
</property>
</configuration>
Second, I will do something similar for the datanodes. Because we will run the namenode on the master of the virtual cluster, we need to edit /etc/hadoop/conf/core-site.xml so that the datanodes on the clones talk back to the master. As seen in my new core-site.xml below, I changed hdfs://localhost:8020 to be hdfs://10.0.0.1:8020:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://10.0.0.1:8020</value>
</property>

<property>
<name>hadoop.tmp.dir</name>
<value>/data/hadoop/${user.name}</value>
</property>
</configuration>
In the above file, I also changed the "hadoop.tmp.dir" to be "/data/hadoop/${user.name}". This is really a move for later, when I hook in Copper local storage containers. For now I'll just create this directory and make sure that the permissions are correct.
root@ubuntu# mkdir -p /data/hadoop/hadoop
root@ubuntu# chown hadoop:hadoop /data/hadoop/hadoop
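
The two localhost-to-master edits above can also be scripted. This is only a sketch under a few assumptions: point_config_at_master is a hypothetical helper name, and it assumes the files still contain the stock localhost:8021 and hdfs://localhost:8020 values. It does not touch hadoop.tmp.dir, which I changed by hand.

```shell
#!/bin/sh
# point_config_at_master CONF_DIR MASTER_IP
# Rewrites the localhost endpoints in mapred-site.xml and core-site.xml
# so that they name the master instead. (Hypothetical helper name.)
point_config_at_master() {
    conf_dir=$1
    master_ip=$2
    for f in mapred-site.xml core-site.xml; do
        # Write to a temporary file first so a failed sed never truncates the
        # original, then copy the contents back so the file keeps its owner and mode.
        sed -e "s|localhost:8021|${master_ip}:8021|" \
            -e "s|hdfs://localhost:8020|hdfs://${master_ip}:8020|" \
            "$conf_dir/$f" > "$conf_dir/$f.tmp" &&
        cat "$conf_dir/$f.tmp" > "$conf_dir/$f" &&
        rm -f "$conf_dir/$f.tmp"
    done
}

# Example (run as root):
#   point_config_at_master /etc/hadoop/conf 10.0.0.1
```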
Start Services

Finally, before starting the namenode and jobtracker on the master, I need to format the files used for the namenode.
root@ubuntu# su - hadoop -c "hadoop namenode -format" 
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = ubuntu.gridcentric.ca/192.168.1.76
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 0.20.2+320
STARTUP_MSG:   build =  -r 9b72d268a0b590b4fd7d13aca17c1c453f8bc957; compiled by 'root' on Mon Jun 28 23:15:26 UTC 2010
************************************************************/
Re-format filesystem in /var/lib/hadoop-0.20/cache/hadoop/dfs/name ? (Y or N) Y
10/07/30 18:09:29 INFO namenode.FSNamesystem: fsOwner=hadoop,hadoop
10/07/30 18:09:29 INFO namenode.FSNamesystem: supergroup=supergroup
10/07/30 18:09:29 INFO namenode.FSNamesystem: isPermissionEnabled=false
10/07/30 18:09:29 INFO common.Storage: Image file of size 96 saved in 0 seconds.
10/07/30 18:09:29 INFO common.Storage: Storage directory /var/lib/hadoop-0.20/cache/hadoop/dfs/name has been successfully formatted.
10/07/30 18:09:29 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at ubuntu.gridcentric.ca/192.168.1.76
************************************************************/
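
Note the "Re-format filesystem" prompt in the output above: formatting a name directory that is already in use wipes the filesystem metadata. If you script this step, a small guard helps. The sketch below assumes that a formatted name directory contains current/VERSION, which is the layout I know from Hadoop 0.20; namenode_is_formatted is a hypothetical helper name.

```shell
#!/bin/sh
# namenode_is_formatted NAME_DIR: succeed if NAME_DIR already holds a formatted image.
# Assumption: a formatted name directory contains current/VERSION, as in Hadoop 0.20.
namenode_is_formatted() {
    [ -f "$1/current/VERSION" ]
}

# Example (run as root); the path is the one shown in the format output above:
#   name_dir=/var/lib/hadoop-0.20/cache/hadoop/dfs/name
#   namenode_is_formatted "$name_dir" || su - hadoop -c "hadoop namenode -format"
```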
At last, I start up the namenode and jobtracker on the master.
root@ubuntu# /etc/init.d/hadoop-*-namenode start
root@ubuntu# /etc/init.d/hadoop-*-jobtracker start
Now I am running the services that will stay on the master. On each of the clones that I create, I would like to have a tasktracker and a datanode running. To automate this from the beginning, I simply install a post-clone hook by creating the file /etc/gridcentric/post-clone-on-clone.d/hadoop with the following contents and making sure that it is executable.
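#!/bin/sh
# A clone starts as a copy of the master, so stop the master-only services first.
/etc/init.d/hadoop-*-namenode    stop
/etc/init.d/hadoop-*-jobtracker  stop
# Then start the worker services, which report back to the master.
/etc/init.d/hadoop-*-datanode    start
/etc/init.d/hadoop-*-tasktracker start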
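
Creating the hook file and setting the executable bit can be done in one go. Here is a minimal sketch; install_clone_hook is a hypothetical helper name, and the "#!/bin/sh" line written into the hook is a defensive addition of mine, since I have not checked how Copper invokes the hooks.

```shell
#!/bin/sh
# install_clone_hook HOOK_DIR: write the hook as HOOK_DIR/hadoop and mark it executable.
# (install_clone_hook is a hypothetical helper name; the "#!/bin/sh" line is my addition.)
install_clone_hook() {
    hook_dir=$1
    mkdir -p "$hook_dir" || return 1
    # Quoted 'EOF' keeps the shell from expanding the * globs while writing the file.
    cat > "$hook_dir/hadoop" <<'EOF'
#!/bin/sh
/etc/init.d/hadoop-*-namenode    stop
/etc/init.d/hadoop-*-jobtracker  stop
/etc/init.d/hadoop-*-datanode    start
/etc/init.d/hadoop-*-tasktracker start
EOF
    chmod +x "$hook_dir/hadoop"
}

# Example (run as root on the master, before cloning):
#   install_clone_hook /etc/gridcentric/post-clone-on-clone.d
```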
Creating Your Cluster and Running Jobs

I am ready to create my Hadoop cluster. To create four clones running datanodes and tasktrackers, I simply run:
root@ubuntu# gc clone 4
I wait for 30 seconds or so for the datanodes and tasktrackers to co-ordinate with the master and then I run sample Hadoop applications!
root@ubuntu:~# hadoop jar /usr/lib/hadoop/hadoop-*-examples.jar pi 2 100000
Number of Maps  = 2
Samples per Map = 100000
Wrote input for Map #0
Wrote input for Map #1
Starting Job
10/07/29 22:31:21 INFO mapred.FileInputFormat: Total input paths to process : 2
10/07/29 22:31:21 INFO mapred.JobClient: Running job: job_201007292045_0008
10/07/29 22:31:22 INFO mapred.JobClient:  map 0% reduce 0%
10/07/29 22:31:51 INFO mapred.JobClient:  map 100% reduce 0%                      
10/07/29 22:32:12 INFO mapred.JobClient:  map 100% reduce 100%
10/07/29 22:32:22 INFO mapred.JobClient: Job complete: job_201007292045_0008
10/07/29 22:32:22 INFO mapred.JobClient: Counters: 18
10/07/29 22:32:22 INFO mapred.JobClient:   Job Counters
10/07/29 22:32:22 INFO mapred.JobClient:     Launched reduce tasks=1
10/07/29 22:32:22 INFO mapred.JobClient:     Launched map tasks=2
10/07/29 22:32:22 INFO mapred.JobClient:     Data-local map tasks=2
10/07/29 22:32:22 INFO mapred.JobClient:   FileSystemCounters
10/07/29 22:32:22 INFO mapred.JobClient:     FILE_BYTES_READ=50
10/07/29 22:32:22 INFO mapred.JobClient:     HDFS_BYTES_READ=236
10/07/29 22:32:22 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=170
10/07/29 22:32:22 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=215
...
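
Instead of waiting a fixed 30 seconds before submitting a job, you could poll the namenode until the expected number of datanodes has registered. This is a sketch, not something I ran for this post. It assumes that "hadoop dfsadmin -report" prints a line like "Datanodes available: 4 (4 total, 0 dead)", which is the Hadoop 0.20 wording; both function names are hypothetical.

```shell
#!/bin/sh
# live_datanodes: read a "hadoop dfsadmin -report" dump on stdin, print the datanode count.
# Assumption: the report contains a line such as
#   "Datanodes available: 4 (4 total, 0 dead)"
# which is the Hadoop 0.20 wording; other versions may word it differently.
live_datanodes() {
    sed -n 's/^Datanodes available: \([0-9][0-9]*\).*/\1/p'
}

# wait_for_datanodes WANTED [TRIES]: poll once per second until WANTED datanodes report in.
# Returns 0 on success, 1 if TRIES (default 60) polls pass without reaching WANTED.
wait_for_datanodes() {
    wanted=$1
    tries=${2:-60}
    while [ "$tries" -gt 0 ]; do
        have=$(hadoop dfsadmin -report 2>/dev/null | live_datanodes)
        [ "${have:-0}" -ge "$wanted" ] && return 0
        tries=$((tries - 1))
        sleep 1
    done
    return 1
}

# Example, after "gc clone 4":
#   wait_for_datanodes 4 && hadoop jar /usr/lib/hadoop/hadoop-*-examples.jar pi 2 100000
```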

To grow my virtual cluster, I can simply clone again.  When I am finished, I run gc killall to remove my clones and free up the resources for someone else.  That was pretty easy!

1 comments:

  1. Hi,
    Nice article. I have a small question: I am not able to edit the *-site.xml files in the /etc/hadoop/conf/ folder. It says I don't have permission. Can you help me with this?