Syntax Highlighter

Friday, July 30, 2010

Howto: Build a Hadoop cluster in five minutes

One of the key benefits of Copper is that it allows very different applications to easily and securely share physical resources. We have standard packages that make grid queueing and MPI integration dead simple, but I figured it might be valuable to show off a newer paradigm, like Hadoop. This continues what Kannan started, demonstrating how Copper makes deploying real-world services simple.

Hadoop is quickly gaining traction as a powerful tool for data analysis. A key benefit of Hadoop is that it integrates data management and a map-reduce engine, so that analysis and data processing jobs can be scheduled and performed in a data-aware fashion (e.g. computation goes to where the data is). This post will show a relatively simple setup, without integrating Copper's local storage containers (I'll reserve the right to do this step in a future post! :).

Requirements

A GridCentric Copper deployment (free trial available here, software downloadable here).

I'll assume that like me, you have a virtual cluster named 'hadoop'. If you don't know how to create a virtual cluster, you can see the tutorials. I just used the cluster-in-a-box script to fire up a new Ubuntu cluster.

Install Cloudera Hadoop Packages

The first step is to jump in to our virtual cluster (using gc-vm console or via ssh) and install the Cloudera Hadoop distribution. I'm using Ubuntu Jaunty, so I'll be following the Cloudera instructions for debian-based distributions (it's a small delta for the RPM-based ones). First, we'll edit /etc/apt/sources.list:
  1. Make sure that the Ubuntu repositories include universe and multiverse.
  2. Add the Cloudera repositories:
deb http://archive.cloudera.com/debian jaunty-cdh3 contrib

Next, we'll install all the Hadoop packages (you may have to type 'Y' to accept unsigned packages or you can go through the Cloudera instructions and install the key for their repository).
sudo apt-get update
sudo apt-get install hadoop-0.20-conf-pseudo
That was easy! Hadoop is installed and ready to run in pseudo-distributed mode. Don't run it just yet though. Psuedo-distributed mode involves all the services running on a single node. Although that's a good start, I really want to run it across dozens of machines in the next couple of minutes.

Tweak Configuration Files

Before we edit files to create a fully distributed Hadoop, we need to know a few things. Each Hadoop cluster requires exactly one namenode and one jobtracker. These services co-ordinate the activities of the datanodes and tasktrackers, which host storage and co-ordinate computation respectively. In our setup, the virtual cluster master will host the namenode and jobtracker, and the clones will each run a datanode and a tasktracker. We will automate this so that whenever we create new clones, they will just start the appropriate services and join the Hadoop cluster!

First I'm going to edit /etc/hadoop/conf/mapred-site.xml to ensure that the tasktracker that runs on the clones always talks to the jobtracker on our master. In our configuration, our master always gets the address 10.0.0.1 on its virtual cluster private interface. So, in mapred-site.xml, I simply changed localhost:8021 to 10.0.0.1:8021. It now reads:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
<property>
<name>mapred.job.tracker</name>
<value>10.0.0.1:8021</value>
</property>
</configuration>
Second, I will do something similiar for the datanodes. Because we will run the namenode on the master of the virtual cluster, we need to edit /etc/hadoop/conf/core-site.xml so that the datanodes on the clones talk back to the master. As seen in my new core-site.xml below, I changed hdfs://localhost:8020 to be hdfs://10.0.0.1:8020:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://10.0.0.1:8020</value>
</property>

<property>
<name>hadoop.tmp.dir</name>
<value>/data/hadoop/${user.name}</value>
</property>
</configuration>
In the above file, I also changed the "hadoop.tmp.dir" to be "/data/hadoop/${user.name}". This is really a move for later, when I hook in Copper local storage containers. For now I'll just create this directory and make sure that the permissions are correct.
root@ubuntu# mkdir -p /data/hadoop/hadoop
root@ubuntu# chown hadoop:hadoop /data/hadoop/hadoop
Start Services

Finally, before starting the namenode and jobtracker on the master, I need to format the files used for the namenode.
root@ubuntu# su - hadoop -c "hadoop namenode -format" 
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = ubuntu.gridcentric.ca/192.168.1.76
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 0.20.2+320
STARTUP_MSG:   build =  -r 9b72d268a0b590b4fd7d13aca17c1c453f8bc957; compiled by 'root' on Mon Jun 28 23:15:26 UTC 2010
************************************************************/
Re-format filesystem in /var/lib/hadoop-0.20/cache/hadoop/dfs/name ? (Y or N) Y
10/07/30 18:09:29 INFO namenode.FSNamesystem: fsOwner=hadoop,hadoop
10/07/30 18:09:29 INFO namenode.FSNamesystem: supergroup=supergroup
10/07/30 18:09:29 INFO namenode.FSNamesystem: isPermissionEnabled=false
10/07/30 18:09:29 INFO common.Storage: Image file of size 96 saved in 0 seconds.
10/07/30 18:09:29 INFO common.Storage: Storage directory /var/lib/hadoop-0.20/cache/hadoop/dfs/name has been successfully formatted.
10/07/30 18:09:29 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at ubuntu.gridcentric.ca/192.168.1.76
************************************************************/
At last, I start-up the namenode and jobtracker on the master.
root@ubuntu# /etc/init.d/hadoop-*-namenode start
root@ubuntu# /etc/init.d/hadoop-*-jobtracker start
Now I am running the services that will stay on the master. On each of the clones that I create, I would like to have a tasktracker and a datanode running. In order to automate this from the begining, I simply install a post-clone hook by creating the file /etc/gridcentric/post-clone-on-clone.d/hadoop with the following contents and making sure that it is executable.
/etc/init.d/hadoop-*-namenode    stop
/etc/init.d/hadoop-*-jobtracker  stop
/etc/init.d/hadoop-*-datanode    start
/etc/init.d/hadoop-*-tasktracker start
Creating Your Cluster and Running Jobs

I am ready to create my hadoop cluster. To create four clones running datanodes and tasktrackers, I simply run:
root@ubuntu# gc clone 4
I wait for 30 seconds or so for the datanodes and tasktrackers to co-ordinate with the master and then I run sample Hadoop applications!
root@ubuntu:~# hadoop jar /usr/lib/hadoop/hadoop-*-examples.jar pi 2 100000
Number of Maps  = 2
Samples per Map = 100000
Wrote input for Map #0
Wrote input for Map #1
Starting Job
10/07/29 22:31:21 INFO mapred.FileInputFormat: Total input paths to process : 2
10/07/29 22:31:21 INFO mapred.JobClient: Running job: job_201007292045_0008
10/07/29 22:31:22 INFO mapred.JobClient:  map 0% reduce 0%
10/07/29 22:31:51 INFO mapred.JobClient:  map 100% reduce 0%                      
10/07/29 22:32:12 INFO mapred.JobClient:  map 100% reduce 100%
10/07/29 22:32:22 INFO mapred.JobClient: Job complete: job_201007292045_0008
10/07/29 22:32:22 INFO mapred.JobClient: Counters: 18
10/07/29 22:32:22 INFO mapred.JobClient:   Job Counters
10/07/29 22:32:22 INFO mapred.JobClient:     Launched reduce tasks=1
10/07/29 22:32:22 INFO mapred.JobClient:     Launched map tasks=2
10/07/29 22:32:22 INFO mapred.JobClient:     Data-local map tasks=2
10/07/29 22:32:22 INFO mapred.JobClient:   FileSystemCounters
10/07/29 22:32:22 INFO mapred.JobClient:     FILE_BYTES_READ=50
10/07/29 22:32:22 INFO mapred.JobClient:     HDFS_BYTES_READ=236
10/07/29 22:32:22 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=170
10/07/29 22:32:22 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=215
...

To grow my virtual cluster, I can simply clone again.  When I am finished, I run gc killall to remove my clones and free up the resources for someone else.  That was pretty easy!

Wednesday, July 28, 2010

Howto: Build a ten node memcached cluster in five minutes

Introduction

Memcached is used to cache everything from dynamically generated web pages to database query results. Multiple memcached instances are used in situations where a single memcached server does not have enough CPU or RAM to fulfill application requirements.

In a distributed setup, memcached servers run independently of each other, and have no knowledge of other servers. Memcached's design pushes the logic of dealing with multiple servers to the client, which makes things very simple on the server end.
GridCentric Copper is a bare-metal-up virtualization and management stack built to manage "virtual clusters". Copper's fast virtual machine cloning (which works similarly to fork() in UNIX, except it creates clones of entire running virtual machines in just a few seconds) can be used to quickly expand and contract a collection of memcached servers. We're going to use this to make a single memcached virtual machine, then grow that machine to a pool of 10 virtual machines in a couple seconds.

Requirements
  1. GridCentric Copper (free fully featured, 60-day trial available here, software downloadable here).
  2. The GridCentric DNS mapper (part of the Copper distribution).
  3. Enough free hardware resources to host 10 virtual machines.
Process

Create a New Virtual Cluster

First, create a new virtual cluster using either the included "cluster-in-a-box" script or via the Admin Web Console. In this walkthrough we will name our virtual cluster "memcached". Make sure to give the virtual cluster a managed public network interface, and make sure the "master only" checkbox is unchecked.

Let's boot the memcached virtual cluster.
gc-vc boot memcached
This will bring up a single virtual machine - a cluster of size 1. This virtual machine will be the basis of our memcached cluster, so we need to set it up. We get a console on the new master with:
gc-vm console memcached-0
This will give us a login prompt on the master.

Install memcached

We install memcached according to the README and configure it to listen on all addresses (edit the /etc/memcached.conf file and change '-l 127.0.0.1' to '-l 0.0.0.0'), and then start it up.

We now have a Copper virtual cluster, which for now contains a single virtual machine running memcached.

Create some clone virtual machines

To start up more memcached instances, we invoke the gc command line tool from within the virtual machine to create some clones of itself:
[ubuntu]$ gc clone 9
After a few seconds, the clone command returns, we have now have 10 virtual machines running memcached.

Note: we just scaled 1 memcached instance to 10 with zero extra configuration, in just a few seconds!

Configure the client address pool
Now we have to let the memcached clients know about the servers so that the clients can add the servers to their pool.

One easy method is just to use the GridCentric DNS service. This service runs on any computer with access to the Copper head node, and does dynamic mapping of DNS lookups to virtual clusters within Copper.

One of its particularly nice features is that it maps the name of a running virtual cluster to the list of all public managed IPs on that virtual cluster.

On some machine which has access to the GridCentric DNS service (let's say it's running on host gridcentric-headnode), we execute the following:
$ dig +short memcached @gridcentric-headnode
192.168.1.144
192.168.1.142   
192.168.1.143   
192.168.1.140   
192.168.1.146   
192.168.1.141   
192.168.1.145   
192.168.1.149   
192.168.1.147   
192.168.1.148

This returns a list of public IPs for all 10 of the memcached servers running within the virtual cluster. You could even design your web application to automatically query this DNS server and periodically update its view of the set of servers running.

That's it! You're now ready to start using memcached on your client nodes.

Homework:
Try to accomplish the above on a traditional cluster setup ;)