One of the key benefits of Copper is that it allows very different applications to easily and securely share physical resources. We have standard packages that make grid queueing and MPI integration dead simple, but I figured it might be valuable to show off a newer paradigm, like Hadoop. This continues what Kannan started, demonstrating how Copper makes deploying real-world services simple.
Hadoop is quickly gaining traction as a powerful tool for data analysis. A key benefit of Hadoop is that it integrates data management and a map-reduce engine, so that analysis and data processing jobs can be scheduled and performed in a data-aware fashion (e.g. computation goes to where the data is). This post will show a relatively simple setup, without integrating Copper's local storage containers (I'll reserve the right to do this step in a future post! :).
Requirements
I'll assume that like me, you have a virtual cluster named 'hadoop'. If you don't know how to create a virtual cluster, you can see the tutorials. I just used the cluster-in-a-box script to fire up a new Ubuntu cluster.
Install Cloudera Hadoop Packages
The first step is to jump in to our virtual cluster (using gc-vm console or via ssh) and install the Cloudera Hadoop distribution. I'm using Ubuntu Jaunty, so I'll be following the Cloudera instructions for debian-based distributions (it's a small delta for the RPM-based ones). First, we'll edit /etc/apt/sources.list:
- Make sure that the Ubuntu repositories include universe and multiverse.
- Add the Cloudera repositories:
deb http://archive.cloudera.com/debian jaunty-cdh3 contrib
Next, we'll install all the Hadoop packages (you may have to type 'Y' to accept unsigned packages or you can go through the Cloudera instructions and install the key for their repository).
sudo apt-get updatesudo apt-get install hadoop-0.20-conf-pseudo
That was easy! Hadoop is installed and ready to run in pseudo-distributed mode. Don't run it just yet though. Psuedo-distributed mode involves all the services running on a single node. Although that's a good start, I really want to run it across dozens of machines in the next couple of minutes.
Tweak Configuration Files
Before we edit files to create a fully distributed Hadoop, we need to know a few things. Each Hadoop cluster requires exactly one namenode and one jobtracker. These services co-ordinate the activities of the datanodes and tasktrackers, which host storage and co-ordinate computation respectively. In our setup, the virtual cluster master will host the namenode and jobtracker, and the clones will each run a datanode and a tasktracker. We will automate this so that whenever we create new clones, they will just start the appropriate services and join the Hadoop cluster!
First I'm going to edit /etc/hadoop/conf/mapred-site.xml to ensure that the tasktracker that runs on the clones always talks to the jobtracker on our master. In our configuration, our master always gets the address 10.0.0.1 on its virtual cluster private interface. So, in mapred-site.xml, I simply changed localhost:8021 to 10.0.0.1:8021. It now reads:
<?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <configuration> <property> <name>mapred.job.tracker</name> <value>10.0.0.1:8021</value> </property> </configuration>
Second, I will do something similiar for the datanodes. Because we will run the namenode on the master of the virtual cluster, we need to edit /etc/hadoop/conf/core-site.xml so that the datanodes on the clones talk back to the master. As seen in my new core-site.xml below, I changed hdfs://localhost:8020 to be hdfs://10.0.0.1:8020 :
<?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <configuration> <property> <name>fs.default.name</name> <value>hdfs://10.0.0.1:8020</value> </property> <property> <name>hadoop.tmp.dir</name> <value>/data/hadoop/${user.name}</value> </property> </configuration>
In the above file, I also changed the "hadoop.tmp.dir" to be "/data/hadoop/${user.name}". This is really a move for later, when I hook in Copper local storage containers. For now I'll just create this directory and make sure that the permissions are correct.
root@ubuntu# mkdir -p /data/hadoop/hadooproot@ubuntu# chown hadoop:hadoop /data/hadoop/hadoop
Start Services
Finally, before starting the namenode and jobtracker on the master, I need to format the files used for the namenode.
At last, I start-up the namenode and jobtracker on the master.
root@ubuntu# /etc/init.d/hadoop-*-namenode start root@ubuntu# /etc/init.d/hadoop-*-jobtracker start
Now I am running the services that will stay on the master. On each of the clones that I create, I would like to have a tasktracker and a datanode running. In order to automate this from the begining, I simply install a post-clone hook by creating the file /etc/gridcentric/post-clone-on-clone.d/hadoop with the following contents and making sure that it is executable.
/etc/init.d/hadoop-*-namenode stop /etc/init.d/hadoop-*-jobtracker stop /etc/init.d/hadoop-*-datanode start /etc/init.d/hadoop-*-tasktracker start
Creating Your Cluster and Running Jobs
I am ready to create my hadoop cluster. To create four clones running datanodes and tasktrackers, I simply run:
root@ubuntu# gc clone 4
I wait for 30 seconds or so for the datanodes and tasktrackers to co-ordinate with the master and then I run sample Hadoop applications!
To grow my virtual cluster, I can simply clone again. When I am finished, I run gc killall to remove my clones and free up the resources for someone else. That was pretty easy!
1 comments: