Friday, August 6, 2010

Elastic Build System in Action

GridCentric's Copper platform enables multiple diverse applications to securely and easily share the same underlying physical resources. Moreover, it allows these applications to scale up to create on-demand homogeneous clusters, and to scale back down again. My last post described how this capability is ideally suited to solving the problems with Distributed Continuous Integration (CI) servers. In this post I will show how we can make Hudson, a popular CI server, aware of the Copper platform, so that it uses Copper's API to dynamically create identical slave machines and becomes an elastic build system.

Requirements

A Copper deployment (60-day free fully featured license)

A virtual cluster named elastic-hudson created in the Copper platform. If you are unfamiliar with creating virtual clusters in Copper, please take some time to follow the tutorials.

Configure the Virtual Cluster

This section will describe all the configuration we need to do in Copper to boot the virtual machine that will host our Hudson installation. Once the virtual machine is booted, it will be able to use the Copper API to clone itself.

As mentioned earlier I have a virtual cluster already created called elastic-hudson. For the root container I downloaded the pre-configured Ubuntu image, but there are other distros available if you would prefer something else. The only special configuration for this virtual cluster is that I am going to attach a public network to it.

Whenever a virtual machine uses the clone API, the new virtual machine is automatically configured to be a part of a private network shared by all of the clones and the original machine. Since Hudson is a build system, it will need access to my Mercurial repository, which is hosted on a separate machine outside this private network.

To remedy this problem, I configured my virtual cluster to also have a public network. In this case, all the virtual machines in the cluster now have access to two networks: the original private network and the public network. I will be using the private network for communication between the Hudson master and the Hudson slaves, and then the public network for the machines to communicate with the repository.



That's all the configuration needed, and now I can boot the elastic-hudson virtual cluster.

Configure Hudson

Once the elastic-hudson virtual cluster is booted we should be able to log into it. By using the built-in DNS Server that comes with the Copper platform I can access the virtual cluster by simply using the domain name elastic-hudson.clusters. Otherwise, I can use the management tools to determine the virtual machine's public IP address.

Once logged into the virtual machine I will set up a vanilla Hudson install, as I would do in any other environment. So I follow the Hudson install instructions, install the tools (e.g. Mercurial, Ant, unzip, wget, make, etc.) that are required to build my project, and finally configure Hudson to build my projects. So far nothing differs from a normal Hudson installation.

Now comes the interesting part: making Hudson clone slaves of itself to create an elastic build system. Access to the Copper API is controlled using the familiar Unix permission scheme. Basically a user needs to have read/write access to /proc/xen/xenbus in order to utilize the API. By default only the root user has permission to it and since Hudson runs as the special hudson user, we need to also give this user permission to it in order for the application to clone itself. There are many ways to do this, but we are just going to give everyone access to it:

root@elastic-hudson# chmod og+rw /proc/xen/xenbus
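
Before wiring this into Hudson, it can be worth confirming that the hudson user really has the access it needs. The following is a minimal sketch and not part of the original setup: the function name can_use_copper_api is made up for illustration, and it only checks the file permissions described above, not whether the Copper API itself responds.

```shell
#!/bin/bash
# Hypothetical helper (not part of Copper): succeeds only if the
# current user can both read and write the given path. The default
# path is the one named in this post.
can_use_copper_api() {
  local path="${1:-/proc/xen/xenbus}"
  [ -r "$path" ] && [ -w "$path" ]
}

if can_use_copper_api; then
  echo "xenbus is readable and writable"
else
  echo "no read/write access to xenbus"
fi
```

Run it as the hudson user, since that is the account Hudson uses when it calls the launch script.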

Now that the hudson user has permission to use the Copper API, the next piece of configuration is that we have to allow Hudson passwordless ssh within the private network. Simply follow this guide and use localhost (or 10.0.0.1) for the remote machine as the hudson user. This should enable the hudson user to ssh throughout the private network without the need of a password.
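
For reference, the usual key-based setup looks roughly like the sketch below. It is not taken from the linked guide; it assumes OpenSSH is installed and that the function is run as the hudson user. The function name setup_passwordless_ssh and its optional directory argument are made up for illustration. Because every clone is a replica of the master, a key authorized on the master itself should also be accepted by its clones.

```shell
#!/bin/bash
# Hypothetical helper: create a passphrase-less RSA key pair (if one
# does not exist yet) and authorize it for logins as the same user.
# The directory defaults to ~/.ssh; pass another path to override it.
setup_passwordless_ssh() {
  local ssh_dir="${1:-$HOME/.ssh}"
  local pub

  mkdir -p "$ssh_dir" || return 1
  chmod 700 "$ssh_dir"

  # Generate the key only if one does not already exist.
  if [ ! -f "$ssh_dir/id_rsa" ]; then
    ssh-keygen -q -t rsa -N "" -f "$ssh_dir/id_rsa" || return 1
  fi

  # Append the public key unless that exact line is already present.
  pub=$(cat "$ssh_dir/id_rsa.pub") || return 1
  touch "$ssh_dir/authorized_keys"
  grep -qxF -- "$pub" "$ssh_dir/authorized_keys" ||
    echo "$pub" >> "$ssh_dir/authorized_keys"
  chmod 600 "$ssh_dir/authorized_keys"
}

# Example: run as the hudson user to set up ~/.ssh
# setup_passwordless_ssh
```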

Finally we are ready to finish off our setup by configuring a slave within Hudson. Log into the Hudson web interface (e.g. http://elastic-hudson-0.clusters:8080) and go to Manage Hudson -> Manage Nodes -> New Node. Enter the name of the slave (e.g. clone-slave) and select the dumb slave option and finally click ok.



In reality the only thing special about this configuration is that I've specified a custom launch script called /var/lib/hudson/clone_slave_launcher.sh. This is the script that Hudson will call when establishing a connection to the slave machine. The first thing we do in this script is use the Copper API to clone the Hudson machine, then we determine the clone's private network IP address, and finally proceed as normal to establish a connection to the newly created slave machine.

Before revealing the script, the other interesting piece of configuration is Hudson's slave availability. I have set it so that Hudson will connect to a slave (or in our case create one) if a job has been pending in the queue for more than a minute. It will then disconnect from the slave (or in our case destroy the machine) if the queue has been empty for more than a minute. In other words, Hudson will automatically scale up its footprint when there are pending jobs it needs to complete, and then automatically scale down once it no longer needs that many resources, freeing them up for other users (potentially other Hudson systems for different teams).

Here is the script that makes Hudson aware of the Copper platform, and performs the magic to make it into an Elastic Build System:

#!/bin/bash
###################################################################
# This script can be used with the Continuous Integration server
# Hudson to allow it to use the GridCentric Copper platform to
# create on-demand slaves that are preconfigured to be identical to
# the master, at the time the slave is created.
#
# It will also destroy each slave once Hudson has finished with it
# and taken it offline.
###################################################################


# The SSH command. We are not doing strict host checking because 
# these are clone machines on a private network. We are sure there
# will be no man in the middle attacks.
SSH="ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no"

# The amount of time (in seconds) the master should wait before 
# connecting to the slave. By default it is 5 seconds, but can be
# passed as the first argument to the script.
WAIT_BEFORE_CONNECT=5
if [ $# == 1 ]
then
  WAIT_BEFORE_CONNECT=$1
fi

# Acquire a ticket from the GridCentric Copper platform. The ticket
# basically reserves resources (CPUs, Memory, etc.) on the physical
# cluster to host our clone.
echo "Acquiring ticket for resources to clone on to..."
TICKET=`gc rt 1 1 1 60000 | awk '{ print $2 }'`

if [ ${#TICKET} == 0 ]
then
  echo "Failed to acquire ticket in 1 minute."
  # Exit with status 16 (the errno value of EBUSY)
  exit 16
else
  # We reserved the resources, so now we will clone using the 
  # ticket. This is the API call that allows this virtual machine 
  # to grow its footprint.

  echo "Cloning on ticket $TICKET..."
  gc clone $TICKET

  # The clone operation is kinda like fork(), afterwards there will
  # be 2 copies of this script running. One on the master virtual 
  # machine (this one) and another on the clone virtual machine. 
  # They will both be at the instruction after the clone (i.e. here),
  # so we do a quick check to see if we are the master or clone and
  # then execute differently.

  gc ismaster

  if [ $? == 0 ]
  then

    # This is the master's task and will be executed on the same 
    # machine that the Hudson instance is running. Basically it 
    # will determine the clone's private network's IP address, wait
    # a little bit of time for the clone to take care of its setup, 
    # and then SSH to the clone and run the Hudson slave jar.
    gc ltd $TICKET
    IP_ADDRESS=`gc ltd $TICKET | grep ip | awk '{ print $2 }'`

    echo "Connecting to clone on $IP_ADDRESS after waiting for $WAIT_BEFORE_CONNECT seconds..."
    sleep $WAIT_BEFORE_CONNECT

    # This command is what lets Hudson communicate with the slave. 
    # Basically it uses the stdin and stdout for communication.
    $SSH $IP_ADDRESS java -jar /var/run/hudson/war/WEB-INF/slave.jar

  else

    # This is the slave's task that will be executed on the clone 
    # machine. It first needs to kill the Hudson process to ensure
    # no conflicts, and then it will sit and poll to see if the 
    # slave.jar job is done. Hudson just kills the script on the 
    # master, so we can't assume we'll get a signal from it. So, we
    # just poll and once the slave.jar stops executing, we 
    # 'gc join' back and kill ourselves.

    # Kill any running Hudson instance 
    HUDSON_PID=`ps aux | grep hudson.war | grep java | awk '{ print $2 }'`
    kill $HUDSON_PID

    # Kill any running SSH connection to another slave. Remember, 
    # this is a clone of the master virtual machine, so any process
    # that was running there is also running here.
    SLAVE_PIDS=`ps aux | grep slave.jar | grep java | awk '{ print $2 }'`
    kill $SLAVE_PIDS

    # Just wait a bit for things to happen
    sleep 60

    # Now we periodically check if slave.jar is still running. If 
    # it is, we are good. Otherwise, we are finished.
    while true;
    do
      SLAVE_PID=`ps aux | grep slave.jar | grep java | awk '{ print $2 }'`
      if [ ${#SLAVE_PID} == 0 ]
      then

        # Slave process has gone. We join, but since the master is 
        # not waiting for us to join, we join with 'noreport'. In 
        # other words, this virtual machine will die and the master
        # will get no report.
        gc join noreport
      else
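        # The slave process is not a child of this shell, so 'wait'
        # cannot block on it. Sleep a little before polling again.
        sleep 5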
      fi
    done

  fi
fi
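
The fixed sleep before connecting is a guess at how long the clone needs to get ready. One possible refinement, sketched below and not part of the original script, is to poll the clone's SSH port until it accepts connections. The function name wait_for_port and its parameters are made up for illustration, and it relies on bash's /dev/tcp redirection, which is only available when bash was built with network redirections enabled.

```shell
#!/bin/bash
# Hypothetical helper: try to open a TCP connection to HOST:PORT,
# retrying once per second, until it succeeds or MAX_TRIES attempts
# have failed. Returns 0 on success and 1 on giving up.
wait_for_port() {
  local host="$1" port="$2" max_tries="${3:-30}"
  local i
  for ((i = 0; i < max_tries; i++)); do
    # The subshell opens the connection and closes it again on exit.
    if (exec 3<>"/dev/tcp/$host/$port") 2>/dev/null; then
      return 0
    fi
    sleep 1
  done
  return 1
}
```

In the launcher, the line that sleeps for $WAIT_BEFORE_CONNECT seconds could then become something like wait_for_port "$IP_ADDRESS" 22 "$WAIT_BEFORE_CONNECT", assuming the clone's SSH daemon listens on the default port 22. Note that a connection attempt to an unreachable address may itself block for a while, so the total wait can exceed the number of tries.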

Here is what it looks like in Hudson when we launch our new slave.



Finally, if we need more machines in the cluster, we can add as many of these slaves to Hudson as we like, using the exact same configuration. These machines will never sit idle because Hudson creates them only when it needs them and destroys them when it is done with them, and they are guaranteed to be properly configured because they are exact replicas of the master machine at the instant they are created.
