
Monday, August 30, 2010

Nimble Test Clusters

I have been a big fan of Selenium since I first used it about 4 years ago. It is a very flexible web testing framework that has the advantage of actually running within a real browser. This means that it can do complete end-to-end testing, from client-side JavaScript code down to SQL queries hitting the database server, and it can also do cross-browser testing to ensure that everything works across IE, Firefox, Safari, Chrome, etc. The test cases can be written in a wide variety of syntaxes, from a simple HTML table to full-fledged programming languages like Java or Ruby. By using Selenium it is very easy for any team building a web project to create automated end-to-end regression tests that can be kicked off by a Continuous Integration (CI) system whenever a developer commits. As I said before, I am a big fan of it.

Unfortunately, automated tests in general have a downside, especially as a project matures. The number of tests increases, which increases the time it takes to run the test suite. This either causes the test suite to become a bottleneck in the development cycle, or people just run it less frequently. Running tests less frequently has a compounding effect because tests start going stale, which increases the maintenance cost of the tests, which causes them to be run even less frequently. Selenium has a particular handicap in this realm because it runs within the actual web browser, which inherently adds time to each test.


Fortunately, automated tests are a good candidate for parallel execution even though they are generally executed serially. This is great news because we can significantly reduce the time a test suite takes to execute by distributing it across a mini-cluster of computers. Both JUnit and TestNG have projects for distributing their execution, and there is the Selenium Grid project that will distribute Selenium tests. Basically the goal of the project is to build a small test cluster where Selenium tests can be farmed out onto different machines to greatly reduce the time it takes to execute the full test suite. It is definitely a good idea, and it will address the problem of stale tests and their associated maintenance cost. With it taking less time to run the suite, we can easily add it back into a CI system and run the tests more frequently.


Unfortunately, as I mentioned in my post on Elastic Build Systems, once we start to distribute our environments we open up a host of other problems. Namely, we have to ensure that all the machines in the cluster have been updated with the same versions of Selenium and Selenium Grid, that the browser versions being tested are properly synchronised, and in general that the machines are identical, so that a test running on one machine is guaranteed to give the same results when executed on another. And of course, the flip side of making tests run faster by distributing them across multiple machines is that there are more machines sitting idle for longer. Finally, as a special consideration for Selenium, it might be desirable to rerun the test suite across multiple browsers to regression test everything that is supported. With this in mind, your test cluster could start growing to include specialised Windows machines running IE7 and IE8, Linux machines running Firefox, etc., to ensure that you get good coverage in addition to a speed boost. Of course, when adding more machines, the complexity of managing the cluster also increases, as do the wasted resources of idling machines.

Fortunately, GridCentric's Copper high-performance virtualization platform makes managing a computer cluster extremely easy and allows a cluster to be re-purposed within seconds.

At the core of the Copper platform is the ability of virtual machines to live-clone themselves. In other words, Copper allows running virtual machines to instantly create multiple copies of themselves where each copy is an exact replica of the original machine, from the memory that has been loaded down to the instruction pointer in the CPU. This whole process of taking a single running virtual machine to saturating the physical limitation of your hardware with cloned virtual machines happens within seconds. In addition, the Copper platform also takes care of networking all of these virtual machines so that they can communicate with each other, and if desired they can also communicate with the rest of the network.


Let's suppose, for example, that we manage to scrounge together a little test cluster of 4 machines that we can use for running Selenium Grid to optimize our Selenium test runs. Ideally we would like to ensure that our web platform tests run using our supported configurations: Windows running IE8, OS X running Safari, and Linux running Firefox. Traditionally, we would divide up our cluster for these configurations, with a single machine per configuration and one spare machine that we could give to a single configuration. We have just lost most of the benefits of Selenium Grid, and we have an extra computer that cannot be shared.

However, we can make use of Copper's ability to dynamically and quickly re-purpose an entire cluster. Each testing environment (operating system, browser, tools and Selenium Grid) will be encapsulated in its own virtual machine that will run on top of the little test cluster. When it becomes time to run the tests, the virtual machine will scale itself out using the fast live-cloning provided by Copper and then run Selenium Grid as normal. Once it has finished executing the tests, it will clean up its clone machines and scale back down freeing up the cluster resources. Then another test environment can scale out onto the recently freed resources, run the tests, and again scale back down.  This can be done for each environment.
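To make the trade-off concrete, here is a rough back-of-the-envelope sketch in Python. It is only an illustration: the function names and the 120-minute workloads are made up for the example, and it assumes tests parallelize perfectly and that cloning and tearing down environments costs no time, neither of which holds exactly in practice.

```python
def static_makespan(work_minutes, machines_per_env):
    """Wall-clock time when each environment owns a fixed slice of the cluster.

    All environments run at the same time, so the slowest one decides.
    """
    return max(work / machines
               for work, machines in zip(work_minutes, machines_per_env))


def repurposed_makespan(work_minutes, cluster_size):
    """Wall-clock time when each environment takes the whole cluster in turn."""
    return sum(work / cluster_size for work in work_minutes)


if __name__ == "__main__":
    # Hypothetical workloads: 120 machine-minutes of tests per environment.
    work = [120, 120, 120]          # IE8, Safari, Firefox
    # Static split of 4 machines: one each, spare given to the last one.
    print(static_makespan(work, [1, 1, 2]))
    # Re-purposing: every environment scales out to all 4 machines in turn.
    print(repurposed_makespan(work, 4))
```

With these made-up numbers the static split is bounded by the environments stuck on a single machine, while the re-purposed cluster finishes sooner even though the environments run one after another.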

Moreover, supporting a new test environment is nothing more than creating a virtual machine for it and configuring the machine by installing the test environment software. This new configuration can now scale up, run tests using Selenium Grid, and scale back down.

In a previous post I showed how Hudson can be easily transformed into an Elastic Build System. Basically, it allowed Hudson to dynamically create new slave machines that are identical to the master machine and then farm out build jobs to them. Hudson could also destroy the slaves when they were no longer needed to free up resources. Now let's combine everything together to show how we can build an ultimate continuous integration environment using the Copper platform.

There will be a single virtual machine with Hudson installed that will be our master machine. We will continue to have a virtual machine per test environment that we want to support, and we'll configure Hudson to treat each of these machines as a slave. We'll then configure some Hudson jobs that will run on our test environment slaves that will basically clone out the test environment, start up Selenium Grid, run the tests and then scale back down. We'll also configure Hudson to have a couple of elastic slaves as described in a previous post. So now we have it. A developer commits some change, Hudson farms the build out to a dynamically created slave. Once the build is completed, it can trigger the Selenium Grid testing agents in turn so that they will scale out, run the regression integration tests, scale back down and produce a test report of any possible regression issues.

Amazing.

Wednesday, August 11, 2010

Slashdotted!

Well, the Slashdot referral certainly stirred things up over here.

I think it's important to clarify some core points about our system and how it works. Adin's done a good job responding to some of the posts on Slashdot, but we have our hands full cutting the Copper 1.1 release and I figured it would be best to provide a quick overview of what we are up to here, so we can all get back to work :).

The memory oversubscription is one neat application of our core technology, which basically amounts to COW-based cloning of entire virtual machines. Our goal with Copper is to use this primitive to enable whole-cluster virtualization. We take a single physical cluster, and transparently run multiple virtual clusters on top of it. Each of these virtual clusters can grow and shrink independently (by cloning new VMs and killing off clones), to accommodate different demands.

Our ultimate aim is to take 'Virtual Machine Appliances' and turn them into 'Virtual Cluster Appliances': minimal-configuration disk images that are able to spin up a virtual cluster on demand, serve up a single app, and dynamically adjust resource usage based on whatever internal metrics they want to use.

To this end, we've exposed the cloning primitive inside the master virtual machine of Copper virtual clusters. The master VM, through either an API call or a shell tool (which basically wraps the API call), can invoke a 'clone' operation. For example, with the shell, it would look like this:

'gc clone 3'

This will request resources for 3 clone VMs from the Copper controller, create those clones, and then exit. The clones are exact duplicates of the master VM in terms of the state of memory and the root disk. Peripheral devices, external storage resources, and network configurations of the master can also be transparently replicated on the clones.

The semantics of the clone operation closely match those of the familiar UNIX process fork().

There are several additional API calls that can help with managing the resulting virtual cluster - e.g. API calls for listing all clones, killing individual clones or entire generations of clones, etc. Bindings have been created to expose these APIs in several languages so applications can make full use of Copper's capabilities. For instance, David had a very informative post on how to use Copper's APIs to create a dynamic build system (Elastic Build System in Action).
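For readers who have not used fork() directly, the following small Python sketch shows the process-level semantics that the clone operation is being compared to. It uses only the standard os module on a POSIX system and is an analogy, not the Copper API: after the call, two copies of the same program continue from the same point and tell themselves apart by the return value.

```python
import os


def fork_and_report():
    """Fork once and return the message the parent receives from the child."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: same code and same memory contents as the parent at the
        # moment of the fork, but fork() returned 0 here.
        os.close(read_fd)
        os.write(write_fd, b"child")
        os._exit(0)
    # Parent: fork() returned the child's process ID.
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as reader:
        message = reader.read().decode()
    os.waitpid(pid, 0)  # reap the child so it does not linger as a zombie
    return message


if __name__ == "__main__":
    print(fork_and_report())
```

A clone-aware script follows the same shape: call clone, check which side you are on, and branch.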

Copper's use of lightweight virtual machine cloning is deceptively powerful for enabling 'virtual cluster appliances'. For example, we can create a 'memcached' virtual cluster that just has memcached installed on the root disk, and can be expanded or contracted in a matter of seconds, with no extra configuration steps. The same can be done for apps like Hadoop and Cassandra. Check out some of the earlier blog posts (Howto: Build and Scale a Cassandra Cluster, Howto: Build a Hadoop Cluster in 5 Minutes, Howto: Build a 10 Node Memcached Cluster) for enabling cluster appliances with Copper.

Virtualization is growing up, and it can't be just about single machines anymore. Copper's name was chosen to be reminiscent of 'cluster operating system', and that's really what we're trying to build at GridCentric.

Tuesday, August 10, 2010

Virtualization and over-subscription: breaking the 100% utilization barrier

Something that we think about a lot at GridCentric is how to ensure that resources are used efficiently. In fact, that's what led us to create Copper in the first place -- seeing how difficult it was to share large clusters amongst multiple users, groups or even software versions. We think that the ideas behind Infrastructure-as-a-Service have huge potential to revolutionize computing.

We realize that a lot of enterprises have focused on deriving maximum efficiency from their resources for a long time. I thought that I'd take a ground-up look at some of the mechanisms used by virtualization to do that, and introduce something that we're developing for our next products in order to take resource multiplexing to the next level.

Example and demo


Our developers are hard at work on our next generation of products, which, in addition to allowing instantaneous scaling and solving configuration nightmares, allow you to over-commit your physical resources (both processors and memory). This quick demo provides a great example of over-subscription, showing a development version of our platform cramming 16 gigabytes worth of clones onto a single host with 8 gigabytes of memory. Awesome.



If you're unfamiliar with Copper, each of the clone VMs created in the video is an independent machine with its own memory, disk and network initially identical to the master at the time of cloning. Each of the 16 VMs could have been allocated to a user or used to perform a specific task independently of the rest. At the end, if we look at the resources allocated by VMs on this host within our Copper installation, we see the following:
$ gc-host stat node8
                   CPU(s) 16 of 4 (400%)
                      RAM 17408 of 5631 (309%)
               Local disk 0 of 192258490368 MB (0%)
  Available Named Storage ['kv-local(10485760000)']
       Used Named Storage []
    Available PCI Devices ['VGA(0000:01:05.0)']
         Used PCI Devices []
       Available Networks ['default', 'gridcentric']
           Available Tags ['development']

To be fair, the memory shown is without the extra overhead from the management domain (2.5 gigabytes). Pessimistically, we are achieving an actual memory over-subscription of ~212%, not 309%. Still, not too shabby.
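The percentages above can be re-derived from the `gc-host stat` output. The short Python sketch below does the arithmetic; the helper name is made up for illustration, and it assumes the 2.5 gigabytes of management-domain overhead is simply added to the 5631 MB shown as available.

```python
def oversubscription_percent(allocated, available, overhead=0.0):
    """Allocated resources as a percentage of what is physically there."""
    return 100.0 * allocated / (available + overhead)


# Figures taken from the gc-host stat output above (memory in MB).
optimistic = oversubscription_percent(17408, 5631)
# Assumption: count the management domain's 2.5 GB as part of the host.
pessimistic = oversubscription_percent(17408, 5631, overhead=2.5 * 1024)
cpu = oversubscription_percent(16, 4)

if __name__ == "__main__":
    print(f"RAM, as reported:           {optimistic:.1f}%")
    print(f"RAM, with management domain: {pessimistic:.1f}%")
    print(f"CPU:                         {cpu:.0f}%")
```

The first figure rounds to the 309% reported by the tool, and the second lands between 212% and 213%, in line with the pessimistic estimate quoted above.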

We achieve over-subscription by leveraging our novel cloning mechanism. In this post, I take a look at over-subscription in general and touch on some of the mechanisms used to support squeezing more out of your resources. For simplicity, I will focus on two crucial computing resources which are typically managed by an operating system or hypervisor: the processor (CPU) and memory. This post assumes a basic familiarity with virtualization concepts, but it will be gentle.

CPU over-subscription with multi-processing


Traditional operating systems have been over-subscribing and multiplexing resources since the days of the mainframe. Contrary to what it might seem, multiplexing a resource may actually allow you to use it far more efficiently than just giving it to a single process or user (when you consider everything that it is doing). Multiplexing CPUs with virtual machines is a primary motivator for consolidation, since CPUs are often an underutilized component on a typical enterprise server.

Suppose we have four virtual machines (VMs) that perform some work (the dark color) followed by some waiting for an I/O event (the light color). If you had an ssh connection to a VM, this might consist of processing an incoming TCP packet, the terminal and shell doing a bit of work, generating and sending a response packet (with appropriate ACK number) then waiting for the next packet to arrive.


Virtual machines are not much different than processes in an operating system. Because most processes inside each VM are I/O-bound, VMs are also generally I/O-bound as a whole (depending, of course, on the workload). This means that VMs often spend significant time waiting for I/O. If we were to schedule just a single VM on a CPU, the execution would look much like the following.


Instead of maximizing efficiency, the CPU would actually be doing nothing most of the time. If the VMs B, C, and D were similar and we had four distinct processors (one assigned to each), we would imagine them to be running as follows.


I've staggered them to show that we could actually come up with a very efficient schedule of execution on a single CPU. Since each VM spends most of its time waiting (and the CPU is not required for I/O), we could simply overlay the above schedules.


When a virtual machine doesn't have work to do, we instead run another virtual machine. When an I/O request completes, we reschedule the appropriate VM, which can once again perform useful work. We make this a bit more complex with limits on how long each virtual machine can run consecutively (the quantum), priorities, interrupt handling, gang scheduling, etc. But the above is the undeniable essence of all CPU over-commit: simple time-sharing.
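The overlay idea can be checked with a toy simulation. The Python sketch below is a deliberately minimal model, not how any real hypervisor scheduler works: time advances in ticks, each VM alternates a fixed burst of CPU work with a fixed I/O wait, and the single CPU runs the first runnable VM it finds. There is no quantum, priority or interrupt handling, and the function name and burst lengths are made up for illustration.

```python
def simulate(vms, ticks):
    """Return CPU utilization (0.0 to 1.0) of one CPU shared by `vms`.

    vms: list of (work_ticks, io_ticks) pairs, one per VM.
    """
    state = [{"work": work, "io": 0, "spec": (work, io)} for work, io in vms]
    busy = 0
    for _ in range(ticks):
        # Pick the first VM that is not waiting for I/O.
        running = next((vm for vm in state if vm["io"] == 0), None)
        # I/O makes progress without needing the CPU.
        for vm in state:
            if vm["io"] > 0:
                vm["io"] -= 1
        if running is not None:
            busy += 1
            running["work"] -= 1
            if running["work"] == 0:
                # Burst finished: start waiting, then repeat the cycle.
                running["work"], running["io"] = running["spec"]
    return busy / ticks


if __name__ == "__main__":
    one_vm = [(1, 3)]          # 1 tick of work, then 3 ticks of I/O wait
    print(simulate(one_vm, 400))       # 0.25
    print(simulate(one_vm * 4, 400))   # 1.0
```

A single mostly-waiting VM leaves the CPU idle three quarters of the time, while four of them interleave into a fully busy CPU, which is the staggered picture described above.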

Memory over-subscription


Memory over-subscription is where things get interesting. There are several different approaches that we can take in order to over-commit our physical memory.

In a normal virtualization situation, VMs each have a fixed amount of memory, the sum of which is less than or equal to the total memory available on the physical machine. Memory of the physical system is divided up into pages by the hardware, which the hypervisor can arbitrarily remap (therefore the VMs do not need contiguous memory). In a non-over-committed situation, there is an injective mapping from the memory of the VMs to the pages of physical memory. This situation is shown below, with colors corresponding to VMs and the numbers corresponding to the contents of their memory.


Paging / swap


Paging is the process in which a page is taken out of memory (possibly to a swap partition or pagefile), the corresponding page table entry is marked as not present, and the page is returned to memory on demand. Operating systems have been paging out process memory to disk for a long time. This allows you to run more processes than you have space for, or a single process that requires more memory than the physical memory you have available. Most optimization and research around paging is focused on selecting the correct pages to remove from physical memory, as the penalty incurred for unexpectedly having to bring a page back in is quite high (disks are extremely slow compared to memory).

In the below example, we see that we want to run VMs that have more memory than is available on the physical machine. In this case, several pages from VM C have been marked not present by the underlying hypervisor. If VM C is scheduled and attempts to access these pages, it will be blocked while the system fetches those pages from secondary storage. When that happens, we can assume that another page (belonging to one of A, B or C) will be paged out to disk to make room for the incoming one.
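To see why victim selection matters, here is a small Python sketch that counts page faults for a sequence of page accesses. It uses least-recently-used (LRU) eviction purely as an illustrative policy; the post does not say which policy any particular hypervisor uses, and the function name and access pattern are made up for the example.

```python
from collections import OrderedDict


def count_page_faults(accesses, frames):
    """Count faults for `accesses` with `frames` physical pages, using LRU."""
    resident = OrderedDict()   # ordered from least to most recently used
    faults = 0
    for page in accesses:
        if page in resident:
            resident.move_to_end(page)      # hit: mark as recently used
            continue
        faults += 1                         # miss: page must come from disk
        if len(resident) >= frames:
            resident.popitem(last=False)    # page out the LRU victim
        resident[page] = None
    return faults


if __name__ == "__main__":
    pattern = [1, 2, 3, 1, 2, 4, 1, 2, 3]
    print(count_page_faults(pattern, frames=3))   # 5
    print(count_page_faults(pattern, frames=4))   # 4
```

Every fault in this model stands for a slow trip to secondary storage, so even one extra frame, or a better choice of victim, can noticeably change how long the VM spends blocked.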


For virtualization specifically, it's useful to divide hypervisors into two different camps: type-I hypervisors run on the bare metal (e.g. VMware ESX, Xen) while type-II hypervisors run on top of an existing operating system (e.g. VMware Workstation, KVM, VirtualBox).

Type-II hypervisors often leverage existing operating system infrastructure in order to provide paging for virtual machines. The fact that they are running on an existing operating system means that they can leverage many pre-existing mechanisms. For type-I hypervisors, as far as I know, only VMware's ESX products offer a paging mechanism today. Several research projects have explored adding forms of paging to Xen, but as far as I know it's still a work-in-progress in the main tree.

Paging is a useful fall-back mechanism, and does not require modifications within guest VMs. Unfortunately, it can be very expensive (in terms of a performance hit) and may suffer from poor interactions with the guest operating systems (double paging).

Ballooning


Ballooning is a very common mechanism used to support creating VMs with more memory than the host can support. In essence, ballooning simply forces VMs to share with each other by requiring that they return some of their memory to the hypervisor. This is done by dynamically inflating a balloon driver within the guest operating system, then informing the hypervisor which pages have been allocated so that they may be used by other VMs. This is a very effective mechanism, and generally does not suffer from poor interactions with guest operating systems (unless memory requirements change rapidly). However, it is not transparent to the guest operating systems -- they are aware that they are missing memory and require the special balloon driver to be installed.

An example of ballooning is shown below. The grey pages have been allocated by the balloon driver within the guest and returned to the hypervisor (we could say that the balloons have been inflated slightly within the VMs).
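The bookkeeping behind ballooning can be sketched in a few lines of Python. This is a simplified model with made-up class and function names: each guest only gives up pages it is not using, whereas a real balloon driver can also push the guest into reclaiming or paging its own memory.

```python
class Guest:
    """A guest VM with a fixed allocation and a balloon inside it."""

    def __init__(self, name, total_pages, used_pages):
        self.name = name
        self.total_pages = total_pages
        self.used_pages = used_pages
        self.balloon_pages = 0

    @property
    def free_pages(self):
        return self.total_pages - self.used_pages - self.balloon_pages

    def inflate(self, requested):
        """Inflate the balloon by up to `requested` pages; return pages given."""
        granted = min(requested, self.free_pages)
        self.balloon_pages += granted
        return granted


def reclaim(guests, needed):
    """Ask guests in turn to inflate until `needed` pages are returned."""
    reclaimed = 0
    for guest in guests:
        if reclaimed >= needed:
            break
        reclaimed += guest.inflate(needed - reclaimed)
    return reclaimed


if __name__ == "__main__":
    guests = [Guest("A", total_pages=8, used_pages=3),
              Guest("B", total_pages=8, used_pages=6)]
    print(reclaim(guests, 6))
    print([(g.name, g.balloon_pages) for g in guests])
```

The pages counted in each balloon correspond to the grey pages in the example above: the guest still sees them as allocated, but the hypervisor is free to hand the underlying memory to another VM.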


Page sharing


By far the neatest of the space-saving techniques, some of VMware's products support a mechanism they call content-based page sharing. Essentially, the hypervisor continually spends a relatively small amount of CPU time crawling memory and hashing pages. If it identifies two pages with the same contents, it maps those pages to the same underlying physical page and frees one of the copies. Of course, this single copy is marked read-only and will be copied if one of the VMs needs to change the contents (copy-on-write). Thus, if you are running a lot of similar or identical VMs, there can be significant savings.

Similar to paging, research projects (1, 2) have explored this capability for Xen, but it has not quite made it into mainline. In practice, savings for production workloads will likely be very small. According to the few numbers I've found from VMware, it seems to be in the 5-10% range for simple workloads. Windows rewrites binaries heavily, so one is unlikely to find many similar pages outside of the kernel code pages, some fixed-address DLLs (win32 probably always gets its preferred address) and pages in the buffer cache. I suspect that most savings (in whitepapers, datasheets, etc.) likely come from identification of unused zero-pages, which would also be nicely gobbled up by an automated and co-operative balloon (an example -- each VM is only using about 5% of its memory). An example of page sharing is shown below, where the hypervisor has identified the common pages and remapped appropriately, leading to significant savings.
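The hash-and-remap step can be illustrated with a short Python sketch. It is a toy model, not VMware's implementation: pages are short byte strings, the function name is made up, SHA-256 stands in for whatever hash a real hypervisor uses, and the copy-on-write handling for later writes is left out.

```python
import hashlib


def share_pages(vm_pages):
    """Deduplicate identical pages across VMs.

    vm_pages: dict mapping VM name -> list of page contents (bytes).
    Returns (mapping, physical), where mapping[(vm, page_no)] is an index
    into the list of deduplicated physical pages.
    """
    physical = []    # one entry per distinct page content
    by_digest = {}   # digest -> frame numbers whose content has that digest
    mapping = {}
    for vm, pages in vm_pages.items():
        for page_no, content in enumerate(pages):
            digest = hashlib.sha256(content).digest()
            frame = None
            for candidate in by_digest.get(digest, []):
                # Compare full contents so a hash collision cannot
                # cause two different pages to be merged.
                if physical[candidate] == content:
                    frame = candidate
                    break
            if frame is None:
                frame = len(physical)
                physical.append(content)
                by_digest.setdefault(digest, []).append(frame)
            mapping[(vm, page_no)] = frame
    return mapping, physical


if __name__ == "__main__":
    vms = {
        "A": [b"kernel", b"\x00" * 8, b"data-a"],
        "B": [b"kernel", b"\x00" * 8, b"data-b"],
    }
    mapping, physical = share_pages(vms)
    print(len(physical), "physical pages back", len(mapping), "guest pages")
```

In this example six guest pages fit in four physical pages, and one of the two shared pages is an all-zero page, which matches the suspicion above about where many of the advertised savings come from.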


Summary


For clarity, the three approaches to over-subscribing memory that I've touched upon are outlined here.

Technique: Paging
  Advantage: (Mostly) transparent. As much as you want.
  Disadvantage: Slow. Poor interactions with guest VMs.
  Whitepaper symptom: They don't show how much swap is active.

Technique: Ballooning
  Advantage: Co-operative. Safe.
  Disadvantage: Guest VM modifications required. Limits guest VM memory.
  Whitepaper symptom: Big VMs with tiny memory usage (big balloons).

Technique: Content-based sharing
  Advantage: Transparent. Negligible overhead.
  Disadvantage: Probably limited gains.
  Whitepaper symptom: VMs run the same application with the same data.

Virtualization has come a long way in delivering techniques that allow organizations to get the most out of their resources. It's a very tricky problem, however, and there is no magic bullet. If you see 5x over-commit with real workloads, it's almost guaranteed that there's either a lot of ballooning or a lot of paging. We'll be talking more about some of these ideas over the next few months, as we put the finishing touches on our own approach to cramming in as much stuff as possible into as little time and space as we can. :)

Friday, August 6, 2010

Elastic Build System in Action

GridCentric's Copper platform enables multiple diverse applications to securely and easily share the same underlying physical resources. Moreover, it allows these applications to effortlessly scale up to create on-demand homogeneous clusters, and to scale back down again. My last post described how this capability is ideally suited to solving the problems with distributed Continuous Integration (CI) servers. In this post I will show how we can make Hudson, a popular CI server, aware of the Copper platform so that it can use Copper's powerful API to dynamically create identical slave machines and become an elastic build system.

Requirements

A Copper deployment (60-day free fully featured license)

A virtual cluster named elastic-hudson created in the Copper platform. If you are unfamiliar with creating virtual clusters in Copper, please take some time to follow the tutorials.

Configure the Virtual Cluster

This section will describe all the configuration we need to do in Copper to boot the virtual machine that will host our Hudson installation. Once the virtual machine is booted, it will be able to use the Copper API to clone itself.

As mentioned earlier I have a virtual cluster already created called elastic-hudson. For the root container I downloaded the pre-configured Ubuntu image, but there are other distros available if you would prefer something else. The only special configuration for this virtual cluster is that I am going to attach a public network to it.

Whenever a virtual machine uses the clone API, the new virtual machine will automatically be configured to be a part of a private network between all of the clones and the original machine. Since Hudson is a build system, it will need access to my Mercurial repository, which is hosted on a separate machine outside this private network.

To remedy this problem, I configured my virtual cluster to also have a public network. In this case, all the virtual machines in the cluster now have access to two networks: the original private network and the public network. I will be using the private network for communication between the Hudson master and the Hudson slaves, and then the public network for the machines to communicate with the repository.



That's all the configuration needed, and now I can boot the elastic-hudson virtual cluster.

Configure Hudson

Once the elastic-hudson virtual cluster is booted we should be able to log into it. By using the built-in DNS Server that comes with the Copper platform I can access the virtual cluster by simply using the domain name elastic-hudson.clusters. Otherwise, I can use the management tools to determine the virtual machine's public IP address.

Once logged into the virtual machine I will set up a vanilla Hudson install, as I would do in any other environment. So I follow the Hudson install instructions, install the tools (e.g. Mercurial, Ant, unzip, wget, make, etc.) that are required to build my project, and finally configure Hudson to build my projects. So far we have not done anything different from a normal Hudson setup.

Now comes the interesting part: making Hudson clone slaves of itself to create an elastic build system. Access to the Copper API is controlled using the familiar Unix permission scheme. Basically a user needs to have read/write access to /proc/xen/xenbus in order to utilize the API. By default only the root user has permission to it and since Hudson runs as the special hudson user, we need to also give this user permission to it in order for the application to clone itself. There are many ways to do this, but we are just going to give everyone access to it:

root@elastic-hudson# chmod og+rw /proc/xen/xenbus

Now that the hudson user has permission to use the Copper API, the next piece of configuration is that we have to allow Hudson passwordless ssh within the private network. Simply follow this guide and use localhost (or 10.0.0.1) for the remote machine as the hudson user. This should enable the hudson user to ssh throughout the private network without the need of a password.

Finally we are ready to finish off our setup by configuring a slave within Hudson. Log into the Hudson web interface (e.g. http://elastic-hudson-0.clusters:8080) and go to Manage Hudson -> Manage Nodes -> New Node. Enter the name of the slave (e.g. clone-slave), select the dumb slave option, and finally click OK.



In reality the only thing special about this configuration is that I've specified a custom launch script called /var/lib/hudson/clone_slave_launcher.sh. This is the script that Hudson will call when establishing a connection to the slave machine. The first thing we do in this script is use the Copper API to clone the Hudson machine, then we determine the clone's private network IP address, and finally proceed as normal to establish a connection to the newly created slave machine.

Before revealing the script, the other interesting configuration is Hudson's slave availability. I have set it so that Hudson will connect to a slave (or in our case create one) if there is a pending job in the queue for more than a minute. It will then disconnect from the slave (or in our case destroy the machine) if the queue becomes empty for more than a minute. In other words, Hudson will automatically scale up its footprint when there are pending jobs it needs to complete, and then automatically scale down, freeing up resources for other users (potentially other Hudson systems for different teams), once it no longer needs that many resources.
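The availability rule described above boils down to a small decision function. The Python sketch below is only a restatement of that rule for clarity: the function name, its arguments and the returned strings are made up for this example and are not part of Hudson or Copper.

```python
def decide(slave_online, queue_length, seconds_in_state, threshold=60):
    """Decide what to do with an on-demand slave.

    seconds_in_state: how long the build queue has been continuously
    non-empty (when queue_length > 0) or continuously empty (when it is 0).
    Returns "launch", "retire" or "none".
    """
    if not slave_online and queue_length > 0 and seconds_in_state > threshold:
        return "launch"   # Hudson runs the launch script, which clones
    if slave_online and queue_length == 0 and seconds_in_state > threshold:
        return "retire"   # Hudson disconnects, and the clone destroys itself
    return "none"


if __name__ == "__main__":
    print(decide(slave_online=False, queue_length=2, seconds_in_state=75))
    print(decide(slave_online=True, queue_length=0, seconds_in_state=90))
    print(decide(slave_online=True, queue_length=0, seconds_in_state=10))
```

The one-minute threshold in both directions keeps a short burst of jobs, or a short lull, from triggering a clone or a teardown.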

Here is the script that makes Hudson aware of the Copper platform, and performs the magic to make it into an Elastic Build System:

#!/bin/bash
###################################################################
# This script can be used with the Continuous Integration server
# Hudson to allow it to use the GridCentric Copper platform to
# create on-demand slaves that are preconfigured to be identical to
# the master, at the time the slave is created.
#
# It will also destroy a slave once Hudson has finished with it
# and takes it offline.
###################################################################


# The SSH command. We are not doing strict host checking because 
# these are clone machines on a private network. We are sure there
# will be no man in the middle attacks.
SSH="ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no"

# The amount of time (in seconds) the master should wait before 
# connecting to the slave. By default it is 5 seconds, but can be
# passed as the first argument to the script.
WAIT_BEFORE_CONNECT=5
if [ $# == 1 ]
then
  WAIT_BEFORE_CONNECT=$1
fi

# Acquire a ticket from the GridCentric Copper platform. The ticket
# basically reserves resources (CPUs, Memory, etc.) on the physical
# cluster to host our clone.
echo "Acquiring ticket for resources to clone on to..."
TICKET=`gc rt 1 1 1 60000 | awk '{ print $2 }'`

if [ ${#TICKET} == 0 ]
then
  echo "Failed to acquire ticket in 1 minute."
  # exit with the EBUSY error code (16)
  exit 16
else
  # We reserved the resources, so now we will clone using the 
  # ticket. This is the API call that allows this virtual machine 
  # to grow its footprint.

  echo "Cloning on ticket $TICKET..."
  gc clone $TICKET

  # The clone operation is kinda like fork(), afterwards there will
  # be 2 copies of this script running. One on the master virtual 
  # machine (this one) and another on the clone virtual machine. 
  # They will both be at the instruction after the clone (i.e. here),
  # so we do a quick check to see if we are the master or clone and
  # then execute differently.

  gc ismaster

  if [ $? == 0 ]
  then

    # This is the master's task and will be executed on the same 
    # machine that the Hudson instance is running. Basically it 
    # will determine the clone's private network's IP address, wait
    # a little bit of time for the clone to take care of its setup, 
    # and then SSH to the clone and run the Hudson slave jar.
    gc ltd $TICKET
    IP_ADDRESS=`gc ltd $TICKET | grep ip | awk '{ print $2 }'`

    echo "Connecting to clone on $IP_ADDRESS after waiting for $WAIT_BEFORE_CONNECT seconds..."
    sleep $WAIT_BEFORE_CONNECT

    # This command is what lets Hudson communicate with the slave. 
    # Basically it uses the stdin and stdout for communication.
    $SSH $IP_ADDRESS java -jar /var/run/hudson/war/WEB-INF/slave.jar

  else

    # This is the slave's task that will be executed on the clone 
    # machine. It first needs to kill the Hudson process to ensure
    # no conflicts, and then it will sit and poll to see if the 
    # slave.jar job is done. Hudson just kills the script on the 
    # master, so we can't assume we'll get a signal from it. So, we
    # just poll and once the slave.jar stops executing, we 
    # 'gc join' back and kill ourselves.

    # Kill any running Hudson instance 
    HUDSON_PID=`ps aux | grep hudson.war | grep java | awk '{ print $2 }'`
    kill $HUDSON_PID

    # Kill any running SSH connection to another slave. Remember, 
    # this is a clone of the master virtual machine, so any process
    # that was running there is also running here.
    SLAVE_PIDS=`ps aux | grep slave.jar | grep -v grep | awk '{ print $2 }'`
    if [ ${#SLAVE_PIDS} != 0 ]
    then
      kill $SLAVE_PIDS
    fi

    # Just wait a bit for things to happen
    sleep 60

    # Now we periodically check if slave.jar is still running. If 
    # it is, we are good. Otherwise, we are finished.
    while true;
    do
      SLAVE_PID=`ps aux | grep slave.jar | grep java | awk '{ print $2 }'`
      if [ ${#SLAVE_PID} == 0 ]
      then

        # Slave process has gone. We join, but since the master is 
        # not waiting for us to join, we join with 'noreport'. In 
        # other words, this virtual machine will die and the master
        # will get no report.
        gc join noreport
      else
        # slave.jar is not a child of this shell, so 'wait' cannot block
        # on it; sleep between polls instead.
        sleep 10
      fi
    done

  fi
fi
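The clone's polling loop boils down to "run a check, sleep, repeat until the check fails". As a minimal sketch of that pattern, it could be factored into a small helper. The name poll_until_gone and the interval used in the usage note are illustrative assumptions, not part of the original launch script.

```shell
#!/bin/bash
# poll_until_gone INTERVAL COMMAND [ARGS...]
# Re-runs COMMAND every INTERVAL seconds and returns once it fails.
# Hypothetical helper, not part of the original launch script.
poll_until_gone() {
  local interval=$1
  shift
  while "$@" >/dev/null 2>&1; do
    sleep "$interval"
  done
}
```

On the clone, something like `poll_until_gone 5 pgrep -f slave.jar` followed by `gc join noreport` could then stand in for the while loop. pgrep -f matches against the full command line and never reports itself, so no extra grep filtering is needed.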

Here is what it looks like in Hudson when we launch our new slave.



Finally, if we need more machines in the cluster, we can add as many of these slaves to Hudson as we like, using the exact same configuration. These machines will never sit idle because Hudson creates them only when it needs them and destroys them when it is done with them, and they are guaranteed to be properly configured because they are exact replicas of the master machine at the instant they are created.

Thursday, August 5, 2010

Howto: Build and scale a Cassandra cluster in five minutes

Introduction

In the spirit of the software recipes we've been posting recently, I decided to give Cassandra a shot. Cassandra is a distributed key-value store that was open-sourced by Facebook in 2008 and is now under the umbrella of the Apache Software Foundation. It advertises high performance and robustness to individual node failures while providing eventual consistency. It's been receiving quite a bit of attention recently, and Riptano has appeared to provide commercial support.

Cassandra is a bit of an odd name for a piece of software in my opinion. Cassandra is a figure from Greek mythology who was blessed with the ability to see the future but cursed so that no one would ever believe her predictions. I struggled for a while to come up with some hilarious software-equivalent joke, but I think we are all better off without one.

I spent a bit of time and installed Cassandra on a Copper virtual cluster. I then wrote a few quick scripts so that clones automatically join a Cassandra cluster when they are created. GridCentric Copper supports persistent local storage through an awesome VFS abstraction. As with Hadoop, I've omitted this part of the setup for now to simplify this post, but I reserve the right to make that post in the near future.

Cassandra does not have different classes of nodes, so I had no need to run anything special on the master of the virtual cluster. Cassandra only requires a seed node, so that a freshly started instance can learn about the others. I use the master for this purpose, since we can assume that it's always around. Other than that, this article details an unmodified Cassandra installation running inside a virtual cluster -- with transient nodes that can be created and destroyed in seconds.

Requirements

A GridCentric Copper deployment (free trial available here, software downloadable here).

Just as in my Hadoop post, I'll assume that you have a virtual cluster already. If you don't know how to create a virtual cluster, you can see the tutorials. I just used the cluster-in-a-box script to fire up a new Ubuntu cluster.

Installing Packages

Cassandra needs a Java 6 JRE installed and configured. In Ubuntu, I just did an apt-get install sun-java6-jre, but the exact steps will vary depending on your distribution. The next step is to grab the latest Cassandra packages. Choose a mirror from
http://www.apache.org/dyn/closer.cgi?path=/cassandra/0.6.4/apache-cassandra-0.6.4-bin.tar.gz. Once I had downloaded the latest Cassandra, I extracted the contents and moved the directory to /opt, so I could make a nice little system that conforms to the Linux Standard Base (or at least comes closer).
wget <url>
tar -zxvf apache-cassandra-0.6.4-bin.tar.gz
mv apache-cassandra-0.6.4 /opt

Cassandra is actually ready to go with the default configuration and your Java install can be tested at this point. Simply run /opt/apache-cassandra-0.6.4/bin/cassandra -f and you should see it start up. You can press Ctrl-C to kill it, because we want to write a few more scripts and make a cluster.

Tweaking files and writing Scripts

Okay, so one annoying thing about Cassandra is that you need to specify an address inside the configuration file for binding. This doesn't lend itself well to any kind of cluster environment and it actually confused me -- an interface would make much more sense here. We need to abstract away this detail so we can tune the bind address on the clone. In order to do this, we copy /opt/apache-cassandra-0.6.4/conf/storage-conf.xml to /opt/apache-cassandra-0.6.4/conf/storage-conf.xml.in and change the following keys in our new storage-conf.xml.in:
...
  <Seeds>
      <Seed>127.0.0.1</Seed>
  </Seeds>
  ...
  <ListenAddress>localhost</ListenAddress>
  ...
to:
...
  <Seeds>
      <Seed>10.0.0.1</Seed>
  </Seeds>
  ...
  <ListenAddress>PRIVATE_ADDRESS</ListenAddress>
  ...
I changed the Seed key because in our virtual cluster I am going to use the master as the seed. I changed the ListenAddress to PRIVATE_ADDRESS because we are going to write a quick script that replaces that token and generates a unique storage-conf.xml on each clone with the correct addresses.

In order to write that script, I need a quick helper. First, I create a script named get-private-ip and put it in /usr/local/bin.
#!/bin/bash
# Match only the IPv4 "inet addr:" line, not the inet6 line.
ifconfig eth0 | grep 'inet addr' | cut -d: -f2 | cut -d' ' -f1
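The helper above depends on the classic net-tools ifconfig layout, where the IPv4 line reads "inet addr:10.0.0.2  Bcast:...". As a rough sketch under that same assumption, the parsing step can be isolated into a function that reads ifconfig-style text on stdin, which makes it easy to try out without a live interface. The function name parse_inet_addr is hypothetical.

```shell
#!/bin/bash
# parse_inet_addr: read net-tools style ifconfig output on stdin and
# print the first IPv4 address found. Hypothetical helper name; it
# assumes the "inet addr:A.B.C.D" layout and will print nothing for
# the newer "inet A.B.C.D netmask ..." layout.
parse_inet_addr() {
  awk '/inet addr:/ { sub(/^addr:/, "", $2); print $2; exit }'
}

# Usage on a machine with the classic ifconfig output format:
#   ifconfig eth0 | parse_inet_addr
```

Because awk stops at the first matching line and the pattern requires "inet" followed by a space, the inet6 line is never picked up.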

Next, I create recreate-cassandra-config in /usr/local/bin to actually grab the IP we want, and recreate storage-conf.xml from our tokenized storage-conf.xml.in.
#!/bin/bash
cd /opt/apache-cassandra-0.6.4
address=`get-private-ip`
cat conf/storage-conf.xml.in | sed -e "s/PRIVATE_ADDRESS/$address/" > conf/storage-conf.xml
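One thing the token replacement does not guard against is an empty address: if the lookup fails, the generated file ends up with an empty ListenAddress. A minimal sketch of a more defensive variant follows. The function name render_config and its argument order are assumptions for illustration only.

```shell
#!/bin/bash
# render_config TEMPLATE ADDRESS OUTPUT
# Replace every PRIVATE_ADDRESS token in TEMPLATE with ADDRESS and
# write the result to OUTPUT. Refuses to run with an empty address.
# Hypothetical helper, not part of the original scripts.
render_config() {
  local template=$1 address=$2 output=$3
  if [ -z "$address" ]; then
    echo "render_config: empty address, not writing $output" >&2
    return 1
  fi
  sed -e "s/PRIVATE_ADDRESS/$address/g" "$template" > "$output"
}
```

It could be called as `render_config conf/storage-conf.xml.in "$(get-private-ip)" conf/storage-conf.xml` from the Cassandra directory. Returning early leaves any existing storage-conf.xml untouched when no address is found.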

Finally, because I am not going to be using persistent local storage (saving that for a future post), I create a script clear-cassandra-data in /usr/local/bin in order to clear the data directory. We will run this on clones before kicking off Cassandra.
#!/bin/bash
rm -rf /var/lib/cassandra/data/*

I'm not a fan of hacking ways of starting and stopping services. Before running my mini-Cassandra cluster, I threw together a script that lets me start and stop nicely (at least somewhat nicely). I put this in /etc/init.d/cassandra and then run update-rc.d cassandra defaults in order to install the appropriate symlinks in /etc/rc*.d.
#!/bin/bash
#
# Simple Cassandra init script.
#
# chkconfig: 345 99 99
# description: Cassandra

### BEGIN INIT INFO
# Provides: cassandra
# Required-Start: $network
# Required-Stop: 
# Should-Start: 
# Should-Stop: 
# Default-Start: 3 4 5
# Default-Stop: 0 1 2 6
# Short-Description: start and stop cassandra daemon
# Description: Cassandra
### END INIT INFO

PROG="/opt/apache-cassandra-0.6.4/bin/cassandra -f"
PROGNAME="Cassandra"
LOGFILE=/var/log/cassandra.log
PIDFILE=/var/run/cassandra.pid

running(){
    PID=`cat $PIDFILE 2>/dev/null`

    # Check that the pid is sane.
    if [ "x$PID" == "x" ] ; then
        return 1
    fi

    # Check that the process is alive.
    ps $PID >/dev/null 2>&1 || return 1

    # Looks okay.
    return 0
}

start(){
    echo -n $"Starting $PROGNAME: "

    # Try to start the program.
    if running; then
        echo "Failed.  Maybe remove $PIDFILE?"
        return 1
    fi

    mkdir -p `dirname $LOGFILE`
    $PROG > $LOGFILE 2>&1 &
    PID=$!
    mkdir -p `dirname $PIDFILE`
    echo $PID > $PIDFILE

    echo "Success."
    return 0
}

stop(){
    echo -n $"Stopping $PROGNAME: "

    # Check if it's already stopped.
    if ! running ; then
        echo "Failed.  Already stopped."
        return 1
    fi 

    # Find the PID and kill it.
    PID=`cat $PIDFILE 2>/dev/null`
    if [ "x$PID" == "x" ] ; then
        echo "Failed."
        return 1
    fi
    # (Try five times to kill it).
    for i in `seq 1 5`; do
        kill $PID
        sleep 1
        if ! running ; then
            break
        fi
    done

    # Check if it is finished.
    if running ; then
        echo "Failed."
        return 1
    fi

    # Clear out the pidfile.
    echo "Success."
    rm -f $PIDFILE
    return 0
}

restart(){
    stop
    start
}

status(){
    echo -n $"Status of $PROGNAME: "

    if running ; then
        echo "Running."
        return 0
    else
        echo "Not running."
        return 1
    fi
}

# See how we were called.
case "$1" in
    start)
        start
        RETVAL=$?
        ;;
    stop)
        stop
        RETVAL=$?
        ;;
    status)
        status
        RETVAL=$?
        ;;
    restart)
        restart
        RETVAL=$?
        ;;
    *)
        echo $"Usage: $0 {start|stop|status|restart}"
        RETVAL=2
        ;;
esac

exit $RETVAL
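The running() check in the script shells out to ps to see whether the recorded PID is alive. As an alternative sketch, the same test can be done with the kill -0 idiom, which sends no signal and only reports whether the process can be signalled. The function name pidfile_running is hypothetical, and the check assumes the caller is allowed to signal the process (true for an init script running as root).

```shell
#!/bin/bash
# pidfile_running PIDFILE
# Succeeds if PIDFILE holds the PID of a process we can signal.
# Hypothetical helper, shown as an alternative to the ps-based check.
pidfile_running() {
  local pid
  pid=$(cat "$1" 2>/dev/null)
  # An empty or missing pidfile means "not running".
  [ -n "$pid" ] || return 1
  # Signal 0 performs the existence and permission check only.
  kill -0 "$pid" 2>/dev/null
}
```

Inside the init script this would be used as `if pidfile_running "$PIDFILE"; then ...`.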

Tying everything together, I create a simple post-clone script that will stop Cassandra, reconfigure, reset the data directory, and start Cassandra. This I will put in /etc/gridcentric/post-clone-on-clone.d/cassandra so that it runs on every clone immediately after creation. Recall that we set the Cassandra seed to point to the master, so after cloning, each clone will automatically join the ring and discover the others.
/etc/init.d/cassandra stop
recreate-cassandra-config
clear-cassandra-data
/etc/init.d/cassandra start

Creating the cluster

We're finally ready to create the cluster! To start a single instance of Cassandra on the master, we simply run:
/etc/init.d/cassandra start
Now I am free to use nodetool to query Cassandra and start storing data, etc. As an example, you can run /opt/apache-cassandra-0.6.4/bin/nodetool -h localhost ring. This shows the current nodes that are in the Cassandra cluster (should be just one). In order to automatically grow this cluster, we simply clone this virtual machine. Here, I'll create a cluster of size 5:
gc clone 4 # Creates a cluster of size 5.
Now check out /opt/apache-cassandra-0.6.4/bin/nodetool -h localhost ring!

Check out the video below.  I start just before starting Cassandra and creating the cluster. I show the ring continuously after starting Cassandra. This is all in real-time (except when I pause, so you can read the labels!), including the cloning of virtual machines.



Update: Sorry for the somewhat poor quality of the video. I'll see if I can get a higher quality version uploaded soon so that it's clear what's going on. Note that you can click on the video and it will take you to YouTube.

Wednesday, August 4, 2010

Elastic Build Systems

I am a firm supporter of Continuous Integration (CI). Throughout my development career I have used CruiseControl and TeamCity, and at GridCentric we are now using Hudson. I am actually really proud of the setup that we have going on here. In addition to the usual tasks Hudson performs, such as compiling our source, running unit tests or packaging our distribution, we are also creating up-to-date documentation of both the JavaDocs and the database schema. It is really cool to have your database schema document automatically updated whenever you push a change set.

Eventually as the project grows -- more sub-projects being supported by the CI server, more people committing, more tasks being done on each commit -- it will start to become too much for a single machine to handle. TeamCity supports distributing the build over multiple machines using what it calls agents, and Hudson does something similar with its notion of slaves.


Great! We distribute the CI server and we can grow indefinitely living in our happy world of Continuous Integration.

Wait, one second...

Instead of having a single machine that is easy to update, we now have a handful of machines. Each one in isolation is easy to manage, but now we have to worry about ensuring they are kept in sync. Suppose we realize we need a new version of Ant, or our project's Maven settings.xml changes. Instead of simply making these changes in one spot, we now need to log into each machine in our build cluster and make them there. That doesn't sound like fun, it goes against my DRY philosophy, and I can foresee some things falling through the cracks every now and then.

Hudson even comes with this warning:
Also note that the slaves are a kind of a cluster, and operating a cluster (especially a large one or heterogeneous one) is always a non-trivial task. For example, you need to make sure that all slaves have JDKs, Ant, CVS, and/or any other tools you need for builds. You need to make sure that slaves are up and running, etc. Hudson is not a clustering middleware, and therefore it doesn't make this any easier.



Things only start to look worse when you go from a single team to a company with multiple teams, all trying to keep their mini-build clusters synchronized. The probability that something will be overlooked increases, and at any given time probably half of those machines are just idling, waiting for a build job. This is a waste of resources -- both in terms of physical machines consuming power while doing nothing, and in developers' time spent tracking down build issues because one machine wasn't upgraded properly.



Fortunately, GridCentric's Copper Virtualization Platform makes managing multiple homogeneous clusters extremely easy.

Copper provides a very simple, but powerful, API call within the virtual machine: clone. Within seconds the virtual machine can create multiple running replicas of itself at the state just prior to being cloned. In addition, the clone virtual machines are automatically networked with the original virtual machine on a private network only visible to them. In other words, by using the clone API call we can very easily create a homogeneous cluster based on the configuration of a single machine within seconds.



Need to make a configuration change to your build cluster, such as manually installing a library into the Maven repository, or installing an interesting Ant plugin? Simply destroy your existing cluster (a couple of seconds), perform the configuration changes on your original virtual machine, then clone the cluster back into existence (just a couple of seconds). Keeping your build cluster's configuration in sync could not be easier, or faster.



Of course Copper gives this powerful cloning ability to any virtual machine booted into the platform, enabling it to support multiple build clusters. In fact Copper makes consolidating all the mini-build clusters within an organization onto the same physical hardware possible and simple.

If a new CI server is required, simply boot up a virtual machine in Copper and install the server on it. When it comes time to distribute the server, clone the virtual machine running it, and its footprint will grow without any additional configuration.

The next piece of the puzzle is to teach the Continuous Integration servers about this powerful clone API call so that they can automatically scale themselves up into a distributed system when needed, but can also scale back when running idle. Once this piece is solved we'll have a truly remarkable CI server, and a lack of resources will no longer be an excuse for any team within an organization not to be doing Continuous Integration.