So let's talk about grid queueing. Before I wax poetic about how GridCentric's cluster operating system makes grid queueing better, it's probably a good idea to start with a delineation of what the current state of grid queueing is, and some of the problems we're trying to solve for people who use it.
Conceptually, grid queueing is simple: you have a cluster, and you want to utilize the free compute nodes in the cluster to execute programs. So you install a grid queueing engine, which is aware of all the computers in your cluster, and you submit your jobs to it. The queueing engine figures out which computers are capable of executing your jobs, takes into account the jobs other people have submitted, prioritizes them all according to its configuration, and communicates with the hosts on the cluster to run them. It monitors and stores the execution state of these jobs and reports it to the user on request.
Pretty straightforward, yes? Well, conceptually yes, but the devil, as always, is in the details.
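To make that description concrete, here is a minimal sketch of such an engine in Python. It is a toy model and not any real queueing product: the class names, the capability sets, and the "lower number runs first" priority rule are all assumptions made for illustration.

```python
import heapq
from dataclasses import dataclass
from itertools import count
from typing import Optional


@dataclass
class Host:
    name: str
    capabilities: set          # software installed on this host (hypothetical labels)
    busy: bool = False


@dataclass
class Job:
    name: str
    requires: set              # software the job needs
    priority: int = 0          # assumption: lower number runs first
    state: str = "queued"
    host: Optional[str] = None


class QueueEngine:
    def __init__(self, hosts):
        self.hosts = list(hosts)
        self._heap = []
        self._order = count()  # tie-breaker so equal priorities run in submission order
        self.jobs = {}

    def submit(self, job):
        heapq.heappush(self._heap, (job.priority, next(self._order), job))
        self.jobs[job.name] = job

    def dispatch(self):
        """Assign queued jobs, in priority order, to free hosts that can run them."""
        waiting = []
        while self._heap:
            entry = heapq.heappop(self._heap)
            job = entry[2]
            host = next((h for h in self.hosts
                         if not h.busy and job.requires <= h.capabilities), None)
            if host is None:
                waiting.append(entry)      # nothing suitable is free; keep it queued
                continue
            host.busy = True
            job.state, job.host = "running", host.name
        for entry in waiting:
            heapq.heappush(self._heap, entry)

    def status(self, name):
        job = self.jobs[name]
        return job.state, job.host


engine = QueueEngine([Host("node1", {"python"}), Host("node2", {"java", "jdbc"})])
engine.submit(Job("report", {"python"}, priority=1))
engine.submit(Job("analysis", {"java", "jdbc"}, priority=0))
engine.submit(Job("second-report", {"python"}, priority=2))
engine.dispatch()
```

After the dispatch, "report" and "analysis" are each running on the one host that can execute them, while "second-report" stays queued because the only Python-capable host is busy. A real engine would also track job completion, retries, and per-user policy, none of which is modeled here.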
The Setup
Let's say Uma the user, an analyst, has three types of heavyweight programs she uses routinely:
1. A reporting program that analyzes logs and generates XML reports. This program is written in Python and uses a wide set of Python libraries, including bindings to a C XML library.
2. A JDBC client that connects to her company's database, retrieves a dataset, and performs a long-running analysis on it, storing the results back into the database.
3. An internal C++ application that takes in raw input data collected from surveys and stored as flat files on a network filesystem, and prepares them for entry into the database.
The Old Way
Let's assume we already have a traditional, non-Copper cluster, installed and configured appropriately with operating systems, networking, etc.
What does the cluster administrator need to do to let Uma run her jobs on the cluster?
1. He needs to install and configure the three different applications on all the machines in the cluster. This might be straightforward for one or two machines, but it gets tedious once you are up to dozens, let alone hundreds or thousands of machines. They all need Python, the appropriate Python libraries, Java, the appropriate JDBC adaptors, the C++ program, etc.
2. He needs to be wary of conflicts that Uma's tools might have with some other users' tools. He needs to ensure that if another user already has a tool on a host that requires, say, an XML library that conflicts with Uma's requirements, then Uma's tools don't get installed on that machine.
3. He needs to ensure that the network filesystem that Uma's tools access is configured at the same path on all the machines in the cluster.
4. He needs to add a queue to the queueing engine specific to the machines capable of executing Uma's jobs.
5. Even after all this, when Uma actually submits her jobs, she needs to be careful to send them to the appropriate queue as configured by the administrator. If she sends a job to the wrong queue, it fails.
By no means is this a trivial amount of work. In fact, once you scale it up to hundreds or thousands of machines, it's extremely difficult to manage, as cluster administrators are already well aware.
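The bookkeeping in steps 4 and 5 can be sketched in a few lines of Python. This is a toy model under stated assumptions: the host names, package labels, and queue names are made up, and "running" a job just means picking the first host in the queue and checking what is installed there.

```python
# What the administrator has installed on each host (hypothetical inventory).
HOSTS = {
    "node1": {"python", "xml-bindings"},
    "node2": {"java", "jdbc-driver"},
    "node3": {"python", "xml-bindings", "java", "jdbc-driver", "survey-prep"},
}

# What each of Uma's three programs needs in order to run.
JOB_REQUIREMENTS = {
    "report": {"python", "xml-bindings"},
    "analysis": {"java", "jdbc-driver"},
    "survey-prep": {"survey-prep"},
}


def build_queue(requirements, hosts=HOSTS):
    """Return the names of hosts that have every required package installed."""
    return sorted(name for name, installed in hosts.items()
                  if requirements <= installed)


# Step 4: the administrator defines queues by hand, one per software profile.
QUEUES = {
    "uma-all": build_queue(set().union(*JOB_REQUIREMENTS.values())),
    "python-only": build_queue(JOB_REQUIREMENTS["report"]),
}


def run(job, queue):
    """Step 5: run the job on the first host of the chosen queue."""
    host = QUEUES[queue][0]
    missing = JOB_REQUIREMENTS[job] - HOSTS[host]
    if missing:
        raise RuntimeError(f"{job} failed on {host}: missing {sorted(missing)}")
    return host
```

In this model only node3 ends up in the "uma-all" queue, and `run("analysis", "python-only")` raises because node1 has no Java or JDBC driver. Every new host or new tool means revisiting both the inventory and the queue definitions.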
The Copper Way
So how does Copper help with grid queueing? It helps by eliminating nearly all of the redundancies in the above steps.
Let's assume, as before, that the cluster has been set up, but this time with the Copper platform. None of Uma's applications have been installed or configured.
1. The administrator allocates a new virtual machine image to Uma with the queueing daemon installed. Installing the daemon is trivial: just download the appropriate package and install it, with zero configuration. He then configures Uma's VM to have access to the appropriate network filesystem containers. This can be done quickly and easily through Copper's web interface.
2. The administrator installs and configures the applications on this virtual machine image. This setup is done exactly once, and on a virtual machine independent of all other users and types of jobs that might execute on the cluster, obviating any possible issues with software conflicts. The administrator can even skip this step if Uma is able to install and configure the applications herself. Since Uma's VM is independent of all other VMs, and exists in its own private virtual cluster, the administrator need not be concerned with any security or conflict issues.
3. Uma logs into the virtual machine, and submits jobs to the queueing engine running on it. The jobs she submits can be any command that can be executed on the virtual machine. Copper will ensure through the magic of VM cloning that wherever the job executes, the host it executes on will look and behave exactly like the host it was invoked on. As far as the job is concerned, it is for all practical purposes running on Uma's original VM.
4. There is no step 4.
With Copper, we've dropped the overhead of cluster setup as close as possible to zero, for both the cluster administrator and the user. Uma can even trivially add new applications to execute on the cluster: she simply installs them on her VM, and they implicitly and automatically become available as applications that can be queued up to execute on the cluster.
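The idea behind the cloning step can be illustrated with a small Python model. To be clear, this is not Copper's API or implementation: the `VM` class, the `clone` and `run_job` functions, and the host and mount names are all hypothetical, and cloning is simulated with a plain deep copy. The point is only that a job sees whatever the template VM had at the moment it was cloned.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class VM:
    name: str
    installed: set = field(default_factory=set)
    mounts: dict = field(default_factory=dict)   # path -> filesystem container

    def install(self, package):
        self.installed.add(package)


def clone(template, host):
    """Simulate VM cloning: the worker is an exact copy of the template."""
    worker = copy.deepcopy(template)
    worker.name = f"{template.name}@{host}"
    return worker


def run_job(template, command, host):
    """Run a command on a fresh clone of the template placed on the given host."""
    worker = clone(template, host)
    if command not in worker.installed:
        raise RuntimeError(f"{command} is not installed on {template.name}")
    return f"{command} ran on {worker.name}"


# The applications are set up once, on Uma's VM only.
uma_vm = VM("uma-vm", mounts={"/data/surveys": "survey-container"})
for app in ("report", "analysis", "survey-prep"):
    uma_vm.install(app)

result = run_job(uma_vm, "report", host="node17")

# Installing a new tool on the template makes it available to later jobs.
uma_vm.install("new-tool")
result2 = run_job(uma_vm, "new-tool", host="node42")
```

Notice that there is no per-host inventory and no queue-to-host mapping in this model. The only place software is tracked is the template VM, which is why adding an application is a single install.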
The Lesson
Grid queueing doesn't have to be hard or tedious, no matter what the size of your cluster. Using virtualization, cloning, and the GridCentric high-level toolkit, Copper makes implementing grid queueing a snap.