Something that we think about a lot at GridCentric is how to ensure that resources are used efficiently. In fact, that's what led us to create Copper in the first place -- seeing how difficult it was to share large clusters amongst multiple users, groups or even software versions. We think that the ideas behind Infrastructure-as-a-Service have huge potential to revolutionize computing.
We realize that a lot of enterprises have focused on deriving maximum efficiency from their resources for a long time. I thought that I'd take a ground-up look at some of the mechanisms used by virtualization to do that, and introduce something that we're developing for our next products in order to take resource multiplexing to the next level.
Example and demo
Our developers are hard at work on our next generation of products, which, in addition to allowing instantaneous scaling and solving configuration nightmares, allow you to over-commit your physical resources (both processors and memory). This quick demo provides a great example of over-subscription, showing a development version of our platform cramming 16 gigabytes' worth of clones onto a single host with 8 gigabytes of memory. Awesome.
If you're unfamiliar with Copper, each of the clone VMs created in the video is an independent machine with its own memory, disk and network, initially identical to the master at the time of cloning. Each of the 16 VMs could have been allocated to a user or used to perform a specific task independently of the rest. At the end, if we look at the resources allocated to VMs on this host within our Copper installation, we see the following:
$ gc-host stat node8
CPU(s) 16 of 4 (400%)
RAM 17408 of 5631 (309%)
Local disk 0 of 192258490368 MB (0%)
Available Named Storage ['kv-local(10485760000)']
Used Named Storage []
Available PCI Devices ['VGA(0000:01:05.0)']
Used PCI Devices []
Available Networks ['default', 'gridcentric']
Available Tags ['development']
To be fair, the memory shown is without the extra overhead from the management domain (2.5 gigabytes). Pessimistically, we are achieving an actual memory over-subscription of ~212%, not 309%. Still, not too shabby.
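The percentages above are easy to reproduce. The following is a small sketch of the arithmetic, using the figures from the `gc-host stat` output and assuming that the 2.5 gigabytes of management domain overhead corresponds to 2560 MB (the function and variable names are only illustrative):

```python
# Figures copied from the `gc-host stat node8` output above.
allocated_vcpus = 16
physical_cpus = 4
allocated_ram_mb = 17408
reported_ram_mb = 5631

# Assumption: the 2.5 gigabytes held by the management domain is 2560 MB.
management_domain_mb = 2560


def oversubscription_pct(allocated, available):
    """Return allocated resources as a percentage of what is available."""
    return 100.0 * allocated / available


cpu_pct = oversubscription_pct(allocated_vcpus, physical_cpus)
ram_pct = oversubscription_pct(allocated_ram_mb, reported_ram_mb)

# Pessimistic view: count the management domain's memory as available too.
ram_pct_pessimistic = oversubscription_pct(
    allocated_ram_mb, reported_ram_mb + management_domain_mb
)

print(f"CPU: {cpu_pct:.1f}%")
print(f"RAM (as reported): {ram_pct:.1f}%")
print(f"RAM (pessimistic): {ram_pct_pessimistic:.1f}%")
```

Under that assumption, the pessimistic figure lands between 212% and 213%, in line with the ~212% quoted above.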
We achieve over-subscription by leveraging our novel cloning mechanism. In this post, I take a look at over-subscription in general and touch on some of the mechanisms used to support squeezing more out of your resources. For simplicity, I will focus on two crucial computing resources which are typically managed by an operating system or hypervisor: the processor (CPU) and memory. This post assumes a basic familiarity with virtualization concepts, but it will be gentle.
CPU over-subscription with multi-processing
Traditional operating systems have been over-subscribing and multiplexing resources since the days of the mainframe. Contrary to what it might seem, multiplexing a resource may actually allow you to use it far more efficiently than just giving it to a single process or user (when you consider everything that it is doing). Multiplexing CPUs with virtual machines is a primary motivator for consolidation, since CPUs are often an underutilized component on a typical enterprise server.
Suppose we have four virtual machines (VMs) that perform some work (the dark color) followed by some waiting for an I/O event (the light color). If you had an ssh connection to a VM, this might consist of processing an incoming TCP packet, the terminal and shell doing a bit of work, generating and sending a response packet (with appropriate ACK number) then waiting for the next packet to arrive.
Virtual machines are not much different from processes in an operating system. Because most processes inside each VM are I/O-bound, VMs are also generally I/O-bound as a whole (depending, of course, on the workload). This means that VMs often spend significant time waiting for I/O. If we were to schedule just a single VM on a CPU, the execution would look much like the following.
Instead of maximizing efficiency, the CPU would actually be doing nothing most of the time. If VMs B, C and D were similar and we had four distinct processors (one assigned to each), we would imagine them running as follows.
I've staggered them to show that we could actually come up with a very efficient schedule of execution on a single CPU. Since each VM spends most of its time waiting (and the CPU is not required for I/O), we could simply overlay the above schedules.
When a virtual machine doesn't have work to do, we instead run another virtual machine. When an I/O request completes, we reschedule the appropriate VM, which can once again perform useful work. In practice this gets a bit more complex, with limits on how long each virtual machine can run consecutively (the quantum), priorities, interrupt handling, gang scheduling, etc. But the above is undeniably the essence of all CPU over-commit: simple time-sharing.
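To make the time-sharing argument concrete, here is a toy simulation. It is only a sketch under simplifying assumptions: time advances in fixed ticks, the quantum is one tick, scheduling is plain round-robin, and I/O takes a fixed number of ticks. The `VM` class and `simulate` function are hypothetical names, not part of any real scheduler.

```python
from dataclasses import dataclass


@dataclass
class VM:
    name: str
    work: int                 # CPU ticks needed per burst of work
    wait: int                 # ticks spent waiting for I/O after each burst
    remaining_work: int = 0
    remaining_wait: int = 0
    bursts_done: int = 0


def simulate(vms, ticks):
    """Time-share one CPU among vms; return the fraction of busy ticks."""
    for vm in vms:
        vm.remaining_work = vm.work
        vm.remaining_wait = 0
        vm.bursts_done = 0
    busy = 0
    next_idx = 0
    for _ in range(ticks):
        # Round-robin: pick the next VM that is not waiting for I/O.
        running = None
        for offset in range(len(vms)):
            candidate = vms[(next_idx + offset) % len(vms)]
            if candidate.remaining_wait == 0:
                running = candidate
                next_idx = (next_idx + offset + 1) % len(vms)
                break
        # I/O makes progress whether or not the CPU is busy.
        for vm in vms:
            if vm.remaining_wait > 0:
                vm.remaining_wait -= 1
        if running is not None:
            busy += 1
            running.remaining_work -= 1
            if running.remaining_work == 0:
                # Burst finished: the VM now blocks on I/O.
                running.bursts_done += 1
                running.remaining_work = running.work
                running.remaining_wait = running.wait
    return busy / ticks


# One VM alone on a CPU: one tick of work, then three ticks of waiting.
alone = [VM("A", work=1, wait=3)]
alone_utilization = simulate(alone, ticks=400)

# Four similar VMs sharing the same single CPU.
shared = [VM(name, work=1, wait=3) for name in "ABCD"]
shared_utilization = simulate(shared, ticks=400)

print(alone_utilization, shared_utilization)
print([vm.bursts_done for vm in shared])
```

With these particular numbers the waits interleave perfectly, so each VM completes as many bursts on the shared CPU as it would on a dedicated one. Real workloads are less tidy, which is where quanta and priorities come in.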
Memory over-subscription
Memory over-subscription is where things get interesting. There are several different approaches that we can take in order to over-commit our physical memory.
In a normal virtualization situation, VMs each have a fixed amount of memory, the sum of which is less than or equal to the total memory available on the physical machine. Memory of the physical system is divided up into pages by the hardware, which the hypervisor can arbitrarily remap (therefore the VMs do not need contiguous memory). In a non-over-committed situation, there is an injective mapping from the memory of the VMs to the pages of physical memory. This situation is shown below, with colors corresponding to VMs and the numbers corresponding to the contents of their memory.
Paging / swap
Paging is the process in which a page is taken out of memory (possibly to a swap partition or pagefile), the corresponding page table entry is marked as not present, and the page is returned to memory on-demand. Operating systems have been paging out process memory to disk for a long time. This allows you to run more processes than you have space for, or a single process that requires more memory than the physical memory you have available. Most optimization and research around paging is focused on selecting the correct pages to remove from physical memory, as the penalty incurred for unexpectedly having to bring a page back in is quite high (disks are extremely slow compared to memory).
In the example below, we see that we want to run VMs that have more memory than is available on the physical machine. In this case, several pages from VM C have been marked not present by the underlying hypervisor. If VM C is scheduled and attempts to access these pages, it will be blocked while the system fetches those pages from secondary storage. When that happens, we can assume that another page (belonging to one of A, B or C) will be paged out to disk to make room for the incoming one.
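The fault-and-evict cycle described above can be sketched in a few lines. This is a toy model, not how any particular hypervisor implements paging: it assumes a least-recently-used eviction policy, and the `PagedMemory` class and its fields are hypothetical names chosen for illustration.

```python
from collections import OrderedDict


class PagedMemory:
    """Toy model of host memory with a fixed number of physical frames."""

    def __init__(self, frames):
        self.frames = frames
        self.resident = OrderedDict()  # (vm, page) -> contents, in LRU order
        self.swap = {}                 # (vm, page) -> contents paged out
        self.faults = 0

    def access(self, vm, page):
        key = (vm, page)
        if key in self.resident:
            # Page is present: just record that it was recently used.
            self.resident.move_to_end(key)
            return self.resident[key]
        # Page is marked not present: this access faults.
        self.faults += 1
        if len(self.resident) >= self.frames:
            # Assumed policy: evict the least recently used page to swap.
            victim, contents = self.resident.popitem(last=False)
            self.swap[victim] = contents
        # Bring the page back from swap, or create it on first touch.
        contents = self.swap.pop(key, f"{vm}:{page}")
        self.resident[key] = contents
        return contents


memory = PagedMemory(frames=2)
for vm in ("A", "B", "C"):
    memory.access(vm, 0)       # third access pushes A's page out to swap
memory.access("A", 0)          # A faults its page back in, evicting B's
print(memory.faults, list(memory.resident), list(memory.swap))
```

Every fault here stands in for a slow trip to disk, which is why choosing the right victim matters so much.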
For virtualization specifically, it's useful to divide hypervisors into two different camps: type-I hypervisors run on the bare metal (e.g. VMWare ESX, Xen) while type-II hypervisors run on top of an existing operating system (e.g. VMWare Workstation, KVM, VirtualBox).
Type-II hypervisors often leverage existing operating system infrastructure in order to provide paging for virtual machines. The fact that they are running on an existing operating system means that they can leverage many pre-existing mechanisms. For type-I hypervisors, as far as I know, only VMWare's ESX products offer a paging mechanism today. Several research projects have explored adding forms of paging to Xen, but as far as I know it's still a work-in-progress in the main tree.
Paging is a useful fall-back mechanism, and does not require modifications within guest VMs. Unfortunately, it can be very expensive (in terms of a performance hit) and may suffer from poor interactions with the guest operating systems (double paging).
Ballooning
Ballooning is a very common mechanism used to support creating VMs with more memory than the host can support. In essence, ballooning simply forces VMs to share with each other by requiring that they return some of their memory to the hypervisor. This is done by dynamically inflating a balloon driver within the guest operating system, then informing the hypervisor which pages have been allocated so that they may be used by other VMs. This is a very effective mechanism, and generally does not suffer from poor interactions with guest operating systems (unless memory requirements change rapidly). However, it is not transparent to the guest operating systems -- they are aware that they are missing memory and require the special balloon driver to be installed.
An example of ballooning is shown below. The grey pages have been allocated by the balloon driver within the guest and returned to the hypervisor (we could say that the balloons have been inflated slightly within the VMs).
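As a rough sketch of the bookkeeping involved, the model below has guests hand pages to a balloon and the hypervisor add them to a free pool. It assumes the balloon only ever takes pages the guest is not using (a real driver can also push the guest into reclaiming memory), and `Hypervisor`, `Guest`, `inflate` and `deflate` are hypothetical names, not a real driver interface.

```python
class Hypervisor:
    """Tracks host pages that are not currently assigned to any guest."""

    def __init__(self, free_pages):
        self.free_pages = free_pages


class Guest:
    def __init__(self, name, total_pages, used_pages):
        self.name = name
        self.total_pages = total_pages
        self.used_pages = used_pages
        self.balloon_pages = 0

    @property
    def free_pages(self):
        # Pages in the balloon are unavailable to the guest.
        return self.total_pages - self.used_pages - self.balloon_pages

    def inflate(self, hypervisor, pages):
        """Give up to `pages` unused guest pages back to the hypervisor."""
        granted = min(pages, self.free_pages)
        self.balloon_pages += granted
        hypervisor.free_pages += granted
        return granted

    def deflate(self, hypervisor, pages):
        """Take pages back from the hypervisor, if it has any to spare."""
        returned = min(pages, self.balloon_pages, hypervisor.free_pages)
        self.balloon_pages -= returned
        hypervisor.free_pages -= returned
        return returned


host = Hypervisor(free_pages=0)
guest_a = Guest("A", total_pages=8, used_pages=3)
guest_b = Guest("B", total_pages=8, used_pages=6)

# Ask each guest for four pages; B can only spare two.
reclaimed = guest_a.inflate(host, 4) + guest_b.inflate(host, 4)
print(reclaimed, host.free_pages, guest_a.free_pages, guest_b.free_pages)
```

The guest sees its own free memory shrink as the balloon grows, which is exactly the lack of transparency mentioned above.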
Page sharing
Page sharing is by far the neatest of the space-saving techniques. Some of VMWare's products support a mechanism they call content-based page sharing. Essentially, the hypervisor continually spends a relatively small amount of CPU time crawling memory and hashing pages. If it identifies two pages with the same contents, it maps those pages to the same underlying physical page and frees one of the copies. Of course, this single copy is marked read-only and will be copied if one of the VMs needs to change the contents (copy-on-write). Thus, if you are running a lot of similar or identical VMs, there can be significant savings.
Similar to paging, research projects (1, 2) have explored this capability for Xen, but it has not quite made it into mainline. In practice, savings for production workloads will likely be very small. According to the few numbers I've found from VMWare, it seems to be in the 5-10% range for simple workloads. Windows rewrites binaries heavily, so one is unlikely to find many similar pages outside of the kernel code pages, some fixed-address DLLs (win32 probably always gets its preferred address) and pages in the buffer cache. I suspect that most savings (in whitepapers, datasheets, etc.) likely come from identification of unused zero-pages, which would also be nicely gobbled up by an automated and co-operative balloon (for example, when each VM is only using about 5% of its memory). An example of page sharing is shown below, where the hypervisor has identified the common pages and remapped appropriately, leading to significant savings.
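The scan-and-merge loop can be sketched as follows. This is a toy model under several assumptions and not VMWare's implementation: pages are whole byte strings, the hash is SHA-256 from Python's standard `hashlib`, a full-page write stands in for copy-then-modify, and `SharedMemory` and its methods are hypothetical names.

```python
import hashlib


class SharedMemory:
    """Toy model of content-based page sharing with copy-on-write."""

    def __init__(self):
        self.frames = {}    # frame id -> page contents
        self.refcount = {}  # frame id -> number of guest pages mapped to it
        self.mapping = {}   # (vm, page) -> frame id
        self._next_frame = 0

    def write(self, vm, page, data):
        key = (vm, page)
        old = self.mapping.get(key)
        if old is not None:
            if self.refcount[old] == 1:
                self.frames[old] = data  # private frame: write in place
                return
            self.refcount[old] -= 1      # shared frame: break the sharing
        frame = self._next_frame
        self._next_frame += 1
        self.frames[frame] = data
        self.refcount[frame] = 1
        self.mapping[key] = frame

    def read(self, vm, page):
        return self.frames[self.mapping[(vm, page)]]

    def scan(self):
        """Merge pages with identical contents; return frames freed."""
        by_hash = {}
        freed = 0
        for key, frame in list(self.mapping.items()):
            data = self.frames[frame]
            digest = hashlib.sha256(data).digest()
            canonical = by_hash.setdefault(digest, frame)
            # Compare full contents so a hash collision never merges pages.
            if canonical != frame and self.frames[canonical] == data:
                self.mapping[key] = canonical
                self.refcount[canonical] += 1
                self.refcount[frame] -= 1
                if self.refcount[frame] == 0:
                    del self.frames[frame]
                    del self.refcount[frame]
                    freed += 1
        return freed


memory = SharedMemory()
memory.write("A", 0, bytes(4096))      # zero page
memory.write("B", 0, bytes(4096))      # identical zero page
memory.write("C", 0, b"x" * 4096)      # different contents
freed = memory.scan()
frames_after_scan = len(memory.frames)

memory.write("B", 0, b"y" * 4096)      # write to a shared page
frames_after_write = len(memory.frames)
print(freed, frames_after_scan, frames_after_write)
```

The write at the end shows the cost side of the trade: as soon as one VM modifies a shared page, the saving for that page disappears.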
Summary
For clarity, the three approaches to over-subscribing memory that I've touched upon are outlined here.
| Technique | Advantage | Disadvantage | Whitepaper symptom |
|---|---|---|---|
| Paging | (Mostly) transparent. As much as you want. | Slow. Poor interactions with guest VMs. | They don't show how much swap is active. |
| Ballooning | Co-operative. Safe. | Guest VM modifications required. Limits guest VM memory. | Big VMs with tiny memory usage (big balloons). |
| Content-based sharing | Transparent. Negligible overhead. | Probably limited gains. | VMs run the same application with the same data. |
Virtualization has come a long way in delivering techniques that allow organizations to get the most out of their resources. It's a very tricky problem, however, and there is no magic bullet. If you see 5x over-commit with real workloads, it's almost guaranteed that there's either a lot of ballooning or a lot of paging. We'll be talking more about some of these ideas over the next few months, as we put the finishing touches on our own approach to cramming in as much stuff as possible into as little time and space as we can. :)





