H. Andrés Lagar-Cavilla, Joseph A. Whitney, Adin Scannell, Philip Patchin, Stephen M. Rumble,
Eyal de Lara, Michael Brudno, M. Satyanarayanan†
University of Toronto and Carnegie Mellon University†
http://sysweb.cs.toronto.edu/snowflock
Abstract
Virtual Machine (VM) fork is a new cloud computing abstraction that instantaneously clones a VM into multiple replicas running on different hosts. All replicas share the same initial state, matching the intuitive semantics of stateful worker creation. VM fork thus enables the straightforward creation and efficient deployment of many tasks demanding swift instantiation of stateful workers in a cloud environment, e.g. excess load handling, opportunistic job placement, or parallel computing. Lack of instantaneous stateful cloning forces users of cloud computing into ad hoc practices to manage application state and cycle provisioning. We present SnowFlock, our implementation of the VM fork abstraction. To evaluate SnowFlock, we focus on the demanding scenario of services requiring on-the-fly creation of hundreds of parallel workers in order to solve computationally-intensive queries in seconds. These services are prominent in fields such as bioinformatics, finance, and rendering. SnowFlock provides sub-second VM cloning, scales to hundreds of workers, consumes few cloud I/O resources, and has negligible runtime overhead.
Categories and Subject Descriptors D.4.7 [Operating Systems]: Organization and Design—Distributed Systems; D.4.1 [Operating Systems]: Process Management—Multiprocessing/Multiprogramming/Multitasking
General Terms Design, Experimentation, Measurement, Performance
Keywords Virtualization, Cloud Computing
1. Introduction
Cloud computing is transforming the computing landscape by shifting the hardware and staffing costs of managing a computational center to third parties such as Yahoo! or Amazon [EC2]. Small organizations and individuals are now able to deploy world-scale services: all they need to pay is the marginal cost of actual resource usage. Virtual machine (VM) technology is widely adopted as an enabler of cloud computing. Virtualization provides many benefits, including security, performance isolation, ease of management, and flexibility of running in a user-customized environment.
A major advantage of cloud computing is the ability to use a variable number of physical machines and VM instances depending on the needs of the problem. For example, a task may need only a single CPU during some phases of execution but may be capable of leveraging hundreds of CPUs at other times. While current cloud APIs allow for the instantiation of new VMs, their lack of agility fails to provide users with the full potential of the cloud model. Instantiating new VMs is a slow operation (typically taking "minutes" [EC2]), and the new VMs originate either as fresh boots or replicas of a template VM, unaware of the current application state. This forces cloud users into employing ad hoc solutions that require considerable developer effort to explicitly propagate application state and waste resources by pre-provisioning worker VMs that remain mostly idle. Moreover, idle VMs are likely to be consolidated and swapped out [Steinder 2007, Wood 2007], incurring costly migration delays before they can be used.
We introduce VM fork, a clean abstraction that simplifies development and deployment of cloud applications that dynamically change their execution footprint. VM fork allows for the rapid (<1 second) instantiation of stateful computing elements in a cloud environment. While VM fork is similar in spirit to the familiar UNIX process fork, in that the child VMs receive a copy of all of the state generated by the parent VM prior to forking, it is different in three fundamental ways. First, our VM fork primitive allows for the forked copies to be instantiated on a set of different physical machines, enabling the task to take advantage of large compute clusters. In contrast, previous work [Vrable 2005] is limited to cloning VMs within the same host. Second, we have made our primitive parallel, enabling the creation of multiple child VMs with a single call. Finally, our VM fork replicates all of the processes and threads of the originating VM.
(a) Sandboxing
    state = trusted_code()
    ID = VM_fork(1)
    if ID.isChild():
        untrusted_code(state)
        VM_exit()
    else:
        VM_wait(ID)

(b) Parallel Computation
    ID = VM_fork(N)
    if ID.isChild():
        parallel_work(data[ID])
        VM_exit()
    else:
        VM_wait(ID)

(c) Load Handling
    while(1):
        if load.isHigh():
            ID = VM_fork(1)
            if ID.isChild():
                while(1):
                    accept_work()
        elif load.isLow():
            VM_kill(randomID)

(d) Opportunistic Job
    while(1):
        N = available_slots()
        ID = VM_fork(N)
        if ID.isChild():
            work_a_little(data[ID])
            VM_exit()
        VM_wait(ID)

Figure 1: Four programming patterns based on fork's stateful cloning. Forked VMs use data structures initialized by the parent, such as data in case (b). Note the implicit fork semantics of instantaneous clone creation.
This enables effective replication of multiple cooperating processes, e.g. a customized LAMP (Linux/Apache/MySql/Php) stack. VM fork enables the trivial implementation of several useful and well-known patterns that are based on stateful replication, e.g., inheriting initialized data structures when spawning new workers. Pseudocode for four of these is illustrated in Figure 1: sandboxing of untrusted code, instantiating new worker nodes to handle increased load (e.g. due to flash crowds), enabling parallel computation, and opportunistically utilizing unused cycles with short tasks. All four patterns exploit fork's ability to create stateful workers, and further, they all exploit fork's ability to instantaneously create workers.
SnowFlock, our implementation of the VM fork abstraction, provides swift parallel stateful VM cloning with little runtime overhead and frugal consumption of cloud I/O resources, leading to good scalability. SnowFlock takes advantage of several key techniques. First, SnowFlock utilizes lazy state replication to minimize the amount of state propagated to the child VMs. Lazy replication allows for extremely fast instantiation of clones by initially copying the minimal necessary VM data, and transmitting only the fraction of the parent's state that clones actually need. Second, a set of avoidance heuristics eliminate substantial superfluous memory transfers for the common case of clones allocating new private state. Finally, exploiting the likelihood of child VMs to execute very similar code paths and access common data structures, we use a multicast distribution technique for VM state that provides scalability and prefetching.
We evaluated SnowFlock by focusing on a demanding instance of Figure 1(b): interactive parallel computation, in which a VM forks multiple workers in order to carry out a short-lived, computationally-intensive parallel job. We have conducted experiments with applications from bioinformatics, quantitative finance, rendering, and parallel compilation. These applications are deployed as Internet services [NCBI, EBI] that leverage mass parallelism to provide interactive (tens of seconds) response times to complex queries: find candidates similar to a gene, predict the outcome of stock options, render an animation, etc. In experiments conducted with 128 processors, SnowFlock achieves speedups coming within 7% or better of optimal execution, and offers sub-second VM fork irrespective of the number of clones. SnowFlock is an order of magnitude faster and sends two orders of magnitude less state than VM fork based on suspend/resume or migration.
2. VM Fork
The VM fork abstraction lets an application take advantage of cloud resources by forking multiple copies of its VM that then execute independently on different physical hosts. VM fork preserves the isolation and ease of software development associated with VMs, while greatly reducing the performance overhead of creating a collection of identical VMs on a number of physical machines.
The semantics of VM fork are similar to those of the familiar process fork: a parent VM issues a fork call which creates a number of clones, or child VMs. Each of the forked VMs proceeds with an identical view of the system, save for a unique identifier (vmid) which allows them to be distinguished from one another and from the parent. However, each forked VM has its own independent copy of the operating system and virtual disk, and state updates are not propagated between VMs.
A key feature of our usage model is the ephemeral nature of children. Forked VMs are transient entities whose memory image and virtual disk are discarded once they exit. Any application-specific state or values they generate (e.g., a result of computation on a portion of a large dataset) must be explicitly communicated to the parent VM, for example by message passing or via a distributed file system.
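For example, a child could relay its partial result to the parent over a plain TCP socket before exiting. The sketch below (in Python) shows both sides of such an exchange; the parent address and port are illustrative placeholders and are not part of the SnowFlock API.

    import json
    import socket

    PARENT_ADDR = ("10.0.0.1", 9000)   # illustrative address of the parent on the virtual subnet

    def report_result(vmid, result):
        # Child side: push the result to the parent; the caller may then invoke sf_exit().
        with socket.create_connection(PARENT_ADDR) as s:
            s.sendall(json.dumps({"vmid": vmid, "result": result}).encode())

    def collect_results(n_children):
        # Parent side: gather exactly one result per child before joining them.
        results = {}
        srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        srv.bind(("", PARENT_ADDR[1]))
        srv.listen()
        while len(results) < n_children:
            conn, _ = srv.accept()
            data = b""
            while True:
                chunk = conn.recv(4096)
                if not chunk:
                    break
                data += chunk
            conn.close()
            msg = json.loads(data.decode())
            results[msg["vmid"]] = msg["result"]
        srv.close()
        return results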
VM fork has to be used with care as it replicates all the processes and threads of the parent VM: conflicts may arise if multiple processes within the same VM simultaneously invoke VM forking. We envision that VM fork will be used in VMs that have been carefully customized to run a single application or perform a specific task, such as serving a web page. The application has to be cognizant of the VM fork semantics, e.g., only the "main" process calls VM fork in a multi-process application.
The semantics of VM fork include integration with a dedicated, isolated virtual network connecting child VMs with their parent. Upon VM fork, each child is configured with a new IP address based on its vmid, and it is placed on the same virtual subnet as the VM from which it was created. Child VMs cannot communicate with hosts outside this virtual network. Two aspects of our design deserve further comment. First, the user must be conscious of the IP reconfiguration semantics: for instance, network shares must be (re)mounted after cloning. Second, we provide a NAT layer that gives clones controlled access to selected addresses outside the virtual network, for example to reach an external dataset server.
3. Design Rationale
Performance is the greatest challenge to realizing the full potential of the VM fork paradigm. VM fork must swiftly replicate the state of a VM to many hosts simultaneously. This is a heavyweight operation, as VM instances can easily occupy GBs of RAM. While one could implement VM fork using existing VM suspend/resume functionality, the wholesale copying of a VM to multiple hosts is far too taxing, and decreases overall system scalability by clogging the network with gigabytes of data.
Figure 2 illustrates this by plotting the cost of suspending and resuming a 1 GB VM to an increasing number of hosts over NFS (see Section 5 for details on the testbed). As expected, there is a direct relationship between I/O involved and fork latency, with latency growing to the order of hundreds of seconds. Moreover, contention caused by the simultaneous requests by all children turns the source host into a hot spot. Despite shorter downtime, live migration [Clark 2005, VMotion], a popular mechanism for consolidating VMs in clouds [Steinder 2007, Wood 2007], is fundamentally the same algorithm plus extra rounds of copying, thus taking longer to replicate VMs.
A second approximation to solving the problem of VM fork latency uses our multicast library (see Section 4.5) to leverage parallelism in the network hardware. Multicast delivers state simultaneously to all hosts. Scalability in Figure 2 is vastly improved, but overhead is still in the range of minutes. To move beyond this, we must substantially reduce the total amount of VM state pushed over the network.
Our design is based on four insights: (i) it is possible to start executing a child VM on a remote site by initially replicating only minimal state; (ii) children typically access only a fraction of the parent's memory image; (iii) it is common for children to allocate memory after forking; and (iv) children often execute similar code and access common data structures.
The first two insights led to the design of VM Descriptors, a lightweight mechanism which instantiates a new forked VM with only the critical metadata needed to start execution on a remote site, and Memory-On-Demand, a mechanism whereby clones lazily fetch portions of VM state over the network as it is accessed. Our experience is that it is possible to start a child VM by shipping only 0.1% of the state of the parent, and that children tend to only require a fraction of the original memory image of the parent. Further, it is common for children to allocate memory after forking, e.g., to read portions of a remote dataset or allocate local storage. This leads to fetching of pages from the parent that will be immediately overwritten. We observe that by augmenting the guest OS with avoidance heuristics, memory allocation can be handled locally by avoiding fetching pages that will be immediately recycled. We show in Section 5 that this optimization can reduce communication drastically to a mere 40 MB for application footprints of 1 GB (4%!). Whereas these observations are based on our work with parallel workloads, they are likely to hold in other domains where a parent node spawns children as workers that execute limited tasks, e.g., load handling in web services.
Compared to ballooning [Waldspurger 2002], memory-on-demand is a non-intrusive approach that reduces state transfer without altering the behaviour of the guest OS. Ballooning a VM down to the easily manageable footprint that our design achieves would trigger swapping and lead to abrupt termination of processes. Another non-intrusive approach for minimizing memory usage is copy-on-write, used by Potemkin [Vrable 2005]. However, copy-on-write limits Potemkin to cloning VMs within the same host, whereas we fork VMs across physical hosts. Further, Potemkin does not provide runtime stateful cloning, since all new VMs are copies of a frozen template.
To take advantage of the high correlation across memory accesses of the children (insight iv) and prevent the parent from becoming a hot spot, we multicast replies to memory page requests. Multicast provides scalability and prefetching: it may service a page request from any number of children with a single response, simultaneously prefetching the page for all children that did not yet request it. Our design is based on the observation that the multicast protocol does not need to provide atomicity, ordering guarantees, or reliable delivery to prefetching receivers in order to be effective. Children operate independently and individually ensure delivery of needed pages; a single child waiting for a page does not prevent others from making progress.
Lazy state replication and multicast are implemented within the Virtual Machine Monitor (VMM) in a manner transparent to the guest OS. Our avoidance heuristics improve performance by adding VM-fork awareness to the guest. Uncooperative guests can still use VM fork, with reduced efficiency depending on application memory access patterns.

4. SnowFlock Implementation
SnowFlock is our implementation of the VM fork primitive. SnowFlock is an open-source project [SnowFlock] built on the Xen 3.0.3 VMM [Barham 2003]. Xen consists of a hypervisor running at the highest processor privilege level, controlling the execution of domains (VMs). The domain kernels are paravirtualized, i.e. aware of virtualization, and interact with the hypervisor through a hypercall interface. A privileged VM (domain 0) has control over hardware devices and manages the state of all other domains.
SnowFlock is implemented as a combination of modifications to the Xen VMM and daemons that run in domain 0. The SnowFlock daemons form a distributed system that controls the life-cycle of VMs, by orchestrating their cloning and deallocation. SnowFlock defers policy decisions, such as resource accounting and the allocation of VMs to physical hosts, to suitable cluster management software via a plug-in architecture. SnowFlock currently supports allocation management with Platform EGO [Platform] and Sun Grid Engine [Gentzsch 2001]. Throughout this paper we use a simple internal resource manager which tracks memory and CPU allocations on each physical host.
SnowFlock's VM fork implementation is based on lazy state replication combined with avoidance heuristics to minimize state transfer. In addition, SnowFlock leverages multicast to propagate state in parallel and exploit the substantial temporal locality in memory accesses across forked VMs to provide prefetching. SnowFlock uses four mechanisms to fork a VM. First, the parent VM is temporarily suspended to produce a VM descriptor: a small file that contains VM metadata and guest kernel memory management data. The VM descriptor is then distributed to other physical hosts to spawn new VMs; the entire operation is complete in sub-second time. Second, our memory-on-demand mechanism, memtap, lazily fetches additional VM memory state as execution proceeds. Third, the avoidance heuristics leverage the cooperation of the guest kernel to substantially reduce the amount of memory that needs to be fetched on demand. Finally, our multicast distribution system mcdist is used to deliver VM state simultaneously and efficiently, as well as providing implicit prefetching.
The next subsection describes the SnowFlock API. We then describe in detail each of the four SnowFlock mechanisms. For each we present micro-benchmark results that show their effectiveness (see Section 5 for testbed details). We finish this section by discussing the specifics of the virtual I/O devices of a SnowFlock VM, namely the virtual disk and network isolation implementations.
4.1 API
Table 1 describes the SnowFlock API. VM fork in SnowFlock consists of two stages. First, the application uses sf_request_ticket to place a reservation for the desired number of clones.
• sf_request_ticket(n, hierarchical): Requests an allocation for n clones. If hierarchical is true, process fork will follow VM fork, to occupy the cores of SMP cloned VMs. Returns a ticket describing an allocation for m ≤ n clones.
• sf_clone(ticket): Clones, using the allocation in the ticket. Returns the clone ID, 0 ≤ ID ≤ m.
• sf_exit(): For children (1 ≤ ID ≤ m), terminates the child.
• sf_join(ticket): For the parent (ID = 0), blocks until all children in the ticket reach their sf_exit call. At that point all children are terminated and the ticket is discarded.
• sf_kill(ticket): Parent only, immediately terminates all children in ticket and discards the ticket.

Table 1: The SnowFlock VM Fork API
To optimize for common use cases in SMP hardware, VM fork can be followed by process replication: the set of cloned VMs span multiple hosts, while the processes within each VM span the underlying physical cores. This behaviour is optionally available if the hierarchical flag is set. Due to user quotas, current load, and other policies, the cluster management system may allocate fewer VMs than requested. In this case the application has the option to re-balance the computation to account for the smaller allocation. In the second stage, we fork the VM across the hosts provided by the cluster management system with the sf_clone call. When a child VM finishes its part of the computation, it executes an sf_exit operation which terminates the clone. A parent VM can wait for its children to terminate with sf_join, or force their termination with sf_kill.
The API calls from Table 1 are available to applications via a SnowFlock client library, with C and Python bindings. The client library marshals API calls and communicates them to the SnowFlock daemon running on domain 0 over a shared memory interface.
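As an illustration, the sketch below forks a simple parallel job through the Python bindings. The module name sf is an assumption made for this example; only the call names and semantics follow Table 1.

    import sf   # hypothetical name for the SnowFlock Python bindings

    def work(vmid):
        print("clone", vmid, "processing its share")   # placeholder for real computation

    def parallel_job(n):
        # Stage 1: reserve up to n clones; the cluster manager may grant m <= n.
        ticket = sf.sf_request_ticket(n, hierarchical=True)
        # Stage 2: fork. sf_clone returns 0 in the parent and 1..m in each child.
        vmid = sf.sf_clone(ticket)
        if vmid != 0:
            work(vmid)        # child: process the share selected by its clone ID
            sf.sf_exit()      # child terminates here and never returns
        sf.sf_join(ticket)    # parent: block until every child has called sf_exit

    parallel_job(32)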
While the SnowFlock API is simple and flexible, it nonetheless demands modification of existing code bases.
A SnowFlock-friendly implementation of the widely used Message Passing Interface (MPI) library allows a vast corpus of unmodified parallel applications to use SnowFlock's capabilities. Based on mpich2 [MPICH], our implementation replaces the task-launching subsystem of the library by one which invokes sf_clone to create the desired number of clones. Appropriately parameterized worker processes are started on each clone. Each worker uses unmodified MPI routines from then on, until the point of application termination, when the VMs are joined. Future work plans include performing a similar adaptation of the MapReduce toolkit [Dean 2004].
4.2 VM Descriptors
A VM Descriptor is a condensed VM image that allows swift VM replication to a separate physical host. Construction of a VM descriptor starts by spawning a thread in the VM kernel that quiesces its I/O devices, deactivates all but one of the virtual processors (VCPUs), and issues a hypercall suspending the VM's execution. When the hypercall succeeds, a privileged process in domain 0 maps the suspended VM memory to populate the descriptor. The descriptor contains: (1) metadata describing the VM and its virtual devices, (2) a few memory pages shared between the VM and the Xen hypervisor, (3) the registers of the main VCPU, (4) the Global Descriptor Tables (GDT) used by the x86 segmentation hardware for memory protection, and (5) the page tables of the VM.
The page tables make up the bulk of a VM descriptor. In addition to those used by the guest kernel, each process in the VM generally needs a small number of additional page tables. The cumulative size of a VM descriptor is thus loosely dependent on the number of processes the VM is executing. Entries in a page table are "canonicalized" before saving. They are translated from references to host-specific pages to frame numbers within the VM's private contiguous physical space ("machine" and "physical" addresses in Xen parlance, respectively). A few other values included in the descriptor, e.g. the cr3 register of the saved VCPU, are also canonicalized.
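The translation can be sketched as follows. This is a simplification that assumes the 64-bit x86 PTE layout (frame number in bits 12-51) and hypothetical machine-to-physical and physical-to-machine maps; the actual code operates on Xen's own page table formats.

    ADDR_MASK = 0x000FFFFFFFFFF000   # frame-number bits of a 64-bit x86 PTE

    def canonicalize_pte(pte, m2p):
        # Replace the host-specific (machine) frame number with the VM-private
        # (physical) frame number, preserving all flag bits.
        mfn = (pte & ADDR_MASK) >> 12
        flags = pte & ~ADDR_MASK
        return (m2p[mfn] << 12) | flags

    def uncanonicalize_pte(pte, p2m):
        # Inverse step on the host that resumes the clone: map the VM-private
        # frame number to a machine frame on the new host.
        pfn = (pte & ADDR_MASK) >> 12
        flags = pte & ~ADDR_MASK
        return (p2m[pfn] << 12) | flags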
The resulting descriptor is multicast to multiple physical hosts using the mcdist library we describe in Section 4.5, and used to spawn a clone VM on each host. The metadata is used to allocate a VM with the appropriate virtual devices and memory footprint. All state saved in the descriptor is loaded: pages shared with Xen, segment descriptors, page tables, and VCPU registers. Physical addresses in page table entries are translated to use the new mapping between VM-specific physical addresses and host machine addresses. The VM replica resumes execution, enables the extra VCPUs, and reconnects its virtual I/O devices to the new frontends.
Figure 3: Fast Clone Creation. Legend order matches bar stacking from top to bottom.
Evaluation Figure 3 presents our evaluation of the VM descriptor mechanism, showing the time spent replicating a single-processor VM with 1 GB of RAM to n clones in n physical hosts. The average size of a VM descriptor for these experiments was 1051 ± 7 KB. The time to create a descriptor is "Save Time" (our code) plus "Xend Save" (recycled and unmodified Xen code). "Starting Clones" is the time spent distributing the order to spawn a clone to each host. Clone creation in each host is composed of "Fetch Descriptor" (wait for the descriptor to arrive), "Restore Time" (our code) and "Xend Restore" (recycled Xen code). At this point, all clones have begun execution.
Overall, VM replication is a fast operation, ranging in general from 600 to 800 milliseconds. Further, VM replication time is largely independent of the number of clones created. Larger numbers of clones introduce, however, a wider variance in the total cloning time. The variance is typically seen in the time to multicast the VM descriptor, and is due in part to a higher likelihood that on some host a scheduling or I/O hiccup might delay the VM resume for longer than the average. Despite this small variance, the net result is sub-second VM cloning time irrespective of the size of the VM.
4.3 Memory-On-Demand
Immediately after being instantiated from a descriptor, the VM will find it is missing state needed to proceed. In fact, the code page containing the very first instruction the VM tries to execute upon resume will be missing. SnowFlock's memory-on-demand subsystem, memtap, handles this situation by lazily populating the clone VM's memory with state fetched from the parent, where an immutable copy of the VM's memory from the time of cloning is kept.
Memtap is a combination of hypervisor logic and a user-space domain 0 process associated with the clone VM. Memtap implements a copy-on-access policy for the clone VM's memory. The hypervisor detects when a missing page will be accessed for the first time by a VCPU, pauses that VCPU, and notifies the memtap process. The memtap process maps the missing page, fetches its contents from the parent, and notifies the hypervisor that the VCPU may be unpaused.
To allow the hypervisor to trap memory accesses to pages that have not yet been fetched, we use Xen shadow page tables. In shadow page table mode, the x86 register indicating the page table currently in use (cr3) is replaced by a pointer to an initially empty page table. The shadow page table is filled on demand as faults on its empty entries occur, by copying entries from the real page table. Shadow page table faults thus indicate that a page of memory is about to be accessed. If this is the first access to a page of memory that has not yet been fetched, the hypervisor notifies memtap. Fetches are also triggered by trapping modifications to page table entries, and accesses by domain 0 of the VM's memory for the purpose of virtual device DMA.
On the parent VM, memtap implements a copy-on-write policy to serve the memory image to clone VMs. To preserve a copy of the memory image at the time of cloning, while still allowing the parent VM to execute, we use shadow page tables in "log-dirty" mode. All parent VM memory write attempts are trapped by disabling the writable bit on shadow page table entries. Upon a write fault, the hypervisor duplicates the page and patches the mapping of the memtap server process to point to the duplicate. The parent VM is then allowed to continue execution.
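The parent-side policy can be illustrated with a small user-level analogue (a sketch only; the real mechanism operates on write faults against shadow page table entries inside Xen): the first write to a page after cloning preserves its original contents for the memtap server.

    class CowSnapshot:
        # Preserve the memory image at clone time while the parent keeps running.

        def __init__(self, memory):
            self.memory = memory   # list of bytearrays, one per page (the live parent memory)
            self.frozen = {}       # page number -> copy taken before the first post-clone write

        def write(self, page_no, offset, data):
            # Analogue of the write fault in log-dirty mode: duplicate the page
            # once, then let the parent modify its live copy freely.
            if page_no not in self.frozen:
                self.frozen[page_no] = bytes(self.memory[page_no])
            self.memory[page_no][offset:offset + len(data)] = data

        def serve(self, page_no):
            # What the memtap server sends to clones: the frozen copy if the parent
            # has written the page since cloning, the live page otherwise.
            return self.frozen.get(page_no, bytes(self.memory[page_no]))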
Our implementation of memory-on-demand is SMP-safe.
A shared bitmap is used by Xen and memtap to indicate the presence of VM memory pages. The bitmap is initialized when the VM is built from a descriptor, and is accessed in a lock-free manner with atomic (test_and_set, etc.) operations. When trapping a shadow page table on-demand fill, Xen checks the present bit of the faulting page, notifies memtap, and buffers the write of the shadow entry. Another VCPU using the same page table entry will fault on the still empty shadow entry. Another VCPU using a different page table entry but pointing to the same VM-physical address will also fault on the not-yet-set bitmap entry. In both cases the additional VCPUs are paused and queued, waiting for the first successful fetch of the missing page. When memtap notifies completion of the fetch, the present bit is set, pending shadow page table writes are applied, and all queued VCPUs are unpaused.
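A user-level analogue of this clone-side logic is sketched below, assuming a fetch_from_parent callback; in the real system the roles played here by the lock and the event correspond to the shared bitmap, paused VCPUs, and buffered shadow page table writes.

    import threading

    class OnDemandMemory:
        def __init__(self, num_pages, fetch_from_parent):
            self.present = bytearray(num_pages)   # analogue of the shared presence bitmap
            self.pages = [None] * num_pages
            self.fetch = fetch_from_parent        # callback returning a page's contents
            self.locks = [threading.Lock() for _ in range(num_pages)]
            self.fetched = [threading.Event() for _ in range(num_pages)]

        def access(self, page_no):
            if self.present[page_no]:
                return self.pages[page_no]        # fast path: page already present
            if self.locks[page_no].acquire(blocking=False):
                try:
                    if not self.present[page_no]:
                        # First faulting "VCPU": fetch, set the present bit, wake waiters.
                        self.pages[page_no] = self.fetch(page_no)
                        self.present[page_no] = 1
                        self.fetched[page_no].set()
                finally:
                    self.locks[page_no].release()
            else:
                # Additional "VCPUs" faulting on the same page queue until the fetch completes.
                self.fetched[page_no].wait()
            return self.pages[page_no]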
Evaluation To understand the overhead involved in our memory-on-demand subsystem, we devised a microbenchmark in which a VM allocates and fills in a number of memory pages, invokes SnowFlock to have itself replicated, and then touches each page in the allocated set. The results for multiple microbenchmark runs totaling ten thousand page fetches are displayed in Figure 4(a).
The overhead of page fetching is modest, averaging 275 µs with unicast (standard TCP). We split a page fetch operation into six components. "Page Fault" indicates the hardware page fault overheads caused by using shadow page tables. "Xen" is the cost of the Xen hypervisor shadow page table logic. "HV Logic" is the time consumed by our hypervisor logic: bitmap checking and SMP safety. "Dom0 Switch" is the time to context switch to the domain 0 memtap process, while "Memtap Logic" is the time spent by the memtap internals, consisting mainly of mapping the faulting VM page. The bulk of the page-fetching time is spent in the sixth component, "Network", which depicts the software (libc and Linux kernel TCP stack) and hardware overheads of remotely fetching the page contents over gigabit Ethernet. Our implementation is frugal and efficient, and the bulk of the overhead (82%) comes from the network stack.
4.4 Avoidance Heuristics
The previous section shows that memory-on-demand guarantees correct VM execution and is able to fetch pages of memory with speed close to bare TCP/IP. However, fetching pages from the parent still incurs an overhead that may prove excessive for many workloads. We thus augmented the VM kernel with two fetch-avoidance heuristics that allow us to bypass large numbers of unnecessary memory fetches, while retaining correctness.
The first heuristic optimizes the general case in which a clone VM allocates new state. The heuristic intercepts pages selected by the kernel's page allocator. The kernel page allocator is invoked when more memory is needed by a kernel subsystem, or by a user-space process, typically requested indirectly via a malloc call. The semantics of these operations imply that the recipient of the selected pages does not care about the pages' previous contents. If the pages have not yet been fetched, there is no reason to do so.
The second heuristic addresses the case where a virtual I/O device writes to the guest memory. Consider the case of block I/O: the target page is typically a kernel buffer that is being recycled and whose previous contents do not need to be preserved. Again, there is no need to fetch this page.
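Both heuristics reduce to the same guest-side action, sketched below with hypothetical helper names: mark the page as present in the bitmap that memtap consults (the next paragraph describes how that bitmap is exposed to the guest), so that no fetch is ever issued for it.

    def mark_no_fetch(present_bitmap, page_no):
        # "Fake" the page's presence so Xen/memtap never request it from the parent.
        present_bitmap[page_no] = 1

    def alloc_pages(present_bitmap, free_list, count):
        # Heuristic 1: pages handed out by the kernel allocator will be overwritten
        # by their new owner, so their pre-fork contents are never needed.
        pages = [free_list.pop() for _ in range(count)]
        for page_no in pages:
            mark_no_fetch(present_bitmap, page_no)
        return pages

    def receive_block_io(present_bitmap, memory, target_page, data):
        # Heuristic 2: a virtual I/O write recycles the whole target buffer,
        # so fetch-on-first-access is skipped for it as well.
        mark_no_fetch(present_bitmap, target_page)
        memory[target_page] = data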
The fetch-avoidance heuristics are implemented by mapping the memtap bitmap in the guest kernel's address space. When the kernel decides a page should not be fetched, it "fakes" the page's presence by setting the corresponding bit, and thus prevents Xen or memtap from fetching it.

Evaluation We evaluate the effect of the guest avoidance heuristics using SHRiMP, one of the applications described in Section 5.1. Our SHRiMP macrobenchmark spawns n uniprocessor VM clones and runs on each a task that demands 1 GB of RAM. Figure 4(b) illustrates the results for 32 clones. While we also vary the choice of networking substrate between unicast and multicast, we study here the effect of the heuristics; we will revisit Figure 4(b) in the next subsection. Experiments with smaller memory footprints and different numbers of clones show similar results and are therefore not shown.
The avoidance heuristics result in substantial benefits, in terms of both runtime and data transfer. Nearly all of SHRiMP's memory footprint is allocated from scratch when the inputs are loaded. The absence of heuristics forces the VMs to request pages they do not really need, inflating the number of requests from all VMs by two orders of magnitude. With the heuristics, state transmissions to clones are reduced to 40 MB, a tiny fraction (3.5%) of the VM's footprint.
4.5 Multicast Distribution
Mcdist is our multicast distribution system that efficiently provides data to all cloned VMs simultaneously. It accomplishes two goals that are not served by point-to-point communication. First, data needed by clones is often prefetched. Once a single clone requests a page, the response also reaches all other clones. Second, the load on the network is greatly reduced by sending a piece of data to all VM clones with a single operation. This improves scalability of the system, as well as better allowing multiple sets of clones to co-exist in the cloud.
The mcdist server design is minimalistic, containing only switch programming and flow control logic.
Figure 4: Evaluation of SnowFlock Design Principles. Legend order matches bar stacking, top to bottom. (b) SHRiMP, 1 GB footprint; bars show aggregate page requests from 32 clones vs. pages sent; labels show average benchmark completion times. (c) Multicast Scales: BLAST, 256 MB DB; speedup vs. one thread, labels show absolute times.
No atomicity or ordering guarantees are given by the server, and requests are processed on a first-come, first-served basis. Ensuring reliability thus falls to receivers, through a timeout mechanism. We use IP-multicast in order to send data to multiple hosts simultaneously. IP-multicast is supported by most off-the-shelf commercial Ethernet hardware. Switches and routers maintain group membership information and simultaneously send a frame destined for a multicast group to all subscribed hosts. We believe this approach scales to large clusters; IP-multicast hardware is capable of scaling to thousands of hosts and multicast groups, automatically relaying multicast frames across multiple hops.
The mcdist clients are memtap processes, which will receive pages asynchronously and unpredictably in response to requests by fellow VM clones. For efficiency, memtap clients batch received pages until a threshold is hit, or until a page that has been explicitly requested arrives. A single hypercall is invoked to map the pages in a batch. A threshold of 1024 pages has proven to work well in practice.
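The client-side batching policy can be sketched as follows; map_batch stands in for the actual mapping hypercall and its name is an assumption.

    BATCH_THRESHOLD = 1024   # pages buffered before issuing a single mapping call

    class BatchingClient:
        def __init__(self, map_batch, threshold=BATCH_THRESHOLD):
            self.map_batch = map_batch   # callback standing in for the mapping hypercall
            self.threshold = threshold
            self.pending = []            # (page_no, contents) received but not yet mapped
            self.wanted = set()          # pages this clone has explicitly requested

        def request(self, page_no):
            self.wanted.add(page_no)

        def on_page(self, page_no, contents):
            # Pages arrive asynchronously, often as prefetches triggered by sibling clones.
            self.pending.append((page_no, contents))
            if len(self.pending) >= self.threshold or page_no in self.wanted:
                self.map_batch(self.pending)   # one call maps the whole batch
                self.wanted.difference_update(p for p, _ in self.pending)
                self.pending = []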
To maximize total goodput, the server uses flow control logic to limit its sending rate and avoid overwhelming busy clients. Both the server and clients estimate their send and receive rate using a weighted average of the number of bytes transmitted or received every ten milliseconds. Clients provide explicit feedback about their current rate to the server in request messages. The server increases its rate limit linearly and, when a loss is detected through a client request for data that has already been sent, the server scales its rate limit back. We found that scaling the rate back to three quarters of the estimated mean client receive rate works effectively.
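The server-side rate control can be sketched as follows; the additive increment is illustrative, while the three-quarters backoff factor follows the text above.

    class McdistRateLimiter:
        # Increase the send rate linearly; on detected loss, back off to 3/4 of the
        # mean client receive rate.

        def __init__(self, initial_rate, increment):
            self.rate_limit = initial_rate    # bytes per interval the server may send
            self.increment = increment        # illustrative additive step
            self.client_rates = {}            # client id -> receive rate piggybacked on requests

        def on_client_request(self, client_id, client_rate, already_sent):
            self.client_rates[client_id] = client_rate
            if already_sent:
                # A request for data already sent signals a loss: scale the limit back.
                mean = sum(self.client_rates.values()) / len(self.client_rates)
                self.rate_limit = 0.75 * mean
            else:
                self.rate_limit += self.increment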
Another server flow control mechanism is lockstep detection, which aims to leverage the similarity in memory access patterns across clones. For example, shortly after cloning, VM clones share the same code paths due to a deterministic sequence of kernel functions called during resumption of the suspended VM. Large numbers of identical page requests are generated at ostensibly the same time, i.e. in "lockstep". Thus, when multiple requests for the same page are received in succession, the server ignores duplicate requests immediately following the first. If the identical requests are due to lost packets as opposed to lockstep, they will eventually be serviced when the request is retransmitted by the client.

Evaluation We evaluate the effects of multicast, revisiting the results obtained with SHRiMP. Recall that in Figure 4(b) we spawn 32 uniprocessor VM clones and run on each a SHRiMP task that demands 1 GB of RAM. Figure 4(b) shows that our multicast distribution's lockstep avoidance works effectively: lockstep-executing VMs issue simultaneous requests that are satisfied by a single response from the server. Hence the difference between the "Requests" and "Served" bars in the multicast experiments. Further, even under the extreme pressure of an uncooperative guest with disabled heuristics, the number of pages served is reduced dramatically, and extra overhead reduced to a minute.
Figure 4(c) shows the benefit of mcdist for a case where an important portion of memory state is needed after cloning and thus the avoidance heuristics cannot help. The figure shows results from an experiment conducted with NCBI BLAST (described in Section 5.1), which executes queries against a 256 MB portion of the NCBI genome database that the parent caches into memory before cloning. The figure shows speedup results for SnowFlock using unicast and multicast, and an idealized zero-cost fork configuration in which VMs have been previously allocated, with no cloning or state-fetching overhead. Mcdist achieves almost linear speedup, closely tracking the speedup exhibited with ideal execution, while unicast does not scale beyond 16 clones.
4.6 Virtual I/O Devices in SnowFlock
Outside of the four techniques addressing fast and scalable VM replication, SnowFlock provides a virtual disk for each clone VM and guarantees secure network isolation.

4.6.1 Virtual Disk
The virtual disks of SnowFlock VMs are implemented with a blocktap [Warfield 2005] driver. Multiple views of the virtual disk are supported by a hierarchy of copy-on-write (COW) slices located at the site where the parent VM runs. Each fork operation adds a new COW slice, rendering the previous state of the disk immutable, and launches a disk server process that exports the view of the disk up to the point of cloning. Children access a sparse local version of the disk, with the state from the time of cloning fetched on demand from the disk server. The virtual disk exploits the same optimizations as the memory subsystem: unnecessary fetches during writes are avoided using heuristics, and the original disk state is provided to all clients simultaneously via multicast.
In our usage model, the virtual disk is used as the base root partition for the VMs. For data-intensive tasks, we envision serving data volumes to the clones through network file systems such as NFS, or suitable big-data filesystems such as Hadoop [Hadoop] or Lustre [Braam 2002]. The separation of responsibilities results in our virtual disk not being heavily exercised. Most work done by clones is processor intensive, writes do not result in fetches, and the little remaining disk activity mostly hits kernel caches. Our implementation largely exceeds the demands of many realistic tasks and did not cause any noticeable overhead for the experiments in Section 5.
4.6.2 Network Isolation
In order to prevent interference or eavesdropping between unrelated VMs on the shared network, either malicious or accidental, we employ a mechanism to isolate the network. Isolation is performed at the level of Ethernet packets, the primitive exposed by Xen virtual network devices. Before being sent on the shared network, the source MAC addresses of packets sent by a SnowFlock VM are rewritten as a special address which is a function of both the parent and child identifiers. Simple filtering rules are used by all hosts to ensure that no packets delivered to a VM come from VMs that are not its parent or a sibling. Conversely, when a packet is delivered to a SnowFlock VM, the destination MAC address is rewritten to be as expected, rendering the entire process transparent. Additionally, a small number of special rewriting rules are required for protocols with payloads containing MAC addresses, such as ARP. Despite this, filtering and rewriting impose an imperceptible overhead while maintaining full IP compatibility.
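For illustration, the rewriting and filtering can be sketched as below; the text does not specify how the identifiers are encoded into the address, so the layout (a locally administered prefix followed by the parent and child IDs) is an assumption.

    def snowflock_mac(parent_id, child_id):
        # Assumed layout: locally administered prefix, then parent and child identifiers.
        return "0a:53:46:%02x:%02x:%02x" % (
            parent_id & 0xFF, (child_id >> 8) & 0xFF, child_id & 0xFF)

    def rewrite_egress(frame, parent_id, child_id):
        # Outbound: replace the guest's source MAC with the special address.
        frame["src_mac"] = snowflock_mac(parent_id, child_id)
        return frame

    def ingress_allowed(frame, parent_id):
        # Filtering: accept only frames whose source MAC encodes the same parent
        # identifier, i.e. frames from the parent or a sibling clone.
        return frame["src_mac"].startswith("0a:53:46:%02x:" % (parent_id & 0xFF))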
5. Application Evaluation
Our evaluation of SnowFlock focuses on a particularly demanding scenario: the ability to deliver interactive parallel computation, in which a VM forks multiple workers to participate in a short-lived, computationally-intensive parallel job. This scenario matches existing bioinformatics web services like BLAST [NCBI] or ClustalW [EBI], and other Internet services like render or compile farms. Users interact with a web frontend and submit queries that are serviced by an embarrassingly parallel algorithm run on a compute cluster. These services are thus capable of providing interactive responses in the range of tens of seconds to computationally-demanding queries.
All of our experiments were carried out on a cluster of 32 Dell PowerEdge 1950 blade servers. Each host had 4 GB of RAM, 4 Intel Xeon 3.2 GHz cores, and a Broadcom NetXtreme II BCM5708 gigabit NIC. All machines were running the SnowFlock prototype based on Xen 3.0.3, with paravirtualized Linux 2.6.16.29 running as the OS for both host and guest VMs. The VMs were configured with 1124 MB of RAM. All machines were connected to two daisy-chained Dell PowerConnect 5324 gigabit switches. All results reported are the means of five or more runs, and error bars depict standard deviations.
5.1 Applications
We tested SnowFlock with 3 typical applications from bioinformatics and 3 applications representative of the fields of graphics rendering, parallel compilation, and financial services. We devised workloads for these applications with runtimes ranging above an hour on a uniprocessor machine, but which can be reduced to interactive response times if over a hundred processors are available. Application experiments are driven by a workflow shell script that clones the VM and launches an application process properly parameterized according to the clone ID. The exception to this technique is ClustalW, where we modify the application code directly.

NCBI BLAST [Altschul 1997] is perhaps the most popular computational tool used by biologists. BLAST searches a database of biological sequences, strings of characters representing DNA or proteins, to find sequences similar to a query. We experimented with a BLAST search using 1200 short protein fragments from the sea squirt Ciona savignyi to query a 1.5 GB portion of NCBI's non-redundant protein database. VM clones access the database files via an NFS share. Database access is parallelized across VMs, each reading a different segment, while query processing is parallelized across process-level clones within each VM.

SHRiMP [SHRiMP] is a tool for aligning large collections of very short DNA sequences ("reads") against a known genome. This time-consuming task can be easily parallelized by dividing the collection of reads among many processors. While similar overall to BLAST, SHRiMP is designed for dealing with very short queries and very long sequences, and is more memory intensive. In our experiments we attempted to align 1.9 million 25-letter-long reads, extracted from a Ciona savignyi, to a 5.2 million letter segment of the known C. savignyi genome.
ClustalW [EBI] is a popular program for generating a multiple alignment of a collection of protein or DNA sequences. Like BLAST, ClustalW is offered as a web service by organizations owning large computational resources [EBI]. ClustalW builds a guide tree using progressive alignment, a greedy heuristic requiring precomputation of comparisons between all pairs of sequences. The pairwise comparison is computationally intensive and embarrassingly parallel, since each pair of sequences can be aligned independently. After cloning, each child computes the alignment of a set of pairs statically assigned according to the clone ID. The result of each alignment is a similarity score. Simple socket code allows these scores to be relayed to the parent, before joining the forked VMs. Using this implementation we conducted experiments performing guide-tree generation by pairwise alignment of 200 synthetic protein sequences of 1000 amino acids (characters) each.
QuantLib [Quantlib] is an open source toolkit widely used in quantitative finance. It provides models for stock trading, equity option pricing, risk analysis, etc. A typical quantitative finance program using QuantLib runs a model over a large array of parameters (e.g. stock prices) and is thus easily parallelizable by splitting the input. In our experiments we processed 1024 equity options using a set of Monte Carlo, binomial and Black-Scholes models while varying the initial and striking prices, and the volatility. The result is the set of probabilities yielded by each model to obtain the desired striking price for each option.
Aqsis – Renderman [Aqsis] is an open source implementation of Pixar's RenderMan interface [Pixar], an industry standard widely used in films and television visual effects. Aqsis accepts scene descriptions produced by a modeler and specified in the RenderMan Interface Bitstream (RIB) language. Rendering is easy to parallelize: multiple instances can each perform the same task on different frames of an animation. For our experiments we fed Aqsis a sample RIB script from the book "Advanced RenderMan" [Apodaka 2000].
Distcc [distcc] is software which distributes builds of C/C++ programs over the network for parallel compilation. Distcc is not embarrassingly parallel: actions are tightly coordinated by a parent farming out preprocessed files for compilation by children. Resulting object files are retrieved from the children for linking. Preprocessed code includes all relevant headers, thus simplifying requirements on children to just having the same version of the compiler. In our experiments we compile the Linux kernel version 2.6.16.29.
5.2 Results
We executed the above applications in SnowFlock, enabling the combination of VM fork and process fork to take advantage of our SMP hardware. For each application we spawn 128 threads of execution: 32 4-core SMP VMs on 32 physical hosts. We aim to answer the following questions:
• How does SnowFlock compare to other methods for instantiating VMs?
• How close does SnowFlock come to achieving optimal application speedup?
• How scalable is SnowFlock? How does it perform in cloud environments with multiple applications simultaneously and repeatedly forking VMs, even under adverse VM allocation patterns?
Comparison Table 2 illustrates the substantial gains SnowFlock provides in terms of efficient VM cloning and application performance. The table shows results for SHRiMP using 128 processors under three configurations: SnowFlock with all the mechanisms described in Section 4, and two versions of Xen's standard suspend/resume that use NFS and multicast to distribute the suspended VM image. The results show that SnowFlock significantly improves execution time with respect to traditional VM management techniques, with gains ranging between a factor of two and an order of magnitude. Further, SnowFlock is two orders of magnitude better than traditional VM management techniques in terms of the amount of VM state transmitted. Despite the large memory footprint of the application (1 GB), SnowFlock is capable of transmitting, in parallel, little VM state. Experiments with our other benchmarks show similar results and are therefore not shown.
                     Time (s)          State (MB)
SnowFlock            70.63 ± 0.68      41.79 ± 0.7
S/R over multicast   157.29 ± 0.97     1124
S/R over NFS         412.29 ± 11.51    1124

Table 2: SnowFlock vs. VM Suspend/Resume. SHRiMP, 128 threads. Benchmark time and VM state sent.
Figure 5: Application Benchmarks. Applications run with 128 threads: 32 VMs × 4 cores. Bars show speedup vs. a single-thread zero-cost baseline. Labels show time to completion in seconds.

Application Performance Figure 5 compares SnowFlock to an optimal "zero-cost fork" baseline. We compare against a baseline with 128 threads to measure overhead, and against a baseline with one thread to measure speedup. Zero-cost results are obtained with VMs previously allocated, with no cloning or state-fetching overhead, and in an idle state, ready to process the jobs allotted to them. As the name implies, zero-cost results are overly optimistic and not representative of cloud computing environments, in which aggressive consolidation of VMs is the norm and instantiation times are far from instantaneous. The zero-cost VMs are vanilla Xen 3.0.3 domains configured identically to SnowFlock VMs in terms of kernel version, disk contents, RAM, and number of processors.
SnowFlock performs extremely well and succeeds in reducing execution time from hours to tens of seconds for all the benchmarks. Moreover, SnowFlock achieves speedups that are very close to the zero-cost optimal, and comes within 7% of the optimal runtime. The overheads of VM replication and on-demand state fetching are small. ClustalW, in particular, yields the best results with less than two seconds of overhead for a 25-second task. This shows that tighter coupling of SnowFlock into application logic is beneficial.

Scale and Agility We address SnowFlock's capability to support multiple concurrent forking VMs. We launch four VMs that each simultaneously forks 32 uniprocessor VMs. To stress the system, after completing a parallel task, each parent VM joins and terminates its children and immediately launches another parallel task, repeating this cycle five times. Each parent VM runs a different application; we selected the four applications that exhibited the highest degree of parallelism (and child occupancy): SHRiMP, BLAST, QuantLib, and Aqsis. To further stress the system, we abridged the length of the cyclic parallel task so that each cycle would finish in between 20 and 35 seconds. We employed an "adversarial allocation" in which each task uses 32 processors, one per physical host, so that 128 SnowFlock VMs are active at most times, and each physical host needs to fetch state from four parent VMs. The zero-cost results were obtained with an identical distribution of VMs; since there is no state-fetching performed in the zero-cost case, the actual allocation of VMs does not affect those results.
SnowFlock's results track the zero-cost ideal closely; this is mainly due to the small overall number of memory pages sent by the combined efforts of the guest heuristics and multicast distribution. The introduction of multiple forking VMs causes no significant increase in overhead, although outliers with higher time to completion are seen, resulting in wider error bars. These outliers are caused by occasional congestion when receiving simultaneous bursts of VM state for more than one VM; we believe optimizing mcdist will yield more consistent running times. To summarize, SnowFlock is capable of forking VMs to perform a 32-host, 40-second or shorter parallel computation, with five seconds or less of overhead, in spite of the pressure of adversarial allocations and repeated concurrent forking activity.
6. Related Work
To the best of our knowledge, we are the first group to address the problem of low-latency stateful replication of VMs to facilitate application deployment in cluster or cloud computing environments, including the ability to deliver instantaneous parallelism. A number of projects have explored the area of VM replication. The Potemkin project [Vrable 2005] implements a honeypot spanning a large IP address range. Honeypot machines are short-lived lightweight VMs cloned from a static template in the same machine with memory copy-on-write techniques. Potemkin does not address parallel applications and does not fork multiple VMs to different hosts. Remus [Cully 2008] provides instantaneous failover by keeping an up-to-date replica of a VM in a separate host.
Copy-on-reference, first used for process migration [Theimer 1985] in Accent [Zayas 1987], is a precursor to our memory-on-demand technique. Wide-area VM migration projects [Lagar-Cavilla 2007, Sapuntzakis 2002, Kozuch 2002] have used lazy copy-on-reference for VM disk state. The low frequency and coarse granularity of access of secondary storage allows copying large batches of state over low-bandwidth high-latency links.
One objective of SnowFlock is to complement the capabilities of a shared computing platform. The Amazon Elastic Compute Cloud [EC2] is the foremost utility computing platform in operation today. While the details are not publicly known, we believe it follows industry standard techniques for the provisioning of VMs on the fly [Steinder 2007]: consolidation via memory sharing [Waldspurger 2002] or ballooning, resuming from disk, live migration [Clark 2005, VMotion], etc. Amazon's EC2 claims to instantiate multiple VMs in "minutes" – insufficient performance for the agility objectives of SnowFlock.
Similarly, work focusing on multiplexing a set of VMs on a physical cluster has typically resorted to standard techniques, without addressing issues studied here, such as programming primitives and performance capabilities. The term "virtual cluster" is used by many projects [Emeneker 2007, Foster 2006, Chase 2003] focusing on resource provisioning and management. We highlight Usher [McNett 2007], a manager of clusters of VMs that could be plugged in as a SnowFlock resource manager. Emulab [Hibler 2008] uses virtualization to instantiate dozens of nodes for a network emulation experiment. Experiments are long-lived, statically sized, and instantiation of the nodes takes tens to hundreds of seconds.
Emulab uses Frisbee [Hibler 2003] as a multicast distribution tool to apply disk images to nodes during experiment setup. Frisbee and mcdist differ in their domain-specific aspects: e.g. Frisbee uses filesystem-specific compression, not applicable to memory state; conversely, mcdist's lockstep detection does not apply to Frisbee's disk distribution. We view the use of high-speed interconnects such as RDMA or Infiniband [Huang 2007], if available, as a viable alternative to multicasting.
7. Conclusion and Future Directions
In this work we introduced the primitive of VM fork and SnowFlock, our Xen-based implementation. Matching the well-understood semantics of stateful worker creation, VM fork provides cloud users and programmers the capacity to instantiate dozens of VMs in different hosts in sub-second time, with little runtime overhead and frugal use of cloud I/O resources. VM fork thus enables the simple implementation and deployment of services based on familiar programming patterns that rely on fork's ability to quickly instantiate stateful workers. While our evaluation focuses on interactive parallel Internet services, SnowFlock has broad applicability to other applications: flash crowd handling, execution of untrusted code components, instantaneous testing, etc.
SnowFlock makes use of two key observations. First, it is possible to drastically reduce the time it takes to clone a VM by copying only the critical state, and fetching the VM's memory image efficiently on demand. Moreover, simple modifications to the guest kernel significantly reduce network traffic by eliminating the transfer of pages that will be overwritten. For our applications these optimizations can drastically reduce the communication cost for forking a VM to a mere 40 MB for application footprints of 1 GB. Second, the locality of memory accesses across cloned VMs makes it beneficial to distribute VM state using multicast. This allows for the instantiation of a large number of VMs at a (low) cost similar to that of forking a single copy.
SnowFlock is an active open-source project [SnowFlock]. Our future work plans involve adapting SnowFlock to big-data applications. We believe there is fertile research ground studying the interactions of VM fork with data parallel APIs such as MapReduce [Dean 2004]. For example, SnowFlock's transient clones cannot be entrusted with replicating and caching data due to their ephemeral nature. Allowing data replication to be handled by hosts enables benefits such as big-data file system agnosticism for the cloned VMs.
SnowFlock's objective is performance rather than reliability. While memory-on-demand provides significant performance gains, it imposes a long-lived dependency on a single source of VM state. Another aspect of our future work involves studying how to push VM state in the background to achieve a stronger failure model, without sacrificing our speed of cloning or low runtime overhead.
Finally, we wish to explore applying the SnowFlock techniques to wide-area VM migration. This would allow, for example, "just-in-time" cloning of VMs over geographical distances to opportunistically exploit cloud resources. We foresee modifications to memory-on-demand to batch multiple pages on each update, replacement of IP-multicast, and use of content-addressable storage at the destination sites to obtain local copies of frequently used state (e.g. libc).
In closing, SnowFlock lowers the barrier of entry to cloud computing and opens the door for new cloud applications. VM fork provides a well-understood programming interface with substantial performance improvements; it removes the obstacle of long VM instantiation latencies, and greatly simplifies management of application state. SnowFlock thus places in the hands of users the full potential of the cloud model by simplifying the programming of applications that dynamically change their execution footprint. In particular, SnowFlock is of immediate relevance to users wishing to test and deploy parallel applications in the cloud.

Acknowledgments
We thank the anonymous reviewers and our shepherd Orran Krieger for their helpful suggestions. We thank Bianca Schroeder, Ryan Lilien, Roy Bryant, Olga Irzak, and Charlotte Lin for their feedback and help with the SnowFlock project. We also thank Mike Kozuch, Scott Rixner, and Y. Charlie Hu for their input on a draft of this paper.
This research was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) under grant number 261545-3, a Strategic Grant STPSC 356747-07, a Canada Graduate Scholarship, and an Undergraduate Student Research Award; by the Canada Foundation for Innovation (CFI-CRC) under Equipment Grant 202977; by the National Science Foundation (NSF) under grant number CNS-0509004; and by Platform Computing Inc.

References
[Altschul 1997] S. F. Altschul, T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25:3389–3402, 1997.
[Apodaka 2000] A. A. Apodaka and L. Gritz. Advanced RenderMan: Creating CGI for Motion Pictures. Academic Press, 2000.
[Aqsis] Aqsis: Open source 3D rendering solution adhering to the RenderMan standard. http://aqsis.org/.
[Barham 2003] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield. Xen and the Art of Virtualization. In Proc. 17th Symposium on Operating Systems Principles (SOSP), Bolton Landing, NY, October 2003.
[Braam 2002] P. J. Braam. The Lustre Storage Architecture, 2002. http://www.lustre.org/docs/lustre.pdf.
[Chase 2003] J. S. Chase, D. E. Irwin, L. E. Grit, J. D. Moore, and S. E. Sprenkle. Dynamic Virtual Clusters in a Grid Site Manager. In Proc. 12th Symposium on High Performance Distributed Computing (HPDC), Washington, DC, 2003.
[Clark 2005] C. Clark, K. Fraser, S. Hand, J. Gorm Hansen, E. Jul, C. Limpach, I. Pratt, and A. Warfield. Live Migration of Virtual Machines. In Proc. 2nd Symposium on Networked Systems Design and Implementation (NSDI), Boston, MA, May 2005.
[Cully 2008] B. Cully, G. Lefebvre, D. Meyer, M. Feeley, N. Hutchinson, and A. Warfield. Remus: High Availability via Asynchronous Virtual Machine Replication. In Proc. 5th Symposium on Networked Systems Design and Implementation (NSDI), San Francisco, CA, April 2008.
[Dean 2004] J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In Proc. 6th Symposium on Operating System Design and Implementation (OSDI), San Francisco, CA, December 2004.
[distcc] samba.org. distcc: a fast, free distributed C/C++ compiler. http://distcc.samba.org/.
[EBI] European Bioinformatics Institute. ClustalW2. http://www.ebi.ac.uk/Tools/clustalw2/index.html.
[EC2] Amazon.com. EC2: Amazon Elastic Compute Cloud. http://aws.amazon.com/ec2/.
[Emeneker 2007] W. Emeneker and D. Stanzione. Dynamic Virtual Clustering. In Proc. Cluster, Austin, TX, September 2007.
[Foster 2006] I. Foster, T. Freeman, K. Keahey, D. Scheftner, B. Sotomayor, and X. Zhang. Virtual Clusters for Grid Communities. In Proc. Cluster Computing and the Grid, Singapore, May 2006.
[Gentzsch 2001] W. Gentzsch. Sun Grid Engine: Towards Creating a Compute Power Grid. In Proc. 1st Symposium on Cluster Computing and the Grid, Brisbane, Australia, May 2001.
[Hadoop] Apache.org. Hadoop. http://hadoop.apache.org/core/.
[Hibler 2008] M. Hibler, R. Ricci, L. Stoller, J. Duerig, S. Guruprasad, T. Stack, K. Webb, and J. Lepreau. Large-scale Virtualization in the Emulab Network Testbed. In Proc. USENIX Annual Technical Conference, Boston, MA, June 2008.
[Hibler 2003] M. Hibler, L. Stoller, J. Lepreau, R. Ricci, and C. Barb. Fast, Scalable Disk Imaging with Frisbee. In Proc. USENIX Annual Technical Conference, San Antonio, TX, June 2003.
[Huang 2007] W. Huang, Q. Gao, J. Liu, and D. K. Panda. High Performance Virtual Machine Migration with RDMA over Modern Interconnects. In Proc. Cluster, Austin, TX, September 2007.
[Kozuch 2002] M. Kozuch and M. Satyanarayanan. Internet Suspend/Resume. In Proc. 4th Workshop on Mobile Computing Systems and Applications (WMCSA), Callicoon, NY, June 2002.
[Lagar-Cavilla 2007] H. A. Lagar-Cavilla, N. Tolia, E. de Lara, M. Satyanarayanan, and D. O'Hallaron. Interactive Resource-Intensive Applications Made Easy. In Proc. 8th International Middleware Conference, Newport Beach, CA, November 2007.
[McNett 2007] M. McNett, D. Gupta, A. Vahdat, and G. Voelker. Usher: An Extensible Framework for Managing Clusters of Virtual Machines. In Proc. 21st LISA, Dallas, TX, November 2007.
[MPICH] Argonne National Laboratory. MPICH2. http://www.mcs.anl.gov/research/projects/mpich2/.
[NCBI] National Center for Biotechnology Information. BLAST: Basic Local Alignment and Search Tool. http://blast.ncbi.nlm.nih.gov/Blast.cgi.
[Pixar] RenderMan. https://renderman.pixar.com/.
[Platform] Platform Computing. Platform EGO Home. http://my.platform.com/products/platform-ego-de.
[Quantlib] QuantLib: A Free/Open-source Library for Quantitative Finance. http://quantlib.org/index.shtml.
[Sapuntzakis 2002] C. P. Sapuntzakis, R. Chandra, B. Pfaff, J. Chow, M. S. Lam, and M. Rosenblum. Optimizing the Migration of Virtual Computers. In Proc. 5th Symposium on Operating Systems Design and Implementation (OSDI), Boston, MA, December 2002.
[SHRiMP] University of Toronto. SHRiMP: SHort Read Mapping Package. http://compbio.cs.toronto.edu/shrimp/.
[SnowFlock] University of Toronto. SnowFlock: Swift VM Cloning for Cloud Computing. http://sysweb.cs.toronto.edu/snowflock.
[Steinder 2007] M. Steinder, I. Whalley, D. Carrera, I. Gaweda, and D. Chess. Server Virtualization in Autonomic Management of Heterogeneous Workloads. In Proc. 10th Integrated Network Management (IM) Conference, Munich, Germany, 2007.
[Theimer 1985] M. Theimer, K. Lantz, and D. Cheriton. Preemptable Remote Execution Facilities for the V-System. In Proc. 10th Symposium on Operating Systems Principles (SOSP), Orcas Island, WA, December 1985.
[VMotion] VMware. VMotion: Migrate Virtual Machines with Zero Downtime. http://www.vmware.com/products/vi/vc/vmotion.html.
[Vrable 2005] M. Vrable, J. Ma, J. Chen, D. Moore, E. Vandekieft, A. Snoeren, G. Voelker, and S. Savage. Scalability, Fidelity and Containment in the Potemkin Virtual Honeyfarm. In Proc. 20th Symposium on Operating Systems Principles (SOSP), Brighton, UK, October 2005.
[Waldspurger 2002] C. A. Waldspurger. Memory Resource Management in VMware ESX Server. In Proc. 5th Symposium on Operating System Design and Implementation (OSDI), Boston, MA, 2002.
[Warfield 2005] A. Warfield, S. Hand, K. Fraser, and T. Deegan. Facilitating the Development of Soft Devices. In Proc. USENIX Annual Technical Conference, Anaheim, CA, April 2005.
[Wood 2007] T. Wood, P. Shenoy, A. Venkataramani, and M. Yousif. Black-box and Gray-box Strategies for Virtual Machine Migration. In Proc. 4th Symposium on Networked Systems Design and Implementation (NSDI), Cambridge, MA, April 2007.
[Zayas 1987] E. Zayas. Attacking the Process Migration Bottleneck. In Proc. 11th Symposium on Operating System Principles (SOSP), Austin, TX, November 1987.