Close Window

Print Story

Next-Generation Linux Clusters

At a high level, a cluster is an aggregation of multiple standalone computers (nodes) linked together through a high-speed connection to create a single shared computing resource. A key benefit of this distributed processing architecture is that complex computations can be broken down and run serial or parallel among the individual nodes, resulting in a dramatic improvement in the time required to process large problems and complex tasks. For that reason, clusters are used for CPU-intensive jobs where massive processing power is required, such as when running simulations, computer-aided design, and rendering.

Clustered system architecture has steadily gained in popularity over the last decade for both high-performance and availability applications, partly because of the high return on investment (ROI) gained by using clusters. One reason for this high ROI lies with starting with dramatically lower capital costs. Most clusters are based on Linux, which provides a low-overhead licensing structure. Another reason many organizations choose clusters is the lower hardware costs of the most popular type of computational-intensive clusters, Beowulf clusters, given the economies of scales of x86-based servers. As will be seen as one reads this article, maximizing the ROI of Linux clustering is all about vastly simplifying the deployment and management of clusters to drive productivity up and operational expense down.

Thomas Sterling and Donald Becker developed the Beowulf architecture while at NASA. Beowulf clusters use standard commercial off-the-shelf (COTS) hardware, usually identical computers, running Linux. The nodes are networked into a small TCP/IP LAN and have libraries and programs installed that let processing be shared among them.

This type of clustered system architecture combines the power of many COTS systems to form a single more powerful system and can reach and exceed the performance of traditional SMP and vector-based supercomputers. But it's also far less expensive to reach that same performance level. For example, $1 million can buy well over 1,000 processors using leading-edge AMD- or Intel-based servers. But you can't even approach that "bang for a buck" with shared-memory multiprocessor (SMP) systems. The cost savings means individual departments can afford their own "supercomputers." By some estimates, overall costs for a COTS-based cluster may be a third to a tenth the price of a traditional supercomputer.

Using COTS technology also helps users take advantage of several trends, including improved technology (like microprocessor performance and memory density) and commodity networking, especially gigabit Ethernet, soon to be followed by 10Gb, (which makes it possible to design distributed-memory systems with tolerable bandwidths and latencies). The result is a compelling order-of-magnitude cost savings for a very powerful and far more accessible (for the typical user) high-performance computing (HPC) center.

Next-Generation Clusters
Many scientific and technical applications moved to first-generation clusters in the late 1990s, including fluid dynamics, stress and deformation analysis, seismic and reservoir modeling, large-scale operational research, molecular and protein modeling, and financial analytics. More recently, the digital content creation market has become primarily Linux cluster-based. Before clusters were widely used, these applications ran on vector computers, on large-scale SMPs, and, to a lesser extent, on massively parallel systems (MPPs). These classes of machines shared high initial investment costs and training of personnel and maintenance often doubled or tripled the initial capital expenditure. Clusters significantly reduced capital costs as well as ongoing administration costs.

There was another more substantial difference between Linux clusters and other options available at the time, though. The first Linux clusters were designed so that end users could gain full control of their tools and have access to supercomputer performance without the costs or burden of dealing with such complex and expensive platforms. Biochemists, fluid dynamics experts, mechanical engineers, and others in research labs and universities were at the forefront of this silent revolution - not computer scientists. Linux clusters were being designed by the users for the user.

By the end of the '90s, however, Linux clusters started to become interesting to Fortune 500 companies as well, expanding beyond the relatively small arena of research and education. However, the first-generation cluster architecture that came out of the Beowulf project at NASA only addressed the problem of getting machines to work together, making sure the networking code worked, and building the communication library layer. There were several areas where first-generation clusters needed improvement, not the least of which was ease of use for the less technical user. (Figure 1)

One key improvement next-generation clusters had over first-generation designs was the illusion of a single system that makes management easier. In a necessarily simplified fashion, some key elements that were required to set up an easy-to-use and administer cluster by creating the illusion of a single system include:

Some non-commercial cluster systems create a partial single system illusion by requiring network virtual memory or a consistent global file system, or implementing transparent process migration. However, these designs handle failure poorly because the system must go through a time-consuming lock recovery process or even kill all processes related to the failed machine if any of the nodes fail.

Because end users built the first generation of clusters, they continued to build or grow their own ad hoc clusters as the needs of their organizations dictated. Ad hoc do-it-yourself clusters are cumbersome to use and generally adapted only to batch processing. But usage habits change. HPC clusters have been effectively adapted for batch computing environments but, until recently, have not been well suited for interactive use. This is why batch schedulers have been so prevalent - the issue for users is that they are presented in a traditional cluster with 100 separate servers for a 100-node cluster. Managing jobs and their related processes, data, and other things means manually interacting with all of those machines, a tedious and time-consuming process.


This complexity is even worse for the admin who must set up and maintain the cluster. The admin is presented with 100 separate machines that must be loaded with the operating system and cluster management software and this can take five-30 minutes for each machine if everything goes well. Then all of the servers must be configured for access, users, security, and coordinating as a cluster. Extensive scripting is usually required and different people must maintain these scripts over time adding overhead and cost.

What's more many do-it-yourself solutions are simply open source projects and lack important formal support and documentation. There are many associated costs in terms of time and resources that project-level solutions carry over the lifetime of the cluster. In addition, most clusters get different parts of the cluster stack from several different suppliers so customers face the additional challenge of who to turn to for support when problems occur.

It's far more cost-effective to use a commercially developed and fully supported solution that can simplify the management of large pools of servers, leveraging the best of open source and value-add proprietary software, and wrap it all together in a well-architected, pre-integrated, documented, and supported solution. And that's what happened after the initial development of clusters. Next-generation clusters architected from the ground up to be "commercial-grade" don't face any of these issues listed above, benefiting both commercial and non-commercial users.

In addition, as the market matured and more complete solutions have come available, the value proposition for Linux clusters has expanded to include significant total cost of ownership (TCO) savings spanning hardware, software, and support costs; flexibility in configuration and upgrade options; freedom from constraints of single-vendor development schedules and support; greater flexibility through open source software customization; and rapid performance advances in industry-standard processor, storage, and networking technologies.

Maximizing Return on Investment
Next-generation clusters, almost all of which are commercially developed, offer several specific areas of additional value compared to previous generations. This is due in part to optimization of the initial first-generation Beowulf architecture into software that's properly architected, very tightly integrated to work out-of-the-box, and offers additional features that vastly simplify the deployment and management of large pools of servers. By examining the benefits of some of the architectural concepts of next-generation systems for compute clustering in more detail, we can understand their advantages.

Faster, Better, Cheaper Deployment
It's dramatically easier and faster to provision and manage large server pools when you don't install a full operating environment on the hard disks, the path chosen by some leading-edge cluster software vendors. A full OS installation to the disk drive is relatively slow, generally taking 15 to 30 minutes to complete depending on a variety of factors. This can be scripted or even parallelized but it'll still take hours to provision an entire cluster once all the preparations are completed.

In some next-generation systems, though, the installation is completely integrated and needs to be done only once and only on the designated Master node, regardless of the size of the cluster. The compute nodes are then auto-provisioned with a cluster-aware pre-configured operating system environment directly to memory and no further configuration is generally required since the cluster software sets up the entire configuration. This process takes approximately 20 seconds for each node and they are ready to run.

The value of rapid stateless provisioning is even more apparent during ongoing cluster operation when compute nodes fail, are updated, or are re-provisioned or new nodes are added. The stateless approach is effortless and nimble in comparison to a traditional local full-install. A full disk-based install on a replacement or new server is going to take the same 15-30 minutes as an original install and will generally have to be scheduled by one's IT staff, when they can get to it. This increases the time required to restore the cluster to full operation.

In a next-generation architecture, software updates only need be applied to the Master, which will then auto-update the compute nodes quickly and on-demand. The method used in most architectures, however, requires elaborate scripting at best and increases the chance that something can go wrong part-way through. This will result in another problem, version skew. The correctness of the applications that run on multiple compute nodes is often dependent on everything being precisely the same on each processing element. The tiniest difference in a driver or library can render the results useless after days of calculation - if the application runs at all.

Depending on local full-install operating environments is one sure way to create problems of version skew. In the end, you may end up spending hours reinstalling everything fresh and getting back to a known state. Local copies get stale and inconsistent causing wasted time and costly rework. However, the latest generation of cluster software generally provisions to the nodes exactly what is running on the master and inherently manages dependencies and versioning so there's guaranteed consistency.

Greater Performance
One of the immediate benefits of the lightweight compute nodes mentioned as key components of next-generation clusters is performance. Part of the reason that modern compute nodes can be provisioned in less than 20 seconds is that the OS is significantly smaller. Since compute cluster nodes aren't general-purpose machines, they don't need most of the software provided in a full Linux distribution.

Related to performance is improved memory use, especially important if insufficient memory forces an application to swap space. On a typical cluster compute node, next-generation clusters can begin with a very small footprint of memory (16MB) and dynamically add support only as required as compared with 400MB for a full static Linux installation.

A more significant issue is the scheduling latencies that many of the standard Linux services can introduce over long-running applications. It's been shown that these scheduling latencies can cut the cluster performance of real-world applications by 5%-50% as cluster configurations scale out and they can be impractical to isolate since they are very application-dependent.

Enhanced Scalability
Scale-out performance is greatly improved due to the unique design of these new architectures. Some designs employ a single primary daemon on compute slaves and leverage this daemon to run jobs, get standard I/O, and logs and statistics out on those slaves, all from the Master node. Enhanced scalability occurs because compute nodes can be added on-demand and common tools can be optimized to leverage the support already built into the architecture. As long as the networking and storage infrastructure is designed correctly, there's no bottleneck created in the system to compromise scalability.

Simpler, Faster Administration by Design
Another way in which modern clusters enable scalability is tied to their ease of use. By using virtualization, they make large pools of servers act and feel as if they were a single consistent virtual system. At least one next-generation cluster software package creates "single system image" behavior with the Linux you already know. It does this by extending the Linux configuration on the Master node to have a single unified process space. From both the administrator and the user's point-of-view, a 100-node cluster with 400 processors appears very much like a 400-processor machine.

So the compute servers are fully transparent and directly accessible if needs be. However, if you're interested in the compute capacity presented at the single Master node, you need look no further than this one machine. The simplest way to convey how this might manifest itself to the typical user is to offer the example of the everyday task of issuing the ubiquitous "ps" process list command. After issuing the command, what you get back is a listing of all processes running on all machines as if they were just one machine. You can still tell which processes are running where in the cluster with a simple addition to the command, but only if you really care to know it. Other standard Linux commands work in the same intuitive way as on a single machine.

When you want to add a user and set up passwords, you only have to do it on the Master. When you want to run a job, you run it on the Master and simply tell it how many processors you need (even non-MPI jobs). If you need to terminate a job, it's done on the Master node and automatically removed cleanly on the compute node(s) it actually resides on. Of course, you can run jobs or general commands on specific nodes if you have to. If you need to see the vital statistics of load, memory usage, disk usage, etc. on any or all nodes, add one command line or GUI invocation on the Master node and you get it.

Conclusion
The latest generation of commercial clusters re-architects the foundation of cluster architecture using several well-recognized concepts in a unique combination that delivers virtualized cluster systems that make large pools of servers appear and act like a single consistent virtual machine. A properly architected solution that leverages stateless provisioning and a lightweight compute operating environment that simulates the appearance and end-user experience of a single virtual system has a tremendous ripple effect on rapid provisioning, manageability, scalability, security, and reliability within the cluster.

The result is an elegantly simple and powerful new paradigm of virtualized clustered computing. This new paradigm eliminates the need for multiple levels of cost and support. It also dramatically increases efficiency and reduces operating costs while delivering a dependable HPC service to organizations in highly competitive business environments.

© 2008 SYS-CON Media Inc.