Containers Expo Blog Authors: Liz McMillan, Yeshim Deniz, Elizabeth White, Zakia Bouachraoui, Pat Romanski


Virtual Packet Loss: The Silent Killer of Network Performance

With the right virtualization performance management tools in place, IT managers can better utilize their physical resources

As many enterprise IT managers have come to realize, server virtualization is both an operational blessing and a curse.

On the upside, virtualization enables IT managers to consolidate servers, create efficiencies, and reduce infrastructure costs. It's the foundation of the next-generation, highly dynamic data center - an agile infrastructure capable of supporting today's always-on and ever-demanding business operations - and a pathway to an internal cloud.

On the downside, virtualization often makes already overly complex IT environments even more so.

Virtual machine (VM) sprawl is a well-documented problem. With VMs so easy to spin up, stories are rampant about IT managers who have lost track of how many VMs - active or otherwise - are out and about in their environments. As they struggle to cope with the problem, some have brought virtualization deployments to a standstill. Industry watchers peg the average rate of virtualization at about 30 to 40% of the server infrastructure, where it will hover until management becomes less cumbersome, they say.

Also much discussed is the negative trickle-down effect virtualization can have on the underlying network and storage infrastructures. Even when following suggested best practices, IT managers too often discover that they've underprovisioned the networked storage they need to support virtual servers, for example. And in the absence of sophisticated, virtualization-oriented tools, they often don't recognize that network bandwidth is going to be in short supply until VM workloads start grabbing it away from applications as both types of traffic traverse the data center.

Moreover, because VM provisioning tools tend to focus on overall CPU load as the main indicator of resource scarcity, they overlook other effects, such as clock skew, that can greatly affect the performance of virtualized applications. Clock skew occurs when the hypervisor - essentially a scheduler that sits below the operating system - attempts to schedule competing events from multiple virtualized systems simultaneously. In reality, only one event per physical hardware thread can hit the requested deadline precisely; the rest are scheduled as closely as possible. The deviation from the requested deadline is the skew, and it is especially challenging to identify and control. As a result, IT managers often end up underprovisioning server resources for the number of VMs in use. And when too many VMs are tapping into too few physical resources, performance degrades - a situation that happens quite often given the aforementioned problem of VM sprawl.
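To make the skew concrete, here is a minimal sketch of what happens when several VMs ask for a timer event at the same instant on one hardware thread. It is a toy model, not a description of any particular hypervisor: the function name `timer_skews`, the fixed per-event service time, and the strictly serial dispatch order are all assumptions made for illustration.

```python
def timer_skews(requested_deadline_ms, num_events, service_time_ms):
    """Skew per event when events sharing one deadline are dispatched serially.

    Toy model (assumption): one hardware thread, each event takes a fixed
    service time, and events run back to back in arrival order.
    """
    skews = []
    for position in range(num_events):
        # The first event starts on time; each later one waits for its predecessors.
        actual_start_ms = requested_deadline_ms + position * service_time_ms
        skews.append(actual_start_ms - requested_deadline_ms)
    return skews


# Four VMs request a timer at t = 100 ms; each dispatch costs 0.5 ms.
print(timer_skews(requested_deadline_ms=100.0, num_events=4, service_time_ms=0.5))
# [0.0, 0.5, 1.0, 1.5]
```

Only the first event meets its deadline. In this model the worst-case skew grows linearly with the number of competing VMs, which is why an overall CPU load figure alone can hide the problem.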

Underprovisioning server resources for VM loads leads to a little-recognized but insidious virtualization performance problem called virtual packet loss, or VPL.

As the hypervisor slices up CPU time for an overload of VMs, excessive context switching occurs and VM clocks get skewed. Each individual instance is unable to fulfill the strict deadlines required by TCP, which might result in an erroneous reduction in TCP throughput.

When packets are greatly delayed, the oversubscription problem reaches a tipping point. The network protocol stack, namely TCP, carefully times each round-trip on the connection. If the variance in these round-trips is low, as would be expected on a high-speed, locally connected network, the threshold for inferring packet loss is very sensitive, and congestion might be falsely inferred. In an attempt to relieve this non-existent congestion, the sender drastically reduces its throughput and retransmits the packets, sometimes resulting in a complete stall of data for at least a second. This is known as a retransmission timeout, or RTO. The problem only gets worse as oversubscription delays packets further, with the sender waiting an exponentially longer period of time with each RTO. These delays can have a devastating effect on business-critical transactions traversing the network. As you can see, VPL looks almost exactly like congestion-driven physical packet loss. Because throughput is reduced, the network appears to be in horrific shape, dropping packets left and right, even though no loss is actually occurring.
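The timing behavior described above can be sketched with the retransmission timer calculation from RFC 6298, the standard that specifies how TCP derives its RTO from round-trip samples. This is a simplified illustration rather than a real TCP stack: the class name `RtoEstimator` is hypothetical, and the 1-second floor follows the RFC's recommendation, while actual operating systems may use a lower floor.

```python
class RtoEstimator:
    """Simplified RTO calculation following RFC 6298 (illustrative only)."""

    ALPHA = 1 / 8   # gain for the smoothed RTT
    BETA = 1 / 4    # gain for the RTT variance
    K = 4           # variance multiplier

    def __init__(self, min_rto=1.0, max_rto=60.0, clock_granularity=0.001):
        self.min_rto = min_rto
        self.max_rto = max_rto
        self.granularity = clock_granularity
        self.srtt = None
        self.rttvar = None
        self.rto = min_rto  # RFC 6298 starts at 1 second before any sample

    @property
    def unclamped(self):
        """SRTT + max(G, K * RTTVAR), before the minimum RTO is applied."""
        return self.srtt + max(self.granularity, self.K * self.rttvar)

    def on_rtt_sample(self, rtt):
        if self.srtt is None:
            # First measurement initializes both estimators.
            self.srtt = rtt
            self.rttvar = rtt / 2
        else:
            # Variance is updated first, using the previous SRTT.
            self.rttvar = (1 - self.BETA) * self.rttvar + self.BETA * abs(self.srtt - rtt)
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * rtt
        self.rto = min(self.max_rto, max(self.min_rto, self.unclamped))

    def on_timeout(self):
        # Exponential backoff: each expiry doubles the wait, up to the cap.
        self.rto = min(self.rto * 2, self.max_rto)


est = RtoEstimator()
for _ in range(20):
    est.on_rtt_sample(0.0005)   # steady 0.5 ms round-trips on a local network

backoff = []
for _ in range(4):
    backoff.append(est.rto)
    est.on_timeout()
print(backoff)  # [1.0, 2.0, 4.0, 8.0]
```

With steady sub-millisecond round-trips, the variance term shrinks toward zero, so the estimate of a "normal" round-trip becomes very tight and only the timer floor keeps the RTO at one second. Once a scheduling delay pushes an acknowledgment past that timer, the sender treats the segment as lost, and each further expiry doubles the stall.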

Although oversubscribing the server CPU is a prime instigator for VPL, underprovisioning at the storage and network layers is yet another culprit. Similar scheduling delays will occur, for example, if the connection between the hypervisor and the network-attached back-end storage it's trying to access is too slow.

Frankly, VPL can wreak havoc within a virtual data center.

Because virtualization creates opacity within the IT infrastructure, server, network and application teams often find themselves pitted against one another as they struggle to resolve performance issues. This phantom packet loss manifests itself as network-level performance degradation, and network administrators are often tasked with hunting down the problem. This task turns into an exercise in futility because the infrastructure monitoring tools they rely on to collect metrics from switches and routers will find no evidence of packet loss. They toss the ball back to the virtualization team saying, "Look. The problem has to be on your end because the network is clean." The server administrators in turn say, "No. You must be missing something. Look again." And on and on the argument goes while problem resolution remains frustratingly elusive. Sound familiar?

Some IT managers have been contending with degradation for so long - in some cases since firing up their first virtual servers - that they've even come to accept that a virtualization infrastructure simply runs more slowly than its physical counterpart. Because of VPL and other virtualization performance problems, they see less goodput: retransmissions mean that more packets are required in the virtual environment than in the physical one to transmit the same amount of data.
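A rough way to put numbers on goodput is to compare the data that actually reaches the application with everything that had to be sent to get it there. The sketch below assumes you already have segment and byte counts from a capture or monitoring tool; the function names and the sample figures are illustrative, not measurements from any real environment.

```python
def goodput_bps(unique_payload_bytes, elapsed_seconds):
    """Useful application data delivered per second, in bits per second."""
    return unique_payload_bytes * 8 / elapsed_seconds


def retransmission_overhead(total_segments_sent, unique_segments):
    """Extra segments sent, as a fraction of the segments actually needed."""
    return (total_segments_sent - unique_segments) / unique_segments


# Hypothetical transfer: 1,000,000 payload bytes delivered in 2 seconds,
# using 1,100 segments where 1,000 would have been enough.
print(goodput_bps(1_000_000, 2))            # 4000000.0
print(retransmission_overhead(1100, 1000))  # 0.1
```

An overhead of 0.1 means one extra segment for every ten useful ones. If the switches and routers report no drops while this figure climbs, that mismatch is a hint that the retransmissions may be virtual rather than physical in origin.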

VPL can be a perniciously intermittent problem, one that might affect some systems but not others or some systems just some of the time. Consider the amount of traffic flowing through the 10-Gigabit Ethernet links becoming commonplace in the enterprise data center, and you know in a heartbeat that tracking down VPL manually is an impossible task. Unfortunately, conventional monitoring and management tools don't offer any help tracking down VPL either.

Today's highly complex IT environments demand a new generation of tools that enable greater visibility, especially into the virtualization layer. With such tools in hand, IT managers can approach virtualization management in a proactive manner, detecting VM performance problems before they turn into big problems like VPL.

To curb VM sprawl, IT managers should be able to autodiscover VMs - of any ilk - across the application environment. To detect and get to the root cause of VPL, IT managers should have tools capable of performing full-stream reassembly and full-content analysis as well as the ability to conduct sophisticated Layer 2-7 traffic analysis, for example. Moreover, they should be able to run automatic trending and baselining, capabilities that will help them determine when to spin up new VMs and head off potential capacity problems.
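As a sketch of what automatic baselining might look like in its simplest form, the snippet below keeps a rolling window of a metric (here, a retransmission rate in percent) and flags a sample that rises well above the recent norm. The `Baseline` class, the window size, and the three-standard-deviation threshold are assumptions chosen for illustration; commercial tools are likely to use more sophisticated models.

```python
from collections import deque
from statistics import mean, pstdev


class Baseline:
    """Rolling baseline that flags samples far above the recent average."""

    def __init__(self, window=12, k=3.0, min_sigma=0.01, warmup=3):
        self.samples = deque(maxlen=window)  # oldest samples drop off automatically
        self.k = k
        self.min_sigma = min_sigma  # avoids a zero-width band on flat data
        self.warmup = warmup

    def observe(self, value):
        """Return True if value exceeds the baseline, then record it."""
        anomalous = False
        if len(self.samples) >= self.warmup:
            mu = mean(self.samples)
            sigma = max(pstdev(self.samples), self.min_sigma)
            anomalous = value > mu + self.k * sigma
        self.samples.append(value)
        return anomalous


baseline = Baseline()
readings = [0.10, 0.12, 0.11, 0.10, 0.12, 2.50]  # retransmission rate, percent
flags = [baseline.observe(r) for r in readings]
print(flags)  # [False, False, False, False, False, True]
```

The same pattern could be applied to other per-VM metrics, such as round-trip time or CPU ready time, to spot a capacity problem before it shows up as VPL.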

With the right virtualization performance management tools in place, IT managers can better utilize their physical resources, redistributing VM loads across server clusters to eliminate VPL and reach greater efficiencies in the data center. Goodput, after all, should apply to the virtual as much as the physical world.

More Stories By Tanya Bragin

Tanya Bragin is a Senior Product Manager at ExtraHop Networks. Previously, she was a Senior Consultant with Deloitte & Touche Enterprise Risk Services, deploying application performance management solutions for Fortune 100 clients. She received her Masters in Computer Science from the University of Washington with a concentration on designing large-scale service-oriented systems.
