Welcome!

Containers Expo Blog Authors: Mehdi Daoudi, Elizabeth White, John Rauser, Pat Romanski, Liz McMillan

Related Topics: Containers Expo Blog, @CloudExpo

Containers Expo Blog: Article

Why Your Storage Isn't Always to Blame

CRC Errors, Code Violation Errors, Class 3 Discards & Loss of Sync

Storage is often automatically pinpointed as the source of all problems. From System Admins, DBAs, Network guys to Application owners, all are quickly ready to point the figure at SAN Storage given the slightest hint of any performance degradation. Not really surprising though, considering it's the common denominator amongst all silos. On the receiving end of this barrage of accusation is the SAN Storage team, who are then subjected to hours of troubleshooting only to prove that their Storage wasn't responsible. On this circle goes until there reaches a point when the Storage team are faced with a problem that they can't absolve themselves of blame, even though they know the Storage is working completely fine. With array-based management tools still severely lacking in their ability to pinpoint and solve storage network related problems and with server based tools doing exactly that i.e. looking at the server, there really is little if not nothing available to prove that the cause of latency is a slow draining device such as a flapping HBA, damaged cable or failing SFP. Herein lies the biggest paradox in that 99% of the time when unidentifiable SAN performance problems do occur, they are usually linked to trivial issues such as a failing SFP.  In a 10,000 port environment, the million dollar question is ‘where do you begin to look for such a miniscule needle in such a gargantuan haystack?'

To solve this dilemma it's imperative to know what to look for and have the right tools to find them, enabling your SAN storage environment to be a proactive and not a reactive fire-fighting / troubleshooting circus. So what are some of the metrics and signs that should be looked for when the Storage array, application team and servers all report everything as fine yet you still find yourself embroiled in performance problems?

Firstly to understand the context of these metrics / signs and the make up of FC transmissions, let's use the analogy of a conversation. Firstly the Frames would be considered the words, the Sequences the sentences and an Exchange the conversation that they are all part of. With that premise it is important to first address the most basic of physical layer problems, namely Code Violation Errors. Code Violation Errors are the consequence of bit errors caused by corruption that occur in the sequence - i.e. any character corruption. A typical cause of this would be a failing HBA that would eventually start to suffer from optic degradation prior to its complete failure. I also recently experienced at one site Code Violation Errors when several SAN ports had been left enabled after their servers had been decommissioned. Some might think what's the problem if they have nothing connected to them? In fact this scenario was creating millions of Code Violation Errors causing a CPU overhead on the SAN switch and subsequent degradation. With mission critical applications connected to the same SAN switch, performance problems became rife and without the identification of the Code Violation Errors could have led to weeks of troubleshooting with no success.

The build up of Code Violation Errors become even more troublesome as they eventually lead to what is referred to as a Loss of Sync. A Loss of Sync is usually indicative of incompatible speeds between points and again this is typical of optic degradation in the SAN infrastructure. For example if an SFP is failing, its optic signal will degrade and hence will not be at for example the 4Gbps it's set at. Case point: a transmitting device such as a HBA is set at 4Gbps while the receiving end i.e. the SFP (unbeknownst to the end user) has degraded down to 1Gbps. Severe performance problems will occur as the two points constantly struggle with their incompatible speeds. Hence it's an imperative to be alerted of any Loss of Sync as ultimately they are also an indication of an imminent Loss of Signal i.e. when the HBA or SFP are flapping and are about to fail. This leads to the nightmare scenario of an unplanned path failure in your SAN storage environment and worse still a possible outage if failover cannot occur.

One of the biggest culprits and a sure-fire hit to resolving performance problems is to look for what are termed CRC errors. CRC Errors usually indicate some kind of physical problem within the FC link and are indicative of code violation errors that have led to consequent corruption inside the FC data frame. Usually caused by a flapping SFP or a very old / bent / damaged cable, once CRC errors are acknowledged by the receiver, the receiver would reject the request leaving the Frame having to be resent. For example as an analogy imagine a newspaper delivery boy, who while cycling to his destination loses some of the pages of the paper prior to delivery. Upon delivery the receiver would request for the newspaper to be redelivered with the missing pages. This would entail the delivery boy having to cycle back to find the missing pages and bring back the newspaper as a whole. In the context of a CRC error a Frame that should typically take only a few milliseconds to deliver could take up to 60 seconds in being rejected and resent. Such response times can be catastrophic to a mission critical application and it's underlying business. By gaining an insight into CRC errors and their root cause one can immediately pinpoint which bent cable or old SFP is responsible and proactively replace them long before they start to cause poor application response times or even worse a loss to your business.

The other FC SAN gremlin is what is termed a Class 3 discard. Of the various services of data transport defined by the Fibre Channel ANSI Standard, the most commonly used is Class 3. Ideal for high throughput, Class-3 is essentially a datagram service based on frame switching and is a connectionless service. Class 3's main advantage comes from not giving an acknowledgement that a frame has been rejected or busied by a destination device or Fabric. The benefits of this are that it firstly significantly reduces the overhead on the transmitting device and secondly allows for more bandwidth availability for transmission which would otherwise be reduced. Furthermore the lack of acknowledgements removes the potential delays between devices caused by round-trips of information transfers. As for data integrity, Class 3 Flow control has this handled by higher-level protocols such as TCP due to Fibre Channel not checking the corrupted or missing frames. Hence any discovery of a corrupted packet by the higher-level protocol on the receiving device instantly initiates a retransmission of the sequence. All of this sounds great until the non-acknowledgement of rejected frames starts to also bring about Class 3's disadvantage. This is that inevitably a Fabric will become busy with traffic and will consequently discard frames, hence the name Class 3 discards. Due to this the receiving device's higher-level protocol's subsequent request for retransmission of sequences will then degrade the device and fabric throughput.

Another indication of Class 3 discards are zoning conflicts where a frame has been transmitted and cannot reach a destination, hence concluding in the SAN initiating a Class 3 discard. This is caused by either legacy or zoning mistakes where for example a decommissioned Storage system was not unzoned from a server or vice versa leading to continuous frames being discarded and degraded throughput as sequences are retransmitted. This then results in performance problems, potential application degradation and automatic finger pointing at the Storage System for a problem that can't automatically be identified. By resolving the zoning conflict and spreading the load of the SAN throughput across the right ports, the heavy traffic or zoning issues which cause the Class 3 discards can be quickly removed bringing immediate performance and throughput improvements. By gaining an insight into the occurrence and amount of Class 3 discards, huge performance problems can be quickly remediated before they occur and thus another reason as to why the Storage shouldn't automatically be blamed.

These are just some of the metrics / signs to look for which can ultimately save you from weeks of troubleshooting and guessing. By first acknowledging these metrics, identifying when they occur and proactively eliminating them, the SAN storage environment will quickly evolve and transform into a healthy, proactive and optimized one. Furthermore by eliminating each of these issues you also empower yourself by eliminating their consequent problems such as application slowdown, poor response times, unplanned outages and long drawn out troubleshooting exercises which eventually lead to fingerpointing fights. Ideally what will occur is a paradigm shift where instead of application owners complaining to the Storage team, the Storage team will proactively identify problems prior to their existence. Here lies the key to making the ‘always blaming the Storage' syndrome a thing of the past.

More Stories By Archie Hendryx

SAN, NAS, Back Up / Recovery & Virtualisation Specialist.

@ThingsExpo Stories
SYS-CON Events announced today that CA Technologies has been named "Platinum Sponsor" of SYS-CON's 21st International Cloud Expo®, which will take place October 31-November 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. CA Technologies helps customers succeed in a future where every business - from apparel to energy - is being rewritten by software. From planning to development to management to security, CA creates software that fuels transformation for companies in the applic...
SYS-CON Events announced today that IBM has been named “Diamond Sponsor” of SYS-CON's 21st Cloud Expo, which will take place on October 31 through November 2nd 2017 at the Santa Clara Convention Center in Santa Clara, California.
In his session at Cloud Expo, Alan Winters, an entertainment executive/TV producer turned serial entrepreneur, presented a success story of an entrepreneur who has both suffered through and benefited from offshore development across multiple businesses: The smart choice, or how to select the right offshore development partner Warning signs, or how to minimize chances of making the wrong choice Collaboration, or how to establish the most effective work processes Budget control, or how to ma...
SYS-CON Events announced today that Cloud Academy named "Bronze Sponsor" of 21st International Cloud Expo which will take place October 31 - November 2, 2017 at the Santa Clara Convention Center in Santa Clara, CA. Cloud Academy is the industry’s most innovative, vendor-neutral cloud technology training platform. Cloud Academy provides continuous learning solutions for individuals and enterprise teams for Amazon Web Services, Microsoft Azure, Google Cloud Platform, and the most popular cloud com...
With major technology companies and startups seriously embracing Cloud strategies, now is the perfect time to attend 21st Cloud Expo October 31 - November 2, 2017, at the Santa Clara Convention Center, CA, and June 12-14, 2018, at the Javits Center in New York City, NY, and learn what is going on, contribute to the discussions, and ensure that your enterprise is on the right path to Digital Transformation.
We build IoT infrastructure products - when you have to integrate different devices, different systems and cloud you have to build an application to do that but we eliminate the need to build an application. Our products can integrate any device, any system, any cloud regardless of protocol," explained Peter Jung, Chief Product Officer at Pulzze Systems, in this SYS-CON.tv interview at @ThingsExpo, held November 1-3, 2016, at the Santa Clara Convention Center in Santa Clara, CA
"When we talk about cloud without compromise what we're talking about is that when people think about 'I need the flexibility of the cloud' - it's the ability to create applications and run them in a cloud environment that's far more flexible,” explained Matthew Finnie, CTO of Interoute, in this SYS-CON.tv interview at 20th Cloud Expo, held June 6-8, 2017, at the Javits Center in New York City, NY.
In his session at @ThingsExpo, Eric Lachapelle, CEO of the Professional Evaluation and Certification Board (PECB), provided an overview of various initiatives to certify the security of connected devices and future trends in ensuring public trust of IoT. Eric Lachapelle is the Chief Executive Officer of the Professional Evaluation and Certification Board (PECB), an international certification body. His role is to help companies and individuals to achieve professional, accredited and worldwide re...
Amazon started as an online bookseller 20 years ago. Since then, it has evolved into a technology juggernaut that has disrupted multiple markets and industries and touches many aspects of our lives. It is a relentless technology and business model innovator driving disruption throughout numerous ecosystems. Amazon’s AWS revenues alone are approaching $16B a year making it one of the largest IT companies in the world. With dominant offerings in Cloud, IoT, eCommerce, Big Data, AI, Digital Assista...
When growing capacity and power in the data center, the architectural trade-offs between server scale-up vs. scale-out continue to be debated. Both approaches are valid: scale-out adds multiple, smaller servers running in a distributed computing model, while scale-up adds fewer, more powerful servers that are capable of running larger workloads. It’s worth noting that there are additional, unique advantages that scale-up architectures offer. One big advantage is large memory and compute capacity...
Internet of @ThingsExpo, taking place October 31 - November 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA, is co-located with 21st Cloud Expo and will feature technical sessions from a rock star conference faculty and the leading industry players in the world. The Internet of Things (IoT) is the most profound change in personal and enterprise IT since the creation of the Worldwide Web more than 20 years ago. All major researchers estimate there will be tens of billions devic...
IoT solutions exploit operational data generated by Internet-connected smart “things” for the purpose of gaining operational insight and producing “better outcomes” (for example, create new business models, eliminate unscheduled maintenance, etc.). The explosive proliferation of IoT solutions will result in an exponential growth in the volume of IoT data, precipitating significant Information Governance issues: who owns the IoT data, what are the rights/duties of IoT solutions adopters towards t...
With the introduction of IoT and Smart Living in every aspect of our lives, one question has become relevant: What are the security implications? To answer this, first we have to look and explore the security models of the technologies that IoT is founded upon. In his session at @ThingsExpo, Nevi Kaja, a Research Engineer at Ford Motor Company, discussed some of the security challenges of the IoT infrastructure and related how these aspects impact Smart Living. The material was delivered interac...
No hype cycles or predictions of zillions of things here. IoT is big. You get it. You know your business and have great ideas for a business transformation strategy. What comes next? Time to make it happen. In his session at @ThingsExpo, Jay Mason, Associate Partner at M&S Consulting, presented a step-by-step plan to develop your technology implementation strategy. He discussed the evaluation of communication standards and IoT messaging protocols, data analytics considerations, edge-to-cloud tec...
The Internet giants are fully embracing AI. All the services they offer to their customers are aimed at drawing a map of the world with the data they get. The AIs from these companies are used to build disruptive approaches that cannot be used by established enterprises, which are threatened by these disruptions. However, most leaders underestimate the effect this will have on their businesses. In his session at 21st Cloud Expo, Rene Buest, Director Market Research & Technology Evangelism at Ara...
Artificial intelligence, machine learning, neural networks. We’re in the midst of a wave of excitement around AI such as hasn’t been seen for a few decades. But those previous periods of inflated expectations led to troughs of disappointment. Will this time be different? Most likely. Applications of AI such as predictive analytics are already decreasing costs and improving reliability of industrial machinery. Furthermore, the funding and research going into AI now comes from a wide range of com...
SYS-CON Events announced today that Enzu will exhibit at SYS-CON's 21st Int\ernational Cloud Expo®, which will take place October 31-November 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. Enzu’s mission is to be the leading provider of enterprise cloud solutions worldwide. Enzu enables online businesses to use its IT infrastructure to their competitive advantage. By offering a suite of proven hosting and management services, Enzu wants companies to focus on the core of their ...
SYS-CON Events announced today that MobiDev, a client-oriented software development company, will exhibit at SYS-CON's 21st International Cloud Expo®, which will take place October 31-November 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. MobiDev is a software company that develops and delivers turn-key mobile apps, websites, web services, and complex software systems for startups and enterprises. Since 2009 it has grown from a small group of passionate engineers and business...
SYS-CON Events announced today that GrapeUp, the leading provider of rapid product development at the speed of business, will exhibit at SYS-CON's 21st International Cloud Expo®, which will take place October 31-November 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. Grape Up is a software company, specialized in cloud native application development and professional services related to Cloud Foundry PaaS. With five expert teams that operate in various sectors of the market acr...
SYS-CON Events announced today that Ayehu will exhibit at SYS-CON's 21st International Cloud Expo®, which will take place on October 31 - November 2, 2017 at the Santa Clara Convention Center in Santa Clara California. Ayehu provides IT Process Automation & Orchestration solutions for IT and Security professionals to identify and resolve critical incidents and enable rapid containment, eradication, and recovery from cyber security breaches. Ayehu provides customers greater control over IT infras...