Welcome!

Containers Expo Blog Authors: Elizabeth White, Liz McMillan, Pat Romanski, Yeshim Deniz, Amit Gupta

Related Topics: Containers Expo Blog, @CloudExpo

Containers Expo Blog: Article

Why Your Storage Isn't Always to Blame

CRC Errors, Code Violation Errors, Class 3 Discards & Loss of Sync

Storage is often automatically pinpointed as the source of all problems. From System Admins, DBAs, Network guys to Application owners, all are quickly ready to point the figure at SAN Storage given the slightest hint of any performance degradation. Not really surprising though, considering it's the common denominator amongst all silos. On the receiving end of this barrage of accusation is the SAN Storage team, who are then subjected to hours of troubleshooting only to prove that their Storage wasn't responsible. On this circle goes until there reaches a point when the Storage team are faced with a problem that they can't absolve themselves of blame, even though they know the Storage is working completely fine. With array-based management tools still severely lacking in their ability to pinpoint and solve storage network related problems and with server based tools doing exactly that i.e. looking at the server, there really is little if not nothing available to prove that the cause of latency is a slow draining device such as a flapping HBA, damaged cable or failing SFP. Herein lies the biggest paradox in that 99% of the time when unidentifiable SAN performance problems do occur, they are usually linked to trivial issues such as a failing SFP.  In a 10,000 port environment, the million dollar question is ‘where do you begin to look for such a miniscule needle in such a gargantuan haystack?'

To solve this dilemma it's imperative to know what to look for and have the right tools to find them, enabling your SAN storage environment to be a proactive and not a reactive fire-fighting / troubleshooting circus. So what are some of the metrics and signs that should be looked for when the Storage array, application team and servers all report everything as fine yet you still find yourself embroiled in performance problems?

Firstly to understand the context of these metrics / signs and the make up of FC transmissions, let's use the analogy of a conversation. Firstly the Frames would be considered the words, the Sequences the sentences and an Exchange the conversation that they are all part of. With that premise it is important to first address the most basic of physical layer problems, namely Code Violation Errors. Code Violation Errors are the consequence of bit errors caused by corruption that occur in the sequence - i.e. any character corruption. A typical cause of this would be a failing HBA that would eventually start to suffer from optic degradation prior to its complete failure. I also recently experienced at one site Code Violation Errors when several SAN ports had been left enabled after their servers had been decommissioned. Some might think what's the problem if they have nothing connected to them? In fact this scenario was creating millions of Code Violation Errors causing a CPU overhead on the SAN switch and subsequent degradation. With mission critical applications connected to the same SAN switch, performance problems became rife and without the identification of the Code Violation Errors could have led to weeks of troubleshooting with no success.

The build up of Code Violation Errors become even more troublesome as they eventually lead to what is referred to as a Loss of Sync. A Loss of Sync is usually indicative of incompatible speeds between points and again this is typical of optic degradation in the SAN infrastructure. For example if an SFP is failing, its optic signal will degrade and hence will not be at for example the 4Gbps it's set at. Case point: a transmitting device such as a HBA is set at 4Gbps while the receiving end i.e. the SFP (unbeknownst to the end user) has degraded down to 1Gbps. Severe performance problems will occur as the two points constantly struggle with their incompatible speeds. Hence it's an imperative to be alerted of any Loss of Sync as ultimately they are also an indication of an imminent Loss of Signal i.e. when the HBA or SFP are flapping and are about to fail. This leads to the nightmare scenario of an unplanned path failure in your SAN storage environment and worse still a possible outage if failover cannot occur.

One of the biggest culprits and a sure-fire hit to resolving performance problems is to look for what are termed CRC errors. CRC Errors usually indicate some kind of physical problem within the FC link and are indicative of code violation errors that have led to consequent corruption inside the FC data frame. Usually caused by a flapping SFP or a very old / bent / damaged cable, once CRC errors are acknowledged by the receiver, the receiver would reject the request leaving the Frame having to be resent. For example as an analogy imagine a newspaper delivery boy, who while cycling to his destination loses some of the pages of the paper prior to delivery. Upon delivery the receiver would request for the newspaper to be redelivered with the missing pages. This would entail the delivery boy having to cycle back to find the missing pages and bring back the newspaper as a whole. In the context of a CRC error a Frame that should typically take only a few milliseconds to deliver could take up to 60 seconds in being rejected and resent. Such response times can be catastrophic to a mission critical application and it's underlying business. By gaining an insight into CRC errors and their root cause one can immediately pinpoint which bent cable or old SFP is responsible and proactively replace them long before they start to cause poor application response times or even worse a loss to your business.

The other FC SAN gremlin is what is termed a Class 3 discard. Of the various services of data transport defined by the Fibre Channel ANSI Standard, the most commonly used is Class 3. Ideal for high throughput, Class-3 is essentially a datagram service based on frame switching and is a connectionless service. Class 3's main advantage comes from not giving an acknowledgement that a frame has been rejected or busied by a destination device or Fabric. The benefits of this are that it firstly significantly reduces the overhead on the transmitting device and secondly allows for more bandwidth availability for transmission which would otherwise be reduced. Furthermore the lack of acknowledgements removes the potential delays between devices caused by round-trips of information transfers. As for data integrity, Class 3 Flow control has this handled by higher-level protocols such as TCP due to Fibre Channel not checking the corrupted or missing frames. Hence any discovery of a corrupted packet by the higher-level protocol on the receiving device instantly initiates a retransmission of the sequence. All of this sounds great until the non-acknowledgement of rejected frames starts to also bring about Class 3's disadvantage. This is that inevitably a Fabric will become busy with traffic and will consequently discard frames, hence the name Class 3 discards. Due to this the receiving device's higher-level protocol's subsequent request for retransmission of sequences will then degrade the device and fabric throughput.

Another indication of Class 3 discards are zoning conflicts where a frame has been transmitted and cannot reach a destination, hence concluding in the SAN initiating a Class 3 discard. This is caused by either legacy or zoning mistakes where for example a decommissioned Storage system was not unzoned from a server or vice versa leading to continuous frames being discarded and degraded throughput as sequences are retransmitted. This then results in performance problems, potential application degradation and automatic finger pointing at the Storage System for a problem that can't automatically be identified. By resolving the zoning conflict and spreading the load of the SAN throughput across the right ports, the heavy traffic or zoning issues which cause the Class 3 discards can be quickly removed bringing immediate performance and throughput improvements. By gaining an insight into the occurrence and amount of Class 3 discards, huge performance problems can be quickly remediated before they occur and thus another reason as to why the Storage shouldn't automatically be blamed.

These are just some of the metrics / signs to look for which can ultimately save you from weeks of troubleshooting and guessing. By first acknowledging these metrics, identifying when they occur and proactively eliminating them, the SAN storage environment will quickly evolve and transform into a healthy, proactive and optimized one. Furthermore by eliminating each of these issues you also empower yourself by eliminating their consequent problems such as application slowdown, poor response times, unplanned outages and long drawn out troubleshooting exercises which eventually lead to fingerpointing fights. Ideally what will occur is a paradigm shift where instead of application owners complaining to the Storage team, the Storage team will proactively identify problems prior to their existence. Here lies the key to making the ‘always blaming the Storage' syndrome a thing of the past.

More Stories By Archie Hendryx

SAN, NAS, Back Up / Recovery & Virtualisation Specialist.

@ThingsExpo Stories
SYS-CON Events announced today that Dasher Technologies will exhibit at SYS-CON's 21st International Cloud Expo®, which will take place on Oct 31 - Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. Dasher Technologies, Inc. ® is a premier IT solution provider that delivers expert technical resources along with trusted account executives to architect and deliver complete IT solutions and services to help our clients execute their goals, plans and objectives. Since 1999, we'v...
Enterprises have taken advantage of IoT to achieve important revenue and cost advantages. What is less apparent is how incumbent enterprises operating at scale have, following success with IoT, built analytic, operations management and software development capabilities – ranging from autonomous vehicles to manageable robotics installations. They have embraced these capabilities as if they were Silicon Valley startups. As a result, many firms employ new business models that place enormous impor...
SYS-CON Events announced today that Massive Networks, that helps your business operate seamlessly with fast, reliable, and secure internet and network solutions, has been named "Exhibitor" of SYS-CON's 21st International Cloud Expo ®, which will take place on Oct 31 - Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. As a premier telecommunications provider, Massive Networks is headquartered out of Louisville, Colorado. With years of experience under their belt, their team of...
SYS-CON Events announced today that Taica will exhibit at the Japan External Trade Organization (JETRO) Pavilion at SYS-CON's 21st International Cloud Expo®, which will take place on Oct 31 – Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. Taica manufacturers Alpha-GEL brand silicone components and materials, which maintain outstanding performance over a wide temperature range -40C to +200C. For more information, visit http://www.taica.co.jp/english/.
SYS-CON Events announced today that TidalScale, a leading provider of systems and services, will exhibit at SYS-CON's 21st International Cloud Expo®, which will take place on Oct 31 - Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. TidalScale has been involved in shaping the computing landscape. They've designed, developed and deployed some of the most important and successful systems and services in the history of the computing industry - internet, Ethernet, operating s...
SYS-CON Events announced today that MIRAI Inc. will exhibit at the Japan External Trade Organization (JETRO) Pavilion at SYS-CON's 21st International Cloud Expo®, which will take place on Oct 31 – Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. MIRAI Inc. are IT consultants from the public sector whose mission is to solve social issues by technology and innovation and to create a meaningful future for people.
SYS-CON Events announced today that IBM has been named “Diamond Sponsor” of SYS-CON's 21st Cloud Expo, which will take place on October 31 through November 2nd 2017 at the Santa Clara Convention Center in Santa Clara, California.
SYS-CON Events announced today that TidalScale will exhibit at SYS-CON's 21st International Cloud Expo®, which will take place on Oct 31 – Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. TidalScale is the leading provider of Software-Defined Servers that bring flexibility to modern data centers by right-sizing servers on the fly to fit any data set or workload. TidalScale’s award-winning inverse hypervisor technology combines multiple commodity servers (including their ass...
Join IBM November 1 at 21st Cloud Expo at the Santa Clara Convention Center in Santa Clara, CA, and learn how IBM Watson can bring cognitive services and AI to intelligent, unmanned systems. Cognitive analysis impacts today’s systems with unparalleled ability that were previously available only to manned, back-end operations. Thanks to cloud processing, IBM Watson can bring cognitive services and AI to intelligent, unmanned systems. Imagine a robot vacuum that becomes your personal assistant tha...
Widespread fragmentation is stalling the growth of the IIoT and making it difficult for partners to work together. The number of software platforms, apps, hardware and connectivity standards is creating paralysis among businesses that are afraid of being locked into a solution. EdgeX Foundry is unifying the community around a common IoT edge framework and an ecosystem of interoperable components.
Infoblox delivers Actionable Network Intelligence to enterprise, government, and service provider customers around the world. They are the industry leader in DNS, DHCP, and IP address management, the category known as DDI. We empower thousands of organizations to control and secure their networks from the core-enabling them to increase efficiency and visibility, improve customer service, and meet compliance requirements.
As popularity of the smart home is growing and continues to go mainstream, technological factors play a greater role. The IoT protocol houses the interoperability battery consumption, security, and configuration of a smart home device, and it can be difficult for companies to choose the right kind for their product. For both DIY and professionally installed smart homes, developers need to consider each of these elements for their product to be successful in the market and current smart homes.
With major technology companies and startups seriously embracing Cloud strategies, now is the perfect time to attend 21st Cloud Expo October 31 - November 2, 2017, at the Santa Clara Convention Center, CA, and June 12-14, 2018, at the Javits Center in New York City, NY, and learn what is going on, contribute to the discussions, and ensure that your enterprise is on the right path to Digital Transformation.
SYS-CON Events announced today that mruby Forum will exhibit at the Japan External Trade Organization (JETRO) Pavilion at SYS-CON's 21st International Cloud Expo®, which will take place on Oct 31 – Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. mruby is the lightweight implementation of the Ruby language. We introduce mruby and the mruby IoT framework that enhances development productivity. For more information, visit http://forum.mruby.org/.
Digital transformation is changing the face of business. The IDC predicts that enterprises will commit to a massive new scale of digital transformation, to stake out leadership positions in the "digital transformation economy." Accordingly, attendees at the upcoming Cloud Expo | @ThingsExpo at the Santa Clara Convention Center in Santa Clara, CA, Oct 31-Nov 2, will find fresh new content in a new track called Enterprise Cloud & Digital Transformation.
Most technology leaders, contemporary and from the hardware era, are reshaping their businesses to do software. They hope to capture value from emerging technologies such as IoT, SDN, and AI. Ultimately, irrespective of the vertical, it is about deriving value from independent software applications participating in an ecosystem as one comprehensive solution. In his session at @ThingsExpo, Kausik Sridhar, founder and CTO of Pulzze Systems, will discuss how given the magnitude of today's applicati...
Smart cities have the potential to change our lives at so many levels for citizens: less pollution, reduced parking obstacles, better health, education and more energy savings. Real-time data streaming and the Internet of Things (IoT) possess the power to turn this vision into a reality. However, most organizations today are building their data infrastructure to focus solely on addressing immediate business needs vs. a platform capable of quickly adapting emerging technologies to address future ...
SYS-CON Events announced today that NetApp has been named “Bronze Sponsor” of SYS-CON's 21st International Cloud Expo®, which will take place on Oct 31 – Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. NetApp is the data authority for hybrid cloud. NetApp provides a full range of hybrid cloud data services that simplify management of applications and data across cloud and on-premises environments to accelerate digital transformation. Together with their partners, NetApp emp...
In a recent survey, Sumo Logic surveyed 1,500 customers who employ cloud services such as Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP). According to the survey, a quarter of the respondents have already deployed Docker containers and nearly as many (23 percent) are employing the AWS Lambda serverless computing framework. It’s clear: serverless is here to stay. The adoption does come with some needed changes, within both application development and operations. Tha...
SYS-CON Events announced today that Avere Systems, a leading provider of enterprise storage for the hybrid cloud, will exhibit at SYS-CON's 21st International Cloud Expo®, which will take place on Oct 31 - Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. Avere delivers a more modern architectural approach to storage that doesn't require the overprovisioning of storage capacity to achieve performance, overspending on expensive storage media for inactive data or the overbui...