Containers Expo Blog: Article

Data Efficiency at Scale

Overcoming limitations in data efficiency features

The initial wave of data efficiency features for primary storage focused on silos of information organized in terms of individual file systems. Deduplication and compression features provided by some vendors are limited by the scalability of those underlying file systems; in effect, the file systems have become silos of optimized data. For example, NetApp deduplication can't scale beyond a 100 TB limit, because that's the size limit of its WAFL file system. But ask anyone who's ever used NetApp deduplication if they've done it on a 100 TB file system, and you're likely to hear "are you crazy?" It's one thing to claim that data efficiency features can scale, and quite another to actually use them with good performance at scale.

Challenges around scalability generally center on two areas: scalability of random IO and memory overhead. Older solutions, like the one from NetApp, face the first challenge, while newer flash-based storage systems are struggling with the second. I'll review both here:

The IO Challenge
Primary storage devices handle both streaming and random IO and are therefore sensitive to latency effects. Data efficiency features for primary storage need fast hashing techniques to keep the added latency low. Fast hashes are non-cryptographic in nature and so require a data comparison when used for deduplication. It works like this:

  1. When a new chunk of data is read in, it is first given a name using the hash algorithm.
  2. The system then checks a deduplication index to see if a chunk with that name has been seen before (note that this can consume disk IO and tremendous amounts of memory if done wrong).
  3. If the name has been seen, we need to take extra steps. Because fast hashes are non-cryptographic, it is possible to have a name match while the data content differs. This is known in computer science as a hash collision. To account for this, the existing copy of the chunk must be read in and compared bit-by-bit to the new chunk. If they match, only a reference to the chunk is created. If not, then the new chunk must be written.
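
The three steps above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not any vendor's implementation: the `DedupStore` class, its `compare_reads` counter, and the choice of CRC32 as the fast hash are all assumptions made for the example.

```python
import zlib


class DedupStore:
    """Toy chunk store: fast-hash index plus byte-for-byte verification."""

    def __init__(self, hash_fn=zlib.crc32):
        self.hash_fn = hash_fn  # assumed fast, non-cryptographic hash
        self.chunks = []        # stored chunk payloads, addressed by position
        self.index = {}         # chunk name -> positions of chunks with that name
        self.compare_reads = 0  # read-backs done to verify a name match

    def write(self, data: bytes) -> int:
        """Store a chunk and return the position that now holds its content."""
        name = self.hash_fn(data)                # step 1: name the chunk
        for pos in self.index.get(name, []):     # step 2: has the name been seen?
            self.compare_reads += 1              # step 3: read the existing copy...
            if self.chunks[pos] == data:         # ...and compare it to the new chunk
                return pos                       # true duplicate: keep a reference only
        # Unseen name, or a collision with different content: write the chunk.
        self.chunks.append(data)
        pos = len(self.chunks) - 1
        self.index.setdefault(name, []).append(pos)
        return pos


store = DedupStore()
first = store.write(b"A" * 4096)
other = store.write(b"B" * 4096)
again = store.write(b"A" * 4096)   # duplicate of the first chunk

# Force every chunk to share one name to exercise the collision path.
colliding = DedupStore(hash_fn=lambda data: 0)
x = colliding.write(b"A" * 4096)
y = colliding.write(b"B" * 4096)   # same name, different content
```

Every duplicate costs at least one read-back of the existing chunk, which is why `compare_reads` grows with the amount of duplicate data. On a real system that read goes to disk rather than to a Python list.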

Essentially, this form of deduplication means trading a write of a duplicate chunk for a read. Depending on the design of the underlying block virtualization layer, duplicate chunks may be widely dispersed throughout the system. In that case, the bigger the system gets, the more expensive reads get, so processing of duplicate data becomes slower and slower as the storage system fills. This is why you won't find many 100 TB NetApp file systems with deduplication turned on, and certainly not for primary storage applications: the system would be flooded with random read requests, and NetApp's deduplication process could take months or years, or never complete at all.

A number of techniques have been used to reduce the impact of IO in other products. For example, the Hitachi NAS (HNAS) and Hitachi Unified Storage (HUS) solutions from HDS make use of hardware acceleration to generate cryptographically secure hashes that do not require a data compare at all. This allows for linear scaling of deduplication performance on volumes up to 256 TB in size. Data is also written out before it is deduplicated, to avoid introducing any latency through the hash computation process itself.
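
To show why a cryptographically secure hash removes the read compare, here is a hedged software sketch. The article does not say which hash the HDS products compute in hardware, so SHA-256 from Python's standard `hashlib` is an assumption, and `CryptoDedupStore` is a hypothetical name used only for illustration.

```python
import hashlib


class CryptoDedupStore:
    """Toy chunk store that trusts a cryptographic digest as the chunk identity."""

    def __init__(self):
        self.chunks = {}  # SHA-256 digest -> chunk payload

    def write(self, data: bytes) -> bytes:
        """Store a chunk if its digest is new and return the digest as its reference."""
        digest = hashlib.sha256(data).digest()
        # A digest match is treated as a content match, so the existing
        # chunk is never read back; setdefault stores the payload only if new.
        self.chunks.setdefault(digest, data)
        return digest


crypto_store = CryptoDedupStore()
ref_a = crypto_store.write(b"A" * 4096)
ref_b = crypto_store.write(b"B" * 4096)
ref_dup = crypto_store.write(b"A" * 4096)  # deduplicated without a compare
```

The trade-off is compute: a secure digest costs more per chunk than a fast hash, which is the cost that hardware acceleration is meant to absorb.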

Permabit's own Albireo Virtual Data Optimizer (VDO) product, a plug-in module for Linux-based storage solutions, takes a different approach but with a similar result. VDO works inline to provide immediate data reduction. When data is written out, the VDO process intelligently lays it out in a sequential pattern, so that subsequent read compares of duplicates are more likely to be sequential as well. Both solutions do a fine job of solving the problem in real-world scenarios; they just take different approaches.

The Memory Challenge
Many of today's flash array vendors provide deduplication using fast hashing techniques similar to the ones I outlined above. With flash, the cost of doing random reads for read compares is a non-issue (random seeks on flash are much less expensive than on hard drives), so the use of the fast hash alone is enough to minimize latency. These systems (such as EMC's recently launched XtremIO product) are focused on delivering performance, and the big challenge to performance at scale is available memory (DRAM).

As above, after chunks are read in, they are named using a fast hashing algorithm. After that, the flash system must determine whether or not a chunk has been seen before. To get at this information as quickly as possible, flash-based storage systems have tended to use huge amounts of DRAM to cache chunk names in memory. It's not uncommon to see flash storage systems that allocate 16 GB of working cache per TB of storage. To support a 256 TB storage volume, such a system would require 4 TB of DRAM. The increased hard costs of more expensive (denser) DIMMs, together with the increased cost of the server board required to support that many DIMMs, make this an extremely costly and unpopular proposition. Combine this with the fact that DRAM prices are not falling at the same rate as flash prices, and you can see why no vendor today makes a 256 TB flash storage array with global deduplication capabilities.

The solution to the memory challenge is coming, in the form of a next generation of flash storage products that utilize Albireo indexing and Albireo VDO. Unlike the flash arrays described above, flash-optimized arrays with VDO take advantage of advanced caching techniques to operate with 128 MB of working cache per TB of storage and still deliver excellent performance. With VDO, a 256 TB system can be delivered with as little as 32 GB of RAM while delivering 1M IOPS performance. The net result is a cost-effective and easily deployed data efficiency solution for flash arrays.
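
The memory figures quoted in this section follow from simple multiplication, worked through below. The helper name `index_dram_gb` is hypothetical, and the sketch assumes binary units (1 GB = 1024 MB).

```python
def index_dram_gb(capacity_tb: float, cache_mb_per_tb: float) -> float:
    """Working cache needed for a volume, in GB, assuming 1 GB = 1024 MB."""
    return capacity_tb * cache_mb_per_tb / 1024


# 16 GB of cache per TB, as quoted for typical flash arrays, on a 256 TB volume.
conventional_gb = index_dram_gb(256, 16 * 1024)   # 4096.0 GB, i.e. 4 TB

# 128 MB of cache per TB, as quoted for VDO, on the same volume.
vdo_gb = index_dram_gb(256, 128)                  # 32.0 GB

reduction = conventional_gb / vdo_gb              # 128.0
```

At these per-TB figures the working cache shrinks by a factor of 128 regardless of volume size, since both requirements scale linearly with capacity.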

Conclusion

[Table: Deduplication Scalability by Vendor]

As you can see in the table above, forward-thinking vendors like HDS have done a good job of overcoming limitations in their data efficiency features and have products on the market today that can scale to meet the requirements of the large enterprise. Many other vendors are lagging behind because of their inability to address IO and/or memory requirements. That is a serious shortcoming, since data efficiency is at the core of distinguishing storage solutions, a critical end user requirement, and a 'must have' component for 2014. Permabit's VDO product overcomes both of these limitations through the use of advanced memory-efficient caching techniques.

More Stories By Louis Imershein

As Senior Director of Product Strategy at Permabit Technology Corporation, Louis Imershein is responsible for product evolution and strategic planning for the Albireo family of products. He has 22 years of technical leadership experience in product management, software development and support. Prior to joining Permabit, Imershein was a Senior Product Marketing Manager for the Sun Microsystems Data Management Group. He has a Bachelor's degree in Biological Science from the University of California, Santa Cruz.
