Click here to close now.

Welcome!

Containers Expo Blog Authors: Pat Romanski, Jason Bloomberg, Plutora Blog, Elizabeth White, Bill Talbot

Related Topics: @BigDataExpo, Java IoT, Open Source Cloud, Containers Expo Blog, Server Monitoring, @CloudExpo

@BigDataExpo: Blog Feed Post

Big Data – Storage Mediums and Data Structures

Big Data presents something of a storage dilemma. There is no one data store to rule them all.

My working title was Big Data, Storage Dilemma.

They say dilemma. I say dilemna. I'm serious. I spell it dilemna.

Big Data presents something of a storage dilemma. There is no one data store to rule them all.

Should different data structures be persisted to different storage mediums?

Storage Medium
Identifying the appropriate medium is a function of performance, cost, and capacity.

Random Access Memory
It's fast. Very. It's expense. Very.

If we are configuring an HP DL980 G7 server, it will cost $3,672 for 128GB of memory with 8x 16GB modules or $9,996 for 128GB of memory with 4x 32GB modules. That is $28-78 / GB.

We can configure an HP DL980 G7 server with up to 4TB of memory.

Solid State Drive
It's not as fast or as expensive as random access memory. It's faster than a hard disk drive. A lot faster. It's more expensive than a hard disk drive. A lot more expensive.

If we are configuring an HP DL980 G7 server, it will cost $4,199 for a 400GB SAS MLC drive or $7,419 for a 400GB SAS SLC drive. That is $10-18 / GB. It's a performance / size trade-off. While SLC drives often perform better, MLC drives are often available in larger sizes. Further, the read / write performance is either symmetric or asymmetric.

That, and SLC drives often have greater endurance.


Sequential
Read
(MB/s)
Sequential
Write
(MB/s)
Random
Read
(IOPS)
Random
Write
(IOPS)
Intel 520 (480GB) 550 520 50,000 50,000
Intel 520 (240GB) 550 520 50,000 80,000
Intel DC S3700 500 460 75,000 36,000

The performance of solid state drives can vary:

  • consumer / enterprise
  • MLC / eMLC / SLC
  • SATA / SAS / PCIe

The capacity of enterprise solid state drives is often less than that of enterprise hard disk drives (4TB). However, the capacity of PCIe drives such as the Fusion-io ioDrive Octal (5.12TB / 10.24TB) is greater than that of enterprise hard disk drives (4TB). In addition, the performance of PCIe drives is greater than that of SAS or SATA drives.


Sequential
Read
(MB/s)
Sequential
Write
(MB/s)
Random
Read
(IOPS)
Random
Write
(IOPS)
Intel 910 (800GB) 2,000 1,000 180,000 75,000

Hard Disk Drive

It may not be fast, but it is inexpensive.

If we are configuring an HP DL980 G7 server, it will cost $309 for a 300GB 10K drive or $649 for a 300GB 15K drive. That is $1-2 / GB. It's a performance / size trade-off. While a 15K drive will perform better than a 10K drive, it will often have less capacity. The sequential read / write performance of hard disk drives is often between 150-200MB/s. As such, a RAID configuration may be a cost effective alternative to a single solid state drive for sequential read / write access.

The capacity of enterprise hard disk drives (4TB) is often greater than that of enterprise solid state drives.

Data Structure
Hash Table

JBoss Data Grid is an in-memory data grid with the data stored in a hash table. However, JBoss Data Grid supports persistence via write-behind or write-through. For all intents and purposes (e.g. map / reduce), all of the data should fit in memory.

Access

  • Random Reads, Random Writes

Riak is a key / value store. If persistence has been configured with Bitcast, the data is stored in a log structured hash table. An in-memory hash table contains the key / value pointers. The value is a pointer to the data. As such, all of the key / value pointers must fit in memory. The data is persisted via append only log files.

Access

  • Complex Index in Memory
  • Random Reads, Sequential Writes

Point queries perform well in hash tables. Range queries do not. Though there is always map / reduce.

B-Tree
MongoDB is a document database with the data stored in a B-Tree. The data is persisted via memory mapped files.

Access

  • Partial Index (Internal Nodes) in Memory
  • Random Reads, Sequential Reads, Random Writes

CouchDB is a document database with the data stored in a B+Tree. The data is persisted via append only log files.

Access

  • Partial Index in Memory
  • Random Reads, Sequential Reads, Sequential Writes

In a B+ Tree, the data is only stored in the leaf nodes. In a B-Tree, the data is stored in both the internal nodes and the leaf nodes. The advantage of a B+ Tree is that the leaf nodes are linked. As a result, range queries perform better with a B+ Tree. However, point queries perform better with a B-Tree.

CouchDB has implemented an append only B+ Tree. An alternative is the copy-on-write (CoW) B+ Tree. A write optimization for a B-Tree is buffering. First, the data written to an internal buffer in an internal node. Second, the buffer is flushed to a leaf node. As a result, random writes are turned in to sequential writes. The cost of random writes is thus amortized.  However, I am not aware of any open source data stores that have implemented a CoW B+ Tree or have implemented buffering with a B-Tree.

Range queries perform well with a B/B+ Tree. Point queries, not as well as hash tables.

Log Structured Merge Tree
Apache HBase and Apache Cassandra are both column oriented data stores, and they have both store data in an LSM-Tree. Apache HBase has implemented a cache oblivious lookahead array (COLA). Apache Cassandra has implemented a sorted array merge tree (SAMT).

In both implementations, a write is first written to a write-ahead log. Next, it is written to a memtable. A memtable is an in-memory sorted string table (SSTable). Later, the memtable is flushed and persisted as an SSTable. As result, random writes are turned in to sequential writes. However, a point query with an LSM-Tree may require multiple random reads.

Apache HBase implemented a single-level index with HFile version 1. However, because every index was cached, it resulted in high memory usage.

Apache HBase implemented a multi-level index a la a B+ Tree with HFile (SSTable) version 2. The SSTable contains a root index, a root index with leaf indexes, or a root index with an intermediate index and leaf indexes. The root index is stored in memory. The intermediate and leaf indexes may be stored in the block cache. In addition, the SSTable contains a compound (block level) bloom filter. The bloom filter is used to determine if the data is not in the data block.

Recommendation / Access

  • Partial Index / Bloom Filter in Memory
  • Random Reads, Sequential Reads, Sequential Writes

A read optimization for an LSM-Tree is fractional cascading. The idea is that each level contains both data and pointers to data in the next level. However, I am not aware of any open source data stores that have implemented fractional cascading.

I like the idea of a hybrid / tiered storage solution for an LSM-Tree implementation with the first level in memory, second level on solid state drives, and third level on hard disk drives. I've seen this solution described in academia, but I am not aware of any open source implementations.

Conclusion
I find it interesting that key / value stores often store data in a hash table, that document stores often store data in a B+/B-Tree, and that column oriented stores often store data in an LSM-Tree. Perhaps it is because key / value stores focus on the performance of point queries, document stores support secondary indexes (MongoDB) / views  (CouchDB), and column oriented stores support range queries. Perhaps it because document stores were not created with distribution (shards / partitions) in mind and thus assume that the complete index can not be stored in memory.

Notes
I found this paper to be very helpful in examining data structures as it covers most of the ones highlighted in this post:

Efficient, Scalable, and Versatile Application and System Transaction Management for Direct Storage Layers (link)

Read the original blog entry...

More Stories By Daniel Thompson

I curate the content on this page, but the credit goes to my talented colleagues for the posts that you see here. Much of what you read on this page is the work of friends at How to JBoss, and I encourage you to drop by the site at http://www.howtojboss.com for some of the best JBoss technical and non-technical content for developers, architects and technology executives on the Web.

@ThingsExpo Stories
"We have seen the evolution of WebRTC right from the starting point to what it has become today, that people are using in real applications," noted Dr. Natasha Tamaskar, Vice President and Head of Cloud and Mobile Strategy and Ecosystem at GENBAND, in this SYS-CON.tv interview at WebRTC Summit, held June 9-11, 2015, at the Javits Center in New York City.
The 3rd International WebRTC Summit, to be held Nov. 4–6, 2014, at the Santa Clara Convention Center in Santa Clara, CA, announces that its Call for Papers is now open. Topics include all aspects of improving IT delivery by eliminating waste through automated business models leveraging cloud technologies. WebRTC Summit is co-located with 15th International Cloud Expo, 6th International Big Data Expo, 3rd International DevOps Summit and 2nd Internet of @ThingsExpo. WebRTC (Web-based Real-Time Communication) is an open source project supported by Google, Mozilla and Opera that aims to enable bro...
Internet of Things (IoT) will be a hybrid ecosystem of diverse devices and sensors collaborating with operational and enterprise systems to create the next big application. In their session at @ThingsExpo, Bramh Gupta, founder and CEO of robomq.io, and Fred Yatzeck, principal architect leading product development at robomq.io, discussed how choosing the right middleware and integration strategy from the get-go will enable IoT solution developers to adapt and grow with the industry, while at the same time reduce Time to Market (TTM) by using plug and play capabilities offered by a robust IoT ...
It is one thing to build single industrial IoT applications, but what will it take to build the Smart Cities and truly society-changing applications of the future? The technology won’t be the problem, it will be the number of parties that need to work together and be aligned in their motivation to succeed. In his session at @ThingsExpo, Jason Mondanaro, Director, Product Management at Metanga, discussed how you can plan to cooperate, partner, and form lasting all-star teams to change the world and it starts with business models and monetization strategies.
SYS-CON Events announced today that ProfitBricks, the provider of painless cloud infrastructure, will exhibit at SYS-CON's 17th International Cloud Expo®, which will take place on November 3–5, 2015, at the Santa Clara Convention Center in Santa Clara, CA. ProfitBricks is the IaaS provider that offers a painless cloud experience for all IT users, with no learning curve. ProfitBricks boasts flexible cloud servers and networking, an integrated Data Center Designer tool for visual control over the cloud and the best price/performance value available. ProfitBricks was named one of the coolest Clo...
The 4th International Internet of @ThingsExpo, co-located with the 17th International Cloud Expo - to be held November 3-5, 2015, at the Santa Clara Convention Center in Santa Clara, CA - announces that its Call for Papers is open. The Internet of Things (IoT) is the biggest idea since the creation of the Worldwide Web more than
17th Cloud Expo, taking place Nov 3-5, 2015, at the Santa Clara Convention Center in Santa Clara, CA, will feature technical sessions from a rock star conference faculty and the leading industry players in the world. Cloud computing is now being embraced by a majority of enterprises of all sizes. Yesterday's debate about public vs. private has transformed into the reality of hybrid cloud: a recent survey shows that 74% of enterprises have a hybrid cloud strategy. Meanwhile, 94% of enterprises are using some form of XaaS – software, platform, and infrastructure as a service.
"In the IoT space we are helping customers, mostly enterprises and industry verticals where time-to-value is critical, and we help them with the ability to do faster insights and actions using our platform so they can transform their business operations," explained Venkat Eswara, VP of Marketing at Vitria, in this SYS-CON.tv interview at @ThingsExpo, held June 9-11, 2015, at the Javits Center in New York City.
Discussions about cloud computing are evolving into discussions about enterprise IT in general. As enterprises increasingly migrate toward their own unique clouds, new issues such as the use of containers and microservices emerge to keep things interesting. In this Power Panel at 16th Cloud Expo, moderated by Conference Chair Roger Strukhoff, panelists addressed the state of cloud computing today, and what enterprise IT professionals need to know about how the latest topics and trends affect their organization.
WebRTC converts the entire network into a ubiquitous communications cloud thereby connecting anytime, anywhere through any point. In his session at WebRTC Summit,, Mark Castleman, EIR at Bell Labs and Head of Future X Labs, will discuss how the transformational nature of communications is achieved through the democratizing force of WebRTC. WebRTC is doing for voice what HTML did for web content.
To many people, IoT is a buzzword whose value is not understood. Many people think IoT is all about wearables and home automation. In his session at @ThingsExpo, Mike Kavis, Vice President & Principal Cloud Architect at Cloud Technology Partners, discussed some incredible game-changing use cases and how they are transforming industries like agriculture, manufacturing, health care, and smart cities. He will discuss cool technologies like smart dust, robotics, smart labels, and much more. Prepare to be blown away with a glimpse of the future.
Connected things, systems and people can provide information to other things, systems and people and initiate actions for each other that result in new service possibilities. By taking a look at the impact of Internet of Things when it transitions to a highly connected services marketplace we can understand how connecting the right “things” and leveraging the right partners can provide enormous impact to your business’ growth and success. In her general session at @ThingsExpo, Esmeralda Swartz, VP, Marketing Enterprise and Cloud at Ericsson, discussed how this exciting emergence of layers of...
The 17th International Cloud Expo has announced that its Call for Papers is open. 17th International Cloud Expo, to be held November 3-5, 2015, at the Santa Clara Convention Center in Santa Clara, CA, brings together Cloud Computing, APM, APIs, Microservices, Security, Big Data, Internet of Things, DevOps and WebRTC to one location. With cloud computing driving a higher percentage of enterprise IT budgets every year, it becomes increasingly important to plant your flag in this fast-expanding business opportunity. Submit your speaking proposal today!
The 5th International DevOps Summit, co-located with 17th International Cloud Expo – being held November 3-5, 2015, at the Santa Clara Convention Center in Santa Clara, CA – announces that its Call for Papers is open. Born out of proven success in agile development, cloud computing, and process automation, DevOps is a macro trend you cannot afford to miss. From showcase success stories from early adopters and web-scale businesses, DevOps is expanding to organizations of all sizes, including the world's largest enterprises – and delivering real results. Among the proven benefits, DevOps is corr...
The Internet of Things is not only adding billions of sensors and billions of terabytes to the Internet. It is also forcing a fundamental change in the way we envision Information Technology. For the first time, more data is being created by devices at the edge of the Internet rather than from centralized systems. What does this mean for today's IT professional? In this Power Panel at @ThingsExpo, moderated by Conference Chair Roger Strukhoff, panelists addressed this very serious issue of profound change in the industry.
SYS-CON Events announced today that kintone has been named “Bronze Sponsor” of SYS-CON's 17th International Cloud Expo®, which will take place on November 3–5, 2015, at the Santa Clara Convention Center in Santa Clara, CA. kintone promotes cloud-based workgroup productivity, transparency and profitability with a seamless collaboration space, build your own business application (BYOA) platform, and workflow automation system.
With major technology companies and startups seriously embracing IoT strategies, now is the perfect time to attend @ThingsExpo in Silicon Valley. Learn what is going on, contribute to the discussions, and ensure that your enterprise is as "IoT-Ready" as it can be! Internet of @ThingsExpo, taking place Nov 3-5, 2015, at the Santa Clara Convention Center in Santa Clara, CA, is co-located with 17th Cloud Expo and will feature technical sessions from a rock star conference faculty and the leading industry players in the world. The Internet of Things (IoT) is the most profound change in personal an...
Buzzword alert: Microservices and IoT at a DevOps conference? What could possibly go wrong? In this Power Panel at DevOps Summit, moderated by Jason Bloomberg, the leading expert on architecting agility for the enterprise and president of Intellyx, panelists peeled away the buzz and discuss the important architectural principles behind implementing IoT solutions for the enterprise. As remote IoT devices and sensors become increasingly intelligent, they become part of our distributed cloud environment, and we must architect and code accordingly. At the very least, you'll have no problem fillin...
Explosive growth in connected devices. Enormous amounts of data for collection and analysis. Critical use of data for split-second decision making and actionable information. All three are factors in making the Internet of Things a reality. Yet, any one factor would have an IT organization pondering its infrastructure strategy. How should your organization enhance its IT framework to enable an Internet of Things implementation? In his session at @ThingsExpo, James Kirkland, Red Hat's Chief Architect for the Internet of Things and Intelligent Systems, described how to revolutionize your archit...
SYS-CON Events announced today that Secure Infrastructure & Services will exhibit at SYS-CON's 17th International Cloud Expo®, which will take place on November 3–5, 2015, at the Santa Clara Convention Center in Santa Clara, CA. Secure Infrastructure & Services (SIAS) is a managed services provider of cloud computing solutions for the IBM Power Systems market. The company helps mid-market firms built on IBM hardware platforms to deploy new levels of reliable and cost-effective computing and high availability solutions, leveraging the cloud and the benefits of Infrastructure-as-a-Service (IaaS...