What Is the Difference Between a Data Lake and a Data Warehouse?
By Dave Kellermanns

Data lake or data warehouse - what do they do and which one is right for you?

The data warehouse and the data lake are two different types of data storage repository. The warehouse integrates data from disparate sources and is suited to business reporting. The lake stores raw structured and unstructured data in whatever form the data source provides, and it does not require prior knowledge of the analyses you think you want to perform.

What is a Data Lake?
A data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed. While a hierarchical data warehouse stores data in files or folders, a data lake uses a flat architecture to store data.
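
To make that concrete, here's a minimal sketch of lake-style ingestion, with a local directory standing in for an object store; the file keys and payloads are invented for illustration.

# Lake-style landing: raw records are stored in their native format under flat
# keys, with no upfront schema. A local "lake/" directory stands in for an
# object store; keys and payloads below are hypothetical.
import json
from pathlib import Path

LAKE = Path("lake")
LAKE.mkdir(exist_ok=True)

def land_raw(key: str, payload: bytes) -> Path:
    """Store the payload exactly as received: no parsing, no transformation."""
    target = LAKE / key            # flat key space rather than a deep folder tree
    target.write_bytes(payload)
    return target

# Data arrives in whatever form the source provides.
land_raw("clickstream-2019-06-01.json",
         json.dumps({"user": 42, "page": "/pricing"}).encode())
land_raw("sensor-readings.csv", b"device,temp_c\nA1,21.4\nB7,19.9\n")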

What is a Data Warehouse?
A core component of business intelligence, the data warehouse is a central repository of integrated data from one or more disparate sources, used for reporting and data analysis. When the board makes a strategic decision about the company's future, or a call center agent reviews a customer's profile, the data is typically being sourced from a data warehouse.
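
As a rough sketch of that idea, the snippet below pulls two hypothetical sources into one structured reporting table, with SQLite standing in for a real warehouse engine; the table and column names are made up for illustration.

# Warehouse-style integration: disparate sources are combined into one
# structured, query-ready table with a schema defined up front. SQLite is a
# stand-in here; the sources and columns are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer_profile (
        customer_id    INTEGER PRIMARY KEY,
        name           TEXT,
        lifetime_spend REAL
    )
""")

crm_rows = [(1, "Acme Ltd"), (2, "Globex")]        # e.g. a CRM export
billing_totals = {1: 1250.0, 2: 310.5}             # e.g. a billing system

conn.executemany(
    "INSERT INTO customer_profile VALUES (?, ?, ?)",
    [(cid, name, billing_totals.get(cid, 0.0)) for cid, name in crm_rows],
)

# Reporting queries run against the cleaned, integrated table.
for name, spend in conn.execute(
        "SELECT name, lifetime_spend FROM customer_profile ORDER BY lifetime_spend DESC"):
    print(name, spend)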

Which Should You Choose?
As the definitions above suggest, the answer depends on how the data will be used: the warehouse delivers integrated, structured data for reporting and analysis, while the lake holds raw data in its native format until someone needs it.

For an analogous definition of a data lake, who better to turn to than the person credited with coining the phrase in the first place: James Dixon, the founder of Pentaho, the Big Data analytics company. He explains, "Think of a data warehouse as a store of bottled water-it's cleansed, packaged, and structured for easy consumption. The data lake meanwhile is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples."

We can classify the key differences like so:

1. Data Retention
Put simply, data lakes retain all data, while data warehouses do not. During the data warehouse development phase, decisions are made about which data sources to use and which business processes are required. If data isn't required to answer specific questions or to populate a defined report, it is often excluded from the warehouse in order to reduce cost and optimize performance. Meanwhile, a data lake stores all the data, relevant or not. This is possible because the lake resides on lower-cost storage hardware.
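
Here's a small sketch of that retention difference; the event fields and the list of fields required for reporting are hypothetical.

# Retention difference: the warehouse load keeps only fields needed by defined
# reports, while the lake keeps every record whole. All field names here are
# hypothetical.
raw_events = [
    {"order_id": 1, "amount": 99.0, "coupon": None, "user_agent": "Mozilla/5.0", "notes": "gift"},
    {"order_id": 2, "amount": 15.5, "coupon": "SPRING", "user_agent": "curl/7.58", "notes": ""},
]

REPORTING_FIELDS = {"order_id", "amount", "coupon"}   # decided during warehouse design

# Warehouse: trimmed to what the agreed reports need.
warehouse_rows = [{k: v for k, v in event.items() if k in REPORTING_FIELDS}
                  for event in raw_events]

# Lake: everything is retained, relevant or not, in case it is useful later.
lake_records = list(raw_events)

print(warehouse_rows[0])   # no user_agent or notes
print(lake_records[0])     # the full, untouched record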

2. Data Type
Most data warehouses store transaction-system data or quantitative metrics, ignoring unstructured sources such as images, text, or sensor data. Why? Because it's expensive to store them. Data lakes aren't so picky: they absorb all data, irrespective of volume and variety. Data is stored in its raw form and only transformed when it is needed. This is "schema on read" as opposed to the warehouse's "schema on write".
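
The contrast is easiest to see in a small sketch; the sensor record below is invented for illustration.

# Schema on write vs. schema on read, on one hypothetical sensor record.
import json

raw_line = '{"device": "A1", "temp": "21.4", "firmware": "2.3"}'

# Schema on write (warehouse): the record is parsed, typed and trimmed to the
# agreed schema before it is stored; fields outside the schema are dropped.
def load_to_warehouse(line: str) -> tuple:
    record = json.loads(line)
    return (str(record["device"]), float(record["temp"]))   # fixed (device, temp_c) shape

# Schema on read (lake): the raw line is stored untouched, and structure is
# applied only at query time, so later questions can still reach new fields.
def query_lake(line: str, wanted: list) -> dict:
    record = json.loads(line)
    return {field: record.get(field) for field in wanted}

print(load_to_warehouse(raw_line))                    # ('A1', 21.4)
print(query_lake(raw_line, ["device", "firmware"]))   # firmware survives in the lake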

3. User
Data lake users are a more varied group than data warehouse users. The lake supports people with "Operations" in their title, who use the data to pull reports quickly and get analytics to the board for accelerated decision-making. It supports users performing more in-depth analysis, perhaps using a data warehouse as a source and then going back to the source systems for further detail. And it supports users wanting even deeper-dive analysis: data scientists, for example, mashing together different types of data and coming up with entirely new questions to answer.

4. Changes
Business today is all about agility; however, many data warehouses are not configured for rapid change. The complexity of the data loading process and the work done to make analysis and reporting easy make change unnecessarily slow and expensive. Not so in the data lake. Because data is stored in its raw format and is always accessible, users can go beyond the structure of the warehouse to explore data in novel ways and answer their questions at their own pace.
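
As a sketch of that agility, the snippet below answers a brand-new question straight from raw lake records, without touching any warehouse schema or loading job; the records and the question are made up.

# Ad-hoc exploration of raw lake data: parse on read and aggregate however the
# new question demands. Records and the question are hypothetical.
import json

raw_lake_lines = [
    '{"user": 1, "plan": "trial", "page": "/pricing"}',
    '{"user": 2, "plan": "paid",  "page": "/docs"}',
    '{"user": 3, "plan": "trial", "page": "/pricing"}',
]

# New question, asked today: which pages do trial users visit most?
page_counts = {}
for line in raw_lake_lines:
    event = json.loads(line)
    if event["plan"] == "trial":
        page_counts[event["page"]] = page_counts.get(event["page"], 0) + 1

print(page_counts)   # {'/pricing': 2}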

Here is a synopsis of the differences between the two environments:

              Data warehouse                        Data lake
DATA          Structured, processed                 Structured / semi-structured / unstructured / raw
PROCESSING    Schema-on-write                       Schema-on-read
STORAGE       Expensive for large data volumes      Designed for low-cost storage
AGILITY       Less agile; fixed configuration       Highly agile; configure as required
SECURITY      Mature                                Maturing
USERS         Business professionals                Data scientists et al.


More Stories By Automic Blog

Automic, a leader in business automation, helps enterprises drive competitive advantage by automating their IT factory - from on-premise to the Cloud, Big Data and the Internet of Things.

With offices across North America, Europe and Asia-Pacific, Automic powers over 2,600 customers including Bosch, PSA, BT, Carphone Warehouse, Deutsche Post, Societe Generale, TUI and Swisscom. The company is privately held by EQT. More information can be found at www.automic.com.
