Databases vs. Data Warehouses vs. Data Lakes

Irene Yu
4 min read · Aug 30, 2024


This essay is featured in August 2024’s issue of the Skiplevel: Tech for Product Managers newsletter. Every month I share a Tech Term You Should Know (TTYSK) and an essay to level up your technical chops and get the most out of dev teams.

Chances are you know what a database is and when it is used. Databases are designed for real-time, transactional data, which means they’re built to handle constant updates and queries.

But what about data warehouses and data lakes?

Imagine your team is gearing up to launch the next big feature that relies heavily on advanced analytics to personalize the user experience. To support the feature, one person suggests using a data warehouse to dig into historical data, while someone else points out that a data lake might be better for handling all the unstructured data you’ll be collecting.

If these terms are a bit fuzzy to you, it’ll be hard to keep up during the discussion or understand the long-term implications of the decision.

So let’s break down what each of these data storage technologies does, how they differ and when to use them.

Data Lake vs. Data Warehouse

What is a Data Warehouse?

While a database handles real-time processing of transactions, a data warehouse is built for analyzing data.
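
To make that difference concrete, here is a minimal sketch of the two kinds of work. It uses Python's built-in sqlite3 module as a stand-in for both systems, and the table names, the `transfer` function, and the sample values are all made up for illustration. The first part is a typical database job: a small change that touches a few rows and must fully succeed or fully fail. The second part is a typical warehouse-style question: scan the history and summarize it.

```python
import sqlite3

# In-memory SQLite is only a stand-in here; real systems would be separate products.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
conn.execute("CREATE TABLE transfers (from_id INTEGER, to_id INTEGER, amount REAL, month TEXT)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100.0), (2, 50.0)])
conn.commit()


def transfer(from_id, to_id, amount, month):
    """Transactional work: a small update that is applied all-or-nothing."""
    with conn:  # commits if the block succeeds, rolls back if it raises
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, from_id))
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, to_id))
        conn.execute("INSERT INTO transfers VALUES (?, ?, ?, ?)", (from_id, to_id, amount, month))


transfer(1, 2, 30.0, "2024-07")
transfer(2, 1, 10.0, "2024-08")

# Analytical work: read across the whole history and aggregate it.
monthly_totals = conn.execute(
    "SELECT month, COUNT(*), SUM(amount) FROM transfers GROUP BY month ORDER BY month"
).fetchall()
balances = dict(conn.execute("SELECT id, balance FROM accounts").fetchall())

print(monthly_totals)
print(balances)
```

In practice the second kind of query gets slow and expensive on a busy production database, which is one reason teams move historical data into a separate system built for it.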

Data warehouses are optimized for querying and reporting. They collect and store data from multiple sources, often through a process known as ETL (Extract, Transform, Load), where data is cleaned and organized before it’s stored. This makes it easier to generate reports, track trends over time, and make strategic decisions based on historical data.
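
If ETL still feels abstract, the three steps can be sketched in a few lines of Python. This is an illustration only: the sample records, the `transform` and `load` functions, and the `orders` table are hypothetical, and an in-memory SQLite database stands in for a real warehouse such as Redshift, BigQuery, or Snowflake.

```python
import sqlite3

# Extract: hypothetical raw records as they might arrive from a source system.
raw_orders = [
    {"order_id": "1001", "customer": " alice ", "amount": "19.99", "date": "2024-08-01"},
    {"order_id": "1002", "customer": "bob", "amount": "5.00", "date": "2024-08-01"},
    {"order_id": "1003", "customer": "Alice", "amount": "not-a-number", "date": "2024-08-02"},
]


def transform(records):
    """Transform: clean names, convert types, and skip rows that cannot be parsed."""
    cleaned = []
    for record in records:
        try:
            amount = float(record["amount"])
        except ValueError:
            continue  # a real pipeline would usually log or quarantine bad rows
        cleaned.append(
            (int(record["order_id"]), record["customer"].strip().title(), amount, record["date"])
        )
    return cleaned


def load(conn, rows):
    """Load: write the cleaned rows into a table with a fixed structure."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL, order_date TEXT)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", rows)
    conn.commit()


warehouse = sqlite3.connect(":memory:")  # stand-in for a real data warehouse
cleaned_rows = transform(raw_orders)
load(warehouse, cleaned_rows)

# Because the data is already clean and organized, reporting is a simple query.
daily_revenue = warehouse.execute(
    "SELECT order_date, ROUND(SUM(amount), 2) FROM orders GROUP BY order_date ORDER BY order_date"
).fetchall()
print(daily_revenue)
```

Notice that the cleanup happens before anything is stored. That upfront work is what makes reports fast and consistent later on.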

For example, if you’re leading a team that needs to analyze customer behavior over the past five years, a data warehouse is where this information would be stored. Tools like Amazon Redshift, Google BigQuery, and Snowflake are popular choices for building and managing data warehouses.

What about a Data Lake?

If databases are structured and organized like a neatly arranged filing cabinet, and data warehouses are like a well-cataloged library, then a data lake is more like a vast, open reservoir where all kinds of data flow in.

A data lake can store massive amounts of raw data in its native format, whether it’s structured data from databases, semi-structured data like logs or JSON files, or unstructured data like text, images, and videos. This flexibility makes data lakes ideal for big data analytics, machine learning, and data science projects, where diverse data types are required.

For instance, if your product team is working on a machine learning model to predict customer churn, the raw data — clickstreams, social media mentions, transaction logs — might be stored in a data lake. From there, data scientists can pull in what they need, process it, and run their analyses.
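
Here is a rough sketch of that flow in Python. A temporary local folder stands in for object storage like Amazon S3, and the `ingest` function, the file names, and the sample events are invented for illustration. The point to notice is that data is written exactly as it arrives, and structure is only applied later by whoever reads it.

```python
import json
import tempfile
from pathlib import Path

# A temporary local folder stands in for the lake's object storage.
lake = Path(tempfile.mkdtemp())


def ingest(source, name, payload):
    """Store the payload untouched, grouped by source. No cleaning, no schema."""
    target_dir = lake / source
    target_dir.mkdir(parents=True, exist_ok=True)
    (target_dir / name).write_bytes(payload)


# Different formats land side by side in their native form.
ingest(
    "clickstream",
    "2024-08-01.jsonl",
    b'{"user": "u1", "event": "view"}\n{"user": "u2", "event": "cancel_click"}\n',
)
ingest("transactions", "2024-08-01.csv", b"user,amount\nu1,19.99\n")
ingest("support_tickets", "ticket-17.txt", b"I want to cancel my plan.")


def users_who_clicked_cancel():
    """Read only the clickstream files and decide how to interpret them at read time."""
    users = set()
    for path in (lake / "clickstream").glob("*.jsonl"):
        for line in path.read_text().splitlines():
            event = json.loads(line)
            if event.get("event") == "cancel_click":
                users.add(event["user"])
    return users


churn_signals = users_who_clicked_cancel()
print(churn_signals)
```

Compare this with the warehouse approach: nothing was cleaned or reshaped on the way in, so the processing work shifts to the data scientists who pull the data out.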

Popular platforms for building data lakes include Amazon S3, Azure Data Lake, and Hadoop.

A Quick Comparison:

Data structure
Data Warehouses: Store structured data optimized for analysis
Data Lakes: Store structured, semi-structured, and unstructured data in its raw form

Purpose
Data Warehouses: Used for historical analysis and reporting
Data Lakes: Used for big data analytics and machine learning

Flexibility and scalability
Data Warehouses: More structured and less flexible in terms of data types
Data Lakes: Highly scalable and flexible

When should they be used?

  • Data Warehouses: When your team needs to perform historical data analysis, generate detailed reports, or track long-term trends, a data warehouse is the tool for the job.
  • Data Lakes: If your project involves big data, machine learning, or data that comes in a variety of formats (like logs, videos, or social media feeds), a data lake will provide the flexibility and scalability you need.

In many organizations, databases, data warehouses, and data lakes aren’t mutually exclusive — they work together as part of a comprehensive data strategy. For example, data might be collected and stored in a database for immediate use, then periodically moved to a data warehouse for long-term storage and analysis. At the same time, raw data from various sources could be ingested into a data lake for more complex analytics and machine learning tasks.
