In the recent past I’ve spoken with several different teams who desire to “solve their data problems”. Most teams have a vague sense that there is probably a better way for them to manage their data, and they hear a lot of different solutions thrown at them. Maybe some of you are reading this now!
Here’s my attempt to distill the most frequent problems you might encounter and my understanding of the solutions that you could consider.
Data Lake
Problems Solved:
- My data is scattered amongst several different locations. (Salesforce, Marketo, Stripe, Google Analytics, an application database, mobile backend, spreadsheets, etc.)
A data lake is a database that solves this single problem. It contains copies of all the data from various areas in your business in one database. It maintains the original format of the data. You can write SQL against this data lake and get your data out.
Note that in order to query a data lake, you must understand the structure of each of the data sources, since they’re copied directly into the data lake. And obviously you’ll need to know SQL (or have an engineering team that can craft the SQL for you, or assist you in crafting your SQL).
Data lakes are (typically) not hugely costly to build. It’s simply a matter of copying data from point A to point B.
You’ll (typically) use a SQL runner of some kind to get data out of your data lake. This is a tool that lets you input SQL, run the query, and see the results. Most teams I know that take this route end up dumping the results of their queries into CSV and using Excel or Google Sheets to perform deeper analysis, build pivot tables, construct visualizations for decks, build dashboards for monitoring, etc.
(I must also say that I rarely hear someone speak the work “data lake” in real life. This is simply a quasi-official word in the world of data engineering that has this very specific meaning.)
Data Warehouse
Problems Solved:
- My data is scattered amongst several different locations. (Salesforce, Marketo, Stripe, Google Analytics, an application database, mobile backend, spreadsheets, etc.)
- My data is stored in different formats (CSV, XML, JSON, etc)
- My team doesn’t have the time / appetite to learn SQL in order to get their jobs done on a daily basis
- My team must understand the implementation model of our application to do their jobs
- I often see data that “isn’t quite right”. It’s inaccurate, or missing specific data, or mislabeled, etc.
- My VP of Marketing is learning confusing engineering jargon to figure out what her or his data report means
- Members of different teams at my company us different words to refer to the same business entity (“users”, “members”, “clients”, for example)
- My people are using the same words to mean different things, which causes confusion between teams (e.g. “channel”)
A data warehouse is a carefully designed database that stores data from your businesses various data sources in one place.
Not only has the data been copied from its original source, but it has been combined in a meaningful way to produce a canonical central repository that lets business users get their questions answered. One user is one user is one user. Channels can be paid or organic, or can be more fine-grained.
In a data warehouse, data is stored in a format (usually a Star Schema) that lets business users (non-engineers) access the data without altering their conceptual model of the business they maintain in their heads. (This star schema is specific to the business, almost always denormalized, and formed using techniques of dimensional modeling to provide fast query performance, and fast iteration ability.)
Usually team members access a data warehouse using a BI tool such as Looker or Tableau, which allows for sophisticated data exploration without SQL knowledge. Usually these tools provide for pivoting, visualizations, dashboards, and more.
ETL
ETL stands for Extract, Transform, Load. It’s the mechanism engineering teams use to get the data out of the various data sources and into a well-formed data warehouse that conforms to the star schema for your business.
The extraction step gets data out of the various data sources. It may read SQL, XML, CSV, even binary formats.
The transform step can be simple (reformatting an attribute of a dimension, for example) or it may need to clean, reconcile, deduplicate, pivot, split values into multiple values, perform a calculation based on a value, and/or all of the above and much more. It certainly needs to look up dimensions for a fact table.
The load step simply pushes the data into the data warehouse.
My Opinions
Not all ETL frameworks or services are created equal. Most of the sophisticated frameworks tend to solve huge, gnarly problems but come with congruent learning curves. And some of the lightweight libraries I’ve seen aren’t even “real” ETL. For instance, I recently saw an “ETL” library that did not give you the power to combine rows from various data sources. That’s the primary problem a data warehouse solves. It may be useful functionality (if you’re building a data lake), but it’s not really ETL.
Most companies have the problems that a data warehouse solves. Not every company can afford to invest in a data warehouse, although building one out need not be a massive undertaking. And, in my opinion, most companies don’t even know there is a solution to these problems.
I have seen, and heard of, many teams confuse the problems with the solutions here. My personal approach is to articulate very clearly the list of problems that need to be solved before deciding that a data lake or a data warehouse is necessary. It may be that you can get by using a SQL runner in your application database, or you may benefit immensely from unblocking your nontechnical users getting their data on a daily basis.
The hard parts of the data projects I’ve been a part of have been the dimensional modeling step (What should the schema look like? What are the needs of the company? Who’s the user? Where do we have each piece of data?) and the transform step (performing first and last visit attribution across CRM, marketing automation, and application data, for example). Much of the rest of it, while technically difficult, can be boilerplate-esque scaffold code.
And, generally speaking, the more data you have, and the bigger your company, the more those boilerplate pieces can become “hard problems” themselves. For example, Airflow is an excellent piece of software that takes care of scheduling and sequencing the steps of your ETL. That’s a hard problem only when you reach a certain scale.
ETL and Data Warehousing is not new. It’s been around for decades and is a fairly well-understood problem. In today’s startup culture, I believe we would do well to look backwards at the history of this niche engineering area. There were smart people that came before us, we do not need to expend the cash in our burn rates to painfully re-learn the lessons they learned in 1980.