The Data Source #3 | 💥 Two Trends Shaping the Modern Data Stack
Welcome to The Data Source, your monthly newsletter covering the top innovations in data engineering, analytics, and developer-first tooling.
Subscribe now and never miss an issue 👇🏼
Looking back over the past five or so years in data infrastructure, there has been significant innovation in tooling and engineering best practices, which has contributed to what is known today as the Modern Data Stack.
⏰ The Modern Data Stack, in 1 minute
Modern data warehouses can handle far faster queries and data processing than was once thought possible. In fact, the data infrastructure layer has improved significantly in recent years, during which we’ve seen the rise of Snowflake as the data warehouse built for a cloud-first world. What differentiates Snowflake from legacy warehouses is its multi-cluster, shared-data architecture, in which storage, compute, and services exist as separate layers. This decoupling is important because it means compute and storage can be scaled independently, to near-limitless capacity. That gives data teams greater agility in handling massive workloads and is beneficial from both a performance and a cost standpoint.
Built on top of the major cloud providers, Snowflake pairs a SQL database with a query engine. Data can be ingested from a variety of sources, and anybody who knows SQL can query it directly for their own analytical purposes. In this way, Snowflake created a more coherent way to centralize data into a single source of truth, accessible to all (given the right permissions).
The data ingestion and transformation layers have also improved over time. In fact, it’s the rise of the cloud data warehouse (e.g. Snowflake, BigQuery, Redshift) and the data lake/lakehouse (e.g. Databricks) that spurred the shift from ETL (Extract-Transform-Load) to ELT (Extract-Load-Transform) in data pipelines and introduced a new way to transform data at scale. Owing to the decoupling of storage and compute, data is now loaded in bulk and transformed right within the warehouse. Tools like Fivetran, Stitch Data, Airbyte, and Datalogue, among others, have made it easier to pull data from multiple sources and load it into the warehouse with much greater reliability. Fivetran, for instance, serves analysts with data flowing from disparate sources through fully automated connectors. As for dbt, it is becoming the de facto standard for data modeling, allowing analysts to transform data through a simple SQL interface without having to rely on engineers.
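The ELT pattern can be sketched in a few lines: raw data is loaded untouched, and the transformation step runs inside the warehouse as plain SQL. This is a toy illustration only; SQLite stands in for the cloud warehouse, and all table and column names are hypothetical.

```python
import sqlite3

# SQLite stands in for a cloud warehouse; table/column names are hypothetical.
conn = sqlite3.connect(":memory:")

# Extract + Load: raw records land in the warehouse untransformed.
conn.execute("CREATE TABLE raw_orders (id INTEGER, amount_cents INTEGER, status TEXT)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, 1250, "complete"), (2, 300, "refunded"), (3, 980, "complete")],
)

# Transform: a dbt-style model is just a SELECT materialized inside the warehouse.
conn.execute("""
    CREATE VIEW completed_revenue AS
    SELECT SUM(amount_cents) / 100.0 AS revenue_dollars
    FROM raw_orders
    WHERE status = 'complete'
""")

print(conn.execute("SELECT revenue_dollars FROM completed_revenue").fetchone()[0])  # 22.3
```

The key point is that the "T" happens after the "L": the transformation is a query the warehouse runs over already-loaded raw data, not a step in the pipeline before loading.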
Together, advancements in the warehousing, ingestion, and transformation layers have made the data warehouse one of the most critical components of the data stack. From Snowflake to Fivetran to dbt, these tools have changed the way data and business teams collaborate with one another. And I think that as companies continue centralizing their data processing into the warehouse, new opportunities will emerge for business teams to consume data directly out of it, in truly autonomous fashion.
A lot of the engineering work that usually goes into serving data to business teams will be abstracted away by more robust tooling. Just as dbt opened up the data transformation layer to analysts, I think we are going to see that philosophy carry over into the next phase of the Modern Data Stack, in which data and business teams work with one another more symbiotically.
🌟 Trends shaping the Modern Data Stack
Consuming data out of the data warehouse / lake to power near-real-time business actions
At this point, the distinct layers that make up the Modern Data Stack are fairly consolidated: data extraction and loading (Segment, Fivetran, Google Analytics, Stitch Data, Datalogue, etc.), transformation (dbt), and cloud data warehousing (Snowflake, BigQuery, Redshift).
So, what’s next in the Data Stack?
Building on my earlier point about the prominence of the cloud data warehouse as the single source of truth for data, what I’ve been observing is that as more and more companies centralize their data into the warehouse, the opportunity to leverage that data in real time, and unlock value from it, becomes more apparent.
There are a few ways in which transformed data from the data warehouse / lakehouse is served over to data consumers:
Data from the warehouse:
Business intelligence tools - data is routed to BI tools such as Looker, Tableau, Mode Analytics, etc., where users build out reports, dashboards, and visualizations.
Third-party applications - data is routed to third-party operational apps such as Salesforce, Zendesk, Hubspot, Marketo, etc. where data is leveraged to drive specific business actions, many of which need to be executed in near real-time (e.g. responding to customer service requests on Zendesk and HubSpot, tracking access to sensitive data stored in Salesforce, etc.)
Data from the lake / lakehouse:
Data science applications - data is fed into data science and machine learning apps where data scientists leverage the data for their own reporting and analytics purposes.
An emerging trend I’m observing here is that, with the cloud data warehouse established as the single source of truth for data, full-fledged data apps are being built right on top of the warehouse to make it easy to consume data out of it. Traditionally, data engineers are the ones building one-off connectors to push data from the warehouse to third-party applications. And depending on the specific requirements business teams have for a particular task, they have to ping the engineers to ask for a particular segment of the data to be made available to them.
But what if we could give business teams superpowers to slice and dice their data without constantly having to rely on the engineers? That’s what tools like Census, Hightouch, Grouparoo, Polytomic and Seekwell aspire to do. They empower business teams through self-serve data access, enabling them to query data from the warehouse through a simple SQL interface they are mostly familiar with. As the platform handles all of the heavy lifting around syncing data from the warehouse to the various apps, data engineers have more time to focus on the harder problems in the infrastructure layer.
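The core loop these tools automate can be sketched roughly as: run a SQL-defined segment against the warehouse, then push each resulting row to an operational app's API. This is a simplified, hypothetical sketch; SQLite stands in for the warehouse, and `push_to_crm` stands in for a real connector that would handle batching, retries, and scheduling.

```python
import sqlite3

def sync_segment(conn, segment_sql, push_to_crm):
    """Send every row matched by `segment_sql` to a third-party app (hypothetical)."""
    conn.row_factory = sqlite3.Row
    synced = 0
    for row in conn.execute(segment_sql):
        push_to_crm(dict(row))  # one API call per record in the segment
        synced += 1
    return synced

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (email TEXT, plan TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [("a@x.com", "pro"), ("b@x.com", "free"), ("c@x.com", "pro")])

sent = []  # stands in for the operational app receiving records
n = sync_segment(conn, "SELECT email, plan FROM users WHERE plan = 'pro'", sent.append)
print(n)  # 2
```

The business user's only input is the `SELECT` statement defining the segment; everything after that is plumbing the platform owns.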
I’m excited to see how emerging tools in this space continue to shape the way data is accessed and operationalized at scale, because I do think it’s what will keep data and business teams aligned in their functions.
But there is one caveat I want to call out: while modern data warehouses have significantly improved their batch processing and query performance over time, maintaining low latency is still a pain point. Without low latency, there will always be inconsistencies between the data being updated in the warehouse and the data being fed into third-party applications. This may not matter for BI apps, where reports are often built on historical data, but when you’re relying on data to drive specific business actions, as is the case with operational apps, having access to the latest and greatest version of the data becomes critical. That’s why I think we’re going to see more innovation in stream data processing going forward, which is the next trend I wanted to cover in this post: 👇🏼
Processing streams to power real-time application development & analyses
While data warehouses are traditionally batch-first in their approach to data processing, they are evolving to support real-time analytics use cases. Snowflake has already taken a step in that direction by launching its own version of Change Data Capture (CDC), which tracks changes through “Table Streams.” Essentially, CDC records and tracks changes as they are made to the source system (where data is continuously being processed) so that actions can be taken on the changed data in the target system.
Rather than recomputing everything from scratch, as is the case with batch replication, Snowflake refreshes the result of a SQL query incrementally. This is an important capability, as it helps support analytics use cases with strong low-latency processing requirements.
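A toy version of that idea: instead of rescanning the whole table, a consumer reads only the changes recorded since its last read and updates a running result. The change-log structure below is hypothetical, loosely modeled on how a table stream exposes inserted and deleted rows.

```python
# Hypothetical change log: each entry is (op, amount), op in {"insert", "delete"}.
class ChangeStream:
    def __init__(self):
        self._log = []
        self._offset = 0  # position of the last change this consumer has seen

    def record(self, op, amount):
        self._log.append((op, amount))

    def consume(self):
        # Return only changes made since the previous consume(), then advance.
        changes = self._log[self._offset:]
        self._offset = len(self._log)
        return changes

def apply_changes(total, changes):
    # Incremental refresh of a running SUM: no full-table rescan needed.
    for op, amount in changes:
        total += amount if op == "insert" else -amount
    return total

stream = ChangeStream()
stream.record("insert", 100)
stream.record("insert", 50)
total = apply_changes(0, stream.consume())      # processes both inserts -> 150

stream.record("delete", 50)
total = apply_changes(total, stream.consume())  # processes only the new delete
print(total)  # 100
```

The work done per refresh is proportional to the number of changes, not the size of the table, which is what makes low-latency use cases tractable.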
Most recently, we’ve seen the emergence of Materialize, a streaming SQL data warehouse that lets users query data directly from event streams through a SQL interface. Another tool worth mentioning here is Estuary.dev, which unifies real-time streaming and batch data, making it easy to build real-time applications and view historical data, thus offering the best of both the batch and stream-first worlds.
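Conceptually, a streaming SQL engine keeps a query's result continuously up to date as events arrive, rather than rescanning history on every query. Here is a toy sketch of an incrementally maintained count-per-key view; the event shape is hypothetical, and the real systems maintain arbitrary SQL, not just counters.

```python
from collections import Counter

# Stands in for a materialized view: SELECT page, COUNT(*) ... GROUP BY page
view = Counter()

def on_event(event):
    view[event["page"]] += 1  # O(1) incremental update per arriving event

# Events arrive one at a time; the view is always current, no batch recompute.
for e in [{"page": "/home"}, {"page": "/pricing"}, {"page": "/home"}]:
    on_event(e)

print(view["/home"])  # 2
```

Reading the view is then a cheap lookup against already-computed state, which is what makes serving real-time queries from a stream feasible.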
And I think that as streaming data becomes more mainstream, the warehouse will evolve into a platform that supports a broader range of data sources (both batch and streaming) to enable real-time query and analysis.
And that’s a wrap folks! To all the founders and VCs out there, I’d love to swap notes if this is a space that you’re passionate about. My Twitter DM is open and you can 📩 at email@example.com!