The Data Source #2 | The Metadata Revolution ✊🏽
Welcome to The Data Source, your monthly newsletter covering the top innovations in data engineering, analytics and developer-first tooling.
Subscribe now and never miss an issue 👇🏼
In this edition of The Data Source, I’ll be covering Metadata Management, a topic that’s been dominating the data infrastructure scene over the past year and one that I've been actively researching at Work-Bench.
⏰ Metadata Management in 1 Minute
What is metadata and why should you care about it?
Think about the scale and complexity of the underlying data infrastructure that powers large organizations. On one hand, you have multiple systems producing, storing, and transforming volumes and volumes of data. On the other, you have a massive ecosystem of users constantly manipulating and consuming this data. Given all these different moving pieces, managing your data operations at scale and doing so in a unified and governed way becomes a real pain.
Put simply, metadata is what helps you understand your datasets and the jobs that read from and write to them. It lets you answer questions such as:
What is the schema of this dataset?
Who owns and produces this job?
What is the quality of the data?
What is the provenance of the data?
What’s the business context of this data? And more.
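To make this a bit more concrete, here's a minimal sketch (in Python, with illustrative field names of my own, not any particular tool's schema) of what a single dataset's metadata record might capture:

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List, Optional

@dataclass
class DatasetMetadata:
    """Toy metadata record for one dataset; field names are illustrative only."""
    name: str                                                      # e.g. "warehouse.orders"
    schema: Dict[str, str]                                         # column -> type: "what is the schema?"
    owner: str                                                     # "who owns and produces this job?"
    producing_job: str                                             # the job that writes this dataset
    upstream_datasets: List[str] = field(default_factory=list)     # provenance: which datasets feed this one
    quality_checks: Dict[str, bool] = field(default_factory=dict)  # latest pass/fail result per quality check
    last_updated: Optional[datetime] = None                        # freshness signal
    business_context: str = ""                                     # business description, compliance tags, etc.

# A metadata catalog is then just a collection of these records, keyed by dataset name.
catalog: Dict[str, DatasetMetadata] = {}
```

Each of the questions above maps to a field; the hard engineering work is keeping these records accurate and connected across every system and user that touches the data.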
Get One Level Deeper 👇🏼
Forward-thinking organizations have been building out their internal metadata management practice for years now. While the initial focus may have been on improving the searchability and discoverability of datasets stored in data warehouses, it’s clear that metadata can be applied to use cases beyond the walls of the warehouse, serving different roles and functions across the organization.
It all starts with having a strong framework for extracting metadata into one source of truth: an end-to-end lineage powering use cases including data operability, access control, quality, auditability, and more. To me, this is where some of the most important gaps in the analytics ecosystem seem to be. To put things in perspective, let’s dive into a couple of use cases beyond search and discovery that have taken shape in my conversations with founders building in this space and with Fortune 500 buyers:
Data Governance & Compliance: This is probably one of the most important priorities for the enterprise today given the amount of data they store across multiple environments, on-prem and in the cloud. When it comes to data governance and compliance-preserving initiatives, having a way of saying “this is who owns this data, this is how my data changes over its lifecycle, changing this asset will have this downstream impact, etc.” is powerful. Not only can you monitor every single data entity under your purview, but you can also extend the metadata lineage graph to monitor schema changes and access controls, handle GDPR data deletions, and enforce compliance tags where needed. You can also simulate what a change will do and predict what’s going to happen before it actually happens, which is quite compelling from a pure governance perspective (a toy sketch of this kind of impact analysis follows these use cases).
AI Metadata: A lot of what happens downstream of the data warehouse around AI modeling / training, feature engineering, metrics monitoring, and AI experimentation mirrors upstream operations. What I mean is that downstream ops generate tons of metadata that needs to be stored and manipulated in much the same way as for upstream ops. Doing so helps standardize key metrics, makes AI models and features reproducible, and provides good governance over AI experiments. As such, one interesting use case in metadata management is storing AI metadata for the data science and analytics teams running these downstream workflows.
Data Pipeline Observability & Quality: The application of metadata to data observability is pretty straightforward. Tools in this category extract lineage from metadata to uncover the operational dependencies between data entities. This matters because it lets you version your data as it changes over its lifecycle, so that if a pipeline ever breaks, it’s easy to refer back to the lineage graph and point to the source of the problem. This is especially important for backfilling data, debugging the data pipeline and, as described below, ensuring the continuous quality and reliability of your data: 👇🏼
What’s data quality? There are a few dimensions to the problem that affect the quality and reliability of your data as it flows across your pipelines from source to destination: data integrity (do you know where the data is coming from, and can you trust it?), data freshness (how often is the job running, and is the data coming through in a timely fashion?), and data accuracy and completeness (is the data missing any important entries, and are there any inconsistencies in its formats?). One use case that has emerged in this area is continuously observing the health of your data pipeline, capturing any deviations from the norm, pointing to anomalies affecting the data, and remediating them.
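To ground the lineage and quality points above, here's a minimal sketch building on the hypothetical DatasetMetadata record from earlier: one helper walks the lineage graph to find everything downstream of a changed dataset (the impact-analysis angle from the governance use case), and another flags stale datasets (a basic freshness check). The names and thresholds are my own assumptions for illustration, not how any of the tools listed below actually work:

```python
from datetime import datetime, timedelta
from typing import Dict, List, Set

# Assumes the DatasetMetadata dataclass and `catalog` sketched earlier are in scope.

def downstream_impact(catalog: Dict[str, "DatasetMetadata"], changed: str) -> Set[str]:
    """Walk the lineage graph to find every dataset that (transitively) reads from `changed`."""
    impacted: Set[str] = set()
    frontier = [changed]
    while frontier:
        current = frontier.pop()
        for name, meta in catalog.items():
            if current in meta.upstream_datasets and name not in impacted:
                impacted.add(name)
                frontier.append(name)
    return impacted

def stale_datasets(catalog: Dict[str, "DatasetMetadata"], max_age: timedelta) -> List[str]:
    """Flag datasets whose last update is older than `max_age` -- a simple freshness check."""
    now = datetime.utcnow()
    return [
        name for name, meta in catalog.items()
        if meta.last_updated is None or now - meta.last_updated > max_age
    ]

# Usage: before altering "warehouse.orders", see what breaks downstream,
# and alert on anything that hasn't landed in the last 24 hours.
# impacted = downstream_impact(catalog, "warehouse.orders")
# stale = stale_datasets(catalog, timedelta(hours=24))
```

Real tools obviously do far more than this, but the core idea is the same: once metadata and lineage live in one place, impact analysis, freshness alerts, and compliance checks become queries over that graph.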
Now, if you look at the current state of the market, this space is starting to feel increasingly crowded, with different players* coming at it from different angles. Although it’s still early days, given the projects some of these tools have emerged from, I think they will converge on a similar range of problems, even if their approaches end up being different.
*(DataHub / Metaphor Data, Transform Data, Marquez / Datakin, Amundsen / Stemma, Solidatus, Bigeye, Soda, Atlan, Tree Schema, Monte Carlo Data and more.)
☸️ GTM Status Quo
Early-stage companies in the metadata management space compete for mindshare with a lot of other data companies, especially as the market becomes increasingly crowded. While it’s too early to say who’s going to break out successfully, identifying the right wedge into the market and nailing early GTM will be critical for these startups to go up-market. Some learnings below:
Targeting the enterprise vs. mid-market / SMB. Should you start selling to the enterprise right off the bat, or should you start lower and go after SMBs and the mid-market? Figuring out the right wedge is incredibly important here, as it can either accelerate your GTM motion or hold you back.
Enterprises today are still in the early innings of their data management initiatives. In fact, many data engineers are still figuring out how to do the most basic things around transforming data in their pipelines. Many organizations don’t yet have a data platform mature enough to invest in this newer crop of tools. And what I find in my interactions with many Fortune 500 companies is that while there is recognition of the problem, namely the complexity of managing data at scale, there is no concrete consensus on what the solution should look like.
Defining urgency in a category that is still pretty nascent is key to winning early adopters. Tools emerging out of this category are brand-new both in philosophy and in their approach to tackling the bottlenecks in DataOps. What this means is that the market requires a bit of education around why this tool is needed now more than ever.
What ultimately matters is:
How you sell the story: What is the value proposition? What are the killer use cases that will make people want to pay for this? Why is this tool mission critical?
How you tie it back to the organization’s internal business initiatives: What is the breadth of this tooling, and what other data initiatives can it help the organization keep up with?
Once you’ve identified and won those early adopters, the next step is to really use this opportunity to keep iterating on the product through feedback and get the enterprise GTM flywheel going.
Selling data management tools is very much a partnership between the data engineering, data science/analytics, and IT orgs. If you look back at the use cases I mention above, you’ll see that they touch different teams and user personas sitting across engineering, data science, and IT. What this means is that you’ll need to work on targeted pitches for each of these teams. And know that while data science might not always have purchasing power, you’ll still need buy-in from their key stakeholders for the deal to get approved. On the other hand, depending on the org, IT may be the one mandating tools, in which case they are the ones to win over. So what you should be going for is a pitch that clearly quantifies ROI for each of these teams and that all of them can agree on.
What does it mean to build an enterprise-grade tool? It’s a question I find myself asking a lot -- when it comes to data management tools, how does the enterprise want the tool implemented, and what are the key features they care about? Sadly, I find that not all startups enter these Fortune 500 conversations truly caring about what the enterprise needs. Rather than taking the time to understand their business initiatives and infrastructure make-up, many tend to be strongly opinionated about how they are going to commercialize the product. For example, some build SaaS-only platforms and haven’t done much thinking around how they could integrate with the customer’s environment or be deployed on-prem.
And that’s a wrap folks! To all the founders and VCs out there, I’d love to swap notes if this is a space you’re passionate about. As always, my Twitter DM is open as is my 📩 (priyanka@work-bench.com), so please reach out!
/Priyanka