The Data Source #2 | The Metadata Revolution
Welcome to The Data Source, your monthly newsletter covering the top innovations in data engineering, analytics, and developer-first tooling.
Subscribe now and never miss an issue 👇🏼
In this edition of The Data Source, I'll be covering Metadata Management, a topic that's been dominating the data infrastructure scene over the past year and one that I've been actively researching at Work-Bench.
⏰ Metadata Management in 1 Minute
What is metadata and why should you care about it?
Think about the scale and complexity of the underlying data infrastructure that powers large organizations. On one hand, you have multiple systems producing, storing, and transforming volumes and volumes of data. On the other, you have a massive ecosystem of users constantly manipulating and consuming this data. Given all these different moving pieces, managing your data operations at scale and doing so in a unified and governed way becomes a real pain.
Put simply, metadata is what helps you understand your datasets and the jobs that read from and write to them. It lets you answer questions such as the ones below (a code sketch follows the list):
What is the schema of this dataset?
Who owns and produces this job?
What is the quality of the data?
What is the provenance of the data?
What's the business context of this data, and more.
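To make these questions concrete, here's a minimal sketch in Python of the kind of metadata record that would answer them for a single dataset. The names and fields are purely hypothetical, not a spec from any particular tool:

```python
from dataclasses import dataclass

# A minimal, hypothetical metadata record for one dataset.
@dataclass
class DatasetMetadata:
    name: str                # e.g. "warehouse.orders"
    schema: dict             # column name -> type (what is the schema?)
    owner: str               # who owns and produces this data
    producing_job: str       # the job that writes this dataset
    upstream_datasets: list  # provenance: where the data came from
    quality_checks: list     # checks the data must pass
    business_context: str    # what this data means to the business

orders = DatasetMetadata(
    name="warehouse.orders",
    schema={"order_id": "BIGINT", "customer_id": "BIGINT", "total": "DECIMAL(10,2)"},
    owner="data-platform-team",
    producing_job="etl.load_orders",
    upstream_datasets=["raw.orders_events"],
    quality_checks=["no_null_order_id", "freshness_under_1h"],
    business_context="One row per confirmed customer order.",
)
```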
Get One Level Deeper 👇🏼
Forward-thinking organizations have been building out their internal metadata management practices for years now. While the initial focus of this initiative may have started around improving the searchability and discoverability of datasets stored in data warehouses, it's clear that it can be applied to use cases beyond the walls of the warehouse, serving different roles and functions across the organization.
It all starts with having a strong framework for extracting metadata into one source of truth: an end-to-end lineage graph powering use cases including data operability, access control, quality, auditability, and more. To me, this is where some of the most important gaps in the analytics ecosystem seem to be. To put things in perspective, let's dive into a couple of these use cases that go beyond search and discovery, drawn from my conversations with founders building in this space and with Fortune 500 buyers:
Data Governance & Compliance: This is probably one of the most important priorities for the enterprise today given the amount of data stored across multiple environments, on-prem and in the cloud. When it comes to data governance and compliance initiatives, having a way of saying "this is who owns this data, this is how my data changes over its lifecycle, changing this asset will have this downstream impact, etc." is powerful. Not only can you monitor every single data entity under your purview, but you can also extend the metadata lineage graph to monitor schema changes, access controls, and GDPR data deletions, and to enforce compliance tags where needed. You can also simulate a change and predict what's going to happen before it actually happens, which is quite compelling from a pure governance perspective.
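As an illustration of that last point, here is a small, hypothetical sketch (plain Python, not any vendor's API) of how a lineage graph lets you predict the downstream impact of a change before it ships; the dataset names are invented:

```python
from collections import deque

# Hypothetical lineage graph: each dataset maps to its direct consumers.
lineage = {
    "raw.orders_events": ["warehouse.orders"],
    "warehouse.orders": ["marts.daily_revenue", "ml.churn_features"],
    "marts.daily_revenue": ["dashboards.exec_kpis"],
}

def downstream_impact(graph, asset):
    """Breadth-first walk of everything that transitively reads `asset`."""
    impacted, queue = set(), deque(graph.get(asset, []))
    while queue:
        node = queue.popleft()
        if node not in impacted:
            impacted.add(node)
            queue.extend(graph.get(node, []))
    return impacted

# Simulate a schema change to warehouse.orders before it actually happens:
print(sorted(downstream_impact(lineage, "warehouse.orders")))
# ['dashboards.exec_kpis', 'marts.daily_revenue', 'ml.churn_features']
```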
AI Metadata: A lot of what happens downstream of the data warehouse around AI modeling / training, feature engineering, metrics monitoring, and AI experimentation mirrors upstream operations. What I mean by that is that downstream ops generate tons of metadata that needs to be stored and manipulated in much the same way it is for upstream ops. This helps create better standardization across key metrics, reproducibility of AI models and features, and good governance over AI experiments. As such, one interesting use case in metadata management is around storing AI metadata for the data science and analytics teams running these downstream workflows.
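To illustrate the parallel, here's a hypothetical sketch of recording a training run's metadata the way you would record an upstream job's; the function and field names are mine, not from any specific tool:

```python
import hashlib
import json
import time

def record_training_run(model_name, params, train_dataset, metrics):
    """Capture a training run as a governable metadata record."""
    run = {
        "model": model_name,
        "params": params,                # hyperparameters, for reproducibility
        "train_dataset": train_dataset,  # lineage back to the upstream data
        "metrics": metrics,              # standardized key metrics
        "recorded_at": time.time(),
    }
    # A content hash turns the run itself into an addressable, auditable asset.
    payload = json.dumps(run, sort_keys=True).encode()
    run["run_id"] = hashlib.sha256(payload).hexdigest()[:12]
    return run

run = record_training_run(
    model_name="churn_classifier",
    params={"max_depth": 6, "learning_rate": 0.1},
    train_dataset="ml.churn_features",
    metrics={"auc": 0.87},
)
print(run["run_id"], "->", run["train_dataset"])
```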
Data Pipeline Observability & Quality: The application of metadata to data observability is pretty straightforward. Tools in this category extract metadata and lineage to help uncover the operational dependencies between multiple data entities for analysis. This is important because it lets you create versions of your data as it changes over its lifecycle, so that if the pipeline ever breaks, it's easy to refer back to the lineage graph and point to the source of the problem. This is especially important for backfilling data, debugging the data pipeline, and, as described below, ensuring continuous quality and reliability of your data 👇🏼
What's data quality? A few dimensions of the problem affect the quality and reliability of your data as it flows across your pipelines from source to destination:
Data integrity: do you know where the data is coming from, and can you trust it?
Data freshness: how often is the job running, and is the data coming through in a timely fashion?
Data accuracy and completeness: is the data missing any important entries? Are there any inconsistencies in its formats?
One use case that has emerged in this area is around continuously observing the health of your data pipeline, capturing any deviations from the norm, pointing to anomalies affecting the data, and remediating them.
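As a rough sketch of what these dimensions can look like in practice, here are the three checks expressed as simple Python functions; the field names, sources, and thresholds are hypothetical:

```python
from datetime import datetime, timedelta, timezone

def check_freshness(last_loaded_at, max_age=timedelta(hours=1)):
    """Freshness: did the job land data recently enough?"""
    return datetime.now(timezone.utc) - last_loaded_at <= max_age

def check_completeness(rows, required_fields=("order_id", "total")):
    """Completeness: are any important entries missing?"""
    return all(row.get(f) is not None for row in rows for f in required_fields)

def check_integrity(rows, expected_source="raw.orders_events"):
    """Integrity: do the rows carry a trusted provenance tag?"""
    return all(row.get("_source") == expected_source for row in rows)

# A continuous observability tool would run checks like these on every load,
# flag deviations from the norm, and point to the anomaly to remediate.
rows = [{"order_id": 1, "total": 9.99, "_source": "raw.orders_events"}]
loaded_at = datetime.now(timezone.utc) - timedelta(minutes=10)
print(check_freshness(loaded_at), check_completeness(rows), check_integrity(rows))
```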
Now, if you look at the current state of the market, this space is starting to feel increasingly crowded, with different players* coming at it from different angles. Although it's still early days, given the projects some of these tools emerged from, I think they will tend to encompass a similar range of problems, even if their approaches end up being different.
*(DataHub / Metaphor Data, Transform Data, Marquez / Datakin, Amundsen / Stemma, Solidatus, Bigeye, Soda, Atlan, Tree Schema, Monte Carlo Data and more.)
⏸️ GTM Status Quo
Early stage companies in the metadata management space compete for mindshare with many other data companies, especially as the market becomes increasingly crowded. While it's too early to say who's going to break out successfully, identifying the right wedge into the market and nailing early GTM will be critical for these startups to go up-market. Some learnings below:
Targeting the enterprise vs. mid-market / SMB. Should you start selling to the enterprise right off the bat, or should you start lower and go after SMBs and the mid-market? Figuring out the right wedge is incredibly important here, as it can either accelerate your GTM motion or hold you back.
Enterprises today are still in the early innings of their data management initiatives. In fact, many data engineers are still figuring out how to do the most basic things around transforming data in their pipelines. Many organizations don't even have a mature data platform on which to invest in this newer crop of tools. And what I find in my interactions with many Fortune 500 companies is that while there is recognition of the problem around the complexity of managing data at scale, there seems to be no concrete consensus on what the solution should look like.
Defining urgency in a category that is still pretty nascent is key to winning early adopters. Tools emerging out of this category are brand-new both in philosophy and in their approach to tackling the bottlenecks in DataOps. What this means is that the market requires a bit of education around why this tool is needed now more than ever.
What ultimately matters is:
How you sell the story: What is the value proposition? What are the killer use cases that will make people want to pay for this? Why is this tool mission critical?
How you tie it back to the organization's internal business initiatives: What is the breadth of this tooling, and what other data initiatives can it help the organization keep up with?
Once you've identified and won those early adopters, the next step is to really use this opportunity to keep iterating on the product through feedback and get the enterprise GTM flywheel going.
Selling data management tools is very much a partnership across the data engineering, data science/analytics, and IT orgs. If you look back at the use cases I mention above, you'll see that they touch different teams and user personas sitting across engineering, data science, and IT. What this means is that you'll need to work on targeted pitches for each of these teams. And know that while data science might not always have purchasing power, you'll still need buy-in from their key stakeholders for the deal to get approved. On the other hand, depending on the org, IT may be the one mandating tools, and therefore the one to win over. So working on a pitch that clearly quantifies ROI for each of these teams, and on which all can unanimously agree, is what you should be going for.
What does it mean to build an enterprise-grade tool? It's a question that I find myself asking a lot: when it comes to data management tools, how does the enterprise want the tool implemented, and what are the key features they care about? Sadly, I find that not all startups enter these Fortune 500 conversations truly caring about what the enterprise needs. Rather than taking the time to understand their business initiatives and infrastructure makeup, many tend to be strongly opinionated about how they are going to commercialize the product. For example, some build SaaS-only platforms and haven't done much thinking around how they could integrate with the customer's environment or run on-prem.
And that's a wrap, folks! To all the founders and VCs out there, I'd love to swap notes if this is a space you're passionate about. As always, my Twitter DM is open, as is my 📩 (priyanka@work-bench.com), so please reach out!
/Priyanka