The Data Source #4 | ☁️ Cloud-Native Observability & Opportunity Ahead
Welcome to The Data Source, your monthly newsletter covering the top innovations in data infrastructure, engineering, and developer-first tooling.
Lately, I’ve been thinking a lot about how modern SRE teams operate in their day-to-day and the types of tools they need to be more efficient at their jobs.
SREs are tasked with keeping systems resilient and maintaining high availability and performance of services, among other things. As we’ve seen with our portfolio company FireHydrant, which helps SREs create consistency across the entire incident response lifecycle, the expectation of a consistent and seamless digital experience has made uptime, functionality, and performance more important than ever. To meet that expectation, SREs need tools that help them understand how a system actually works so they can better identify and solve performance bottlenecks.
Today, it’s quite common for SREs to leverage monitoring tools like Datadog, Splunk, and Dynatrace that watch how the system behaves using predefined sets of metrics and logs and alert on known problems. While these tools are great at helping you understand, at a high level, what’s broken and why, it’s equally important to be able to capture unknown issues that might affect your services. To help with this, SREs leverage observability tools that let them understand how their systems behave from their external outputs. Together, monitoring and observability tools are complementary products in the modern SRE toolkit.
An Overview of Cloud-Native Observability
The concept of observability has traditionally been based on three pillars: metrics, logs, and traces. Together, these help bring better visibility into the behaviors of distributed systems. At a high level, they can be described as follows:
Metrics, i.e., numeric measurements calculated over time that are well suited to expressing what is happening in the system (e.g., % of CPU used, % of memory available) and to triggering alerts based on the health of the system.
Tools: Dynatrace, AppDynamics, Datadog, Splunk, New Relic, Instana
Event Logs, i.e., structured or unstructured text data that a system emits when a block of code is executed. These are immutable, timestamped records, useful for capturing information that developers can use to retroactively inspect their code.
Tools: LogDNA, Logz.io, Scalyr, Humio
Traces, i.e., data that expresses chains of events and interdependencies between different components in distributed systems in order to monitor and debug complex software architectures such as microservices.
Tools: Lightstep, Epsagon, Jaeger, Zipkin
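To make the three pillars concrete, here is a minimal sketch in plain Python of what each signal might look like when emitted by a single request. Everything in it is an assumption for illustration: the in-memory sinks (METRICS, LOGS, SPANS), the helper names, and the "checkout" workload are hypothetical and do not reflect the API of any tool listed above.

```python
import json
import time
import uuid
from contextlib import contextmanager

# Hypothetical in-memory sinks; a real system would ship these to a backend.
METRICS = []  # (timestamp, name, value) samples
LOGS = []     # JSON-encoded, timestamped event records
SPANS = []    # finished trace spans


def record_metric(name, value):
    """Metric: a numeric measurement tagged with the time it was taken."""
    METRICS.append((time.time(), name, float(value)))


def log_event(message, **fields):
    """Event log: a timestamped record, serialized once and never mutated."""
    LOGS.append(json.dumps({"ts": time.time(), "msg": message, **fields}))


@contextmanager
def span(name, trace_id=None, parent_id=None):
    """Trace span: one timed step, linked to its parent by ID."""
    record = {
        "trace_id": trace_id or uuid.uuid4().hex,
        "span_id": uuid.uuid4().hex,
        "parent_id": parent_id,
        "name": name,
        "start": time.time(),
    }
    try:
        yield record
    finally:
        # Spans are stored when they finish, so children land before parents.
        record["end"] = time.time()
        SPANS.append(record)


def handle_checkout():
    """Hypothetical request handler that emits all three signals."""
    with span("checkout") as root:
        log_event("checkout started", trace_id=root["trace_id"])
        with span("charge-card", root["trace_id"], root["span_id"]) as child:
            record_metric("payment.latency_ms", 12.5)
        return root, child
```

The shared trace_id is what ties the signals together: the log line and both spans carry the same ID, so an engineer can move from an alerting metric to the logs and the call chain for one specific request.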
Over time, many of the tools that originally started as infrastructure monitoring products have expanded their coverage and moved into the broader observability space. In 2018, for example, Datadog became one of the first monitoring tools to package the “three pillars” of observability into its platform, offering APM, log management, and distributed tracing capabilities. We’ve seen similar trends at companies such as Dynatrace, New Relic, and others that are taking a similar wholesale approach (i.e., one platform that does it all) to better cater to customer needs around improved observability.
But we’ve also seen the rise of tools that target a specific domain within the broader context of observability and focus on executing that really well. Advancements in distributed tracing, automated code analysis, and debugging technology have, for the most part, normalized the practice of testing in production in a way that doesn’t impact or slow down applications. In the past, it was only possible to run automated tests in development, as it was the safest thing to do (there was always the fear of something breaking and severely impacting production). But those tests would always fall short because the development environment is not an accurate representation of the production environment. Tools in this category help investigate code behavior, surface hard-to-find bugs, and debug issues in production without impacting end users or slowing down operations. Companies of note are Rookout, Honeycomb, and Lightrun.
As cloud infrastructure continues to grow in complexity, we are seeing systems engineers need more granular performance data beyond metrics, logs, and traces. This is where I think application profiling might be useful. Google has been running its own profiling system for over a decade now, but new entrants into the market are helping teams run these tools continuously and at scale. Unlike metrics, which alert you when something is wrong, continuous application profilers provide systems engineers with actionable insights by tracking the computational hotspots causing those issues down to the line of code. With continuous profiling, one can analyze application performance data (e.g., CPU usage, allocated memory, resource utilization) through a series of profiles captured over time. Companies I’m tracking include Opsian, Pyroscope, Optimyze, and PolarSignals.
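As a rough illustration of how a profiler can attribute CPU time to lines of code, here is a minimal sampling sketch in Python. It is not how any of the companies above implement their products: the SamplingProfiler class, the busy_loop workload, and the 5 ms interval are all hypothetical, and it relies on sys._current_frames(), a CPython function documented for specialized use. A continuous profiler would repeat this kind of sampling window indefinitely and ship each window's counts to a backend for comparison over time.

```python
import collections
import sys
import threading
import time


class SamplingProfiler:
    """Periodically samples one thread's stack and counts the lines it sees."""

    def __init__(self, thread_id, interval=0.005):
        self.thread_id = thread_id
        self.interval = interval  # seconds between samples (assumed value)
        self.samples = collections.Counter()
        self._stop = threading.Event()
        self._worker = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        while not self._stop.is_set():
            # Look up the frame the target thread is executing right now.
            frame = sys._current_frames().get(self.thread_id)
            if frame is not None:
                self.samples[(frame.f_code.co_name, frame.f_lineno)] += 1
            time.sleep(self.interval)

    def __enter__(self):
        self._worker.start()
        return self

    def __exit__(self, *exc):
        self._stop.set()
        self._worker.join()

    def hotspots(self, n=3):
        """Return the n most frequently sampled (function, line) pairs."""
        return self.samples.most_common(n)


def busy_loop(seconds):
    """Hypothetical CPU-bound workload to profile."""
    deadline = time.monotonic() + seconds
    total = 0
    while time.monotonic() < deadline:
        for i in range(200):
            total += i * i
    return total


if __name__ == "__main__":
    with SamplingProfiler(threading.get_ident()) as profiler:
        busy_loop(0.3)
    for (function, line), count in profiler.hotspots():
        print(f"{function}:{line} sampled {count} times")
```

Because the profiler only samples, its overhead stays roughly constant regardless of how much work the application does, which is the property that makes running it in production plausible.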
Interesting Projects (Application Profiling) 💥
Further Reading 📚
Continuous Profiling in Production: What, Why and How (Presentation)
I’m excited to watch the application profiling ecosystem as it continues to evolve and shape the future of observability. If you or someone you know is tackling continuous application profiling, please reach out! My Twitter DMs are open, and you can 📩 me at email@example.com!
Work-Bench is an early-stage, enterprise software-focused VC firm based in NYC. Our sweet spot for investment is the Seed II stage, which correlates with building out a startup’s early go-to-market motions. In the data world, we’ve been fortunate to invest in companies like Cockroach Labs, Arthur, Algorithmia, Datalogue, Alkymi, x.ai, and others.