Interview Transcript

Disclaimer: This interview is for informational purposes only and should not be relied upon as a basis for investment decisions. In Practise is an independent publisher, and all opinions expressed by guests are their own opinions, and do not reflect the opinion of In Practise.

Observability, it’s a big word. Can you define observability or, at least, tell us how you would define it?

The way I define observability is that it is the capability of a team to interact with its systems and understand them in real time, including on questions you didn't anticipate needing to ask until you ask them. That's the theory of what observability is. Then in practice, what we use to accomplish observability is a combination of telemetry signals – tracing, logging, metrics, profiling – together with a storage engine to query them, as well as the ability of humans to interact with that system.

That’s how we achieve observability in practice: it is a combination of things. People often believe that observability is about the signals. The signals are important, but it's what you do with them that matters.

Taking a step back, how did this all work when everything was on-prem?

I think it's hard to disentangle the history of how we got here from what the practices are today, because today you can achieve observability regardless of where applications are located. What matters is instrumenting your applications so that they produce the relevant telemetry. There's no difference today between the observability I would do in the cloud versus on-prem, with the minor exception of which host-level metrics you keep to help you out if there is a problem with an individual host.

In the olden days, we struggled to achieve observability by doing log searches. You would scrape logs from individual hosts, and then you would try to make sense of what was happening with the tools you had at the time. For on-prem 10 years ago, that was logging.

Were all these different products that the big players offer all separate? Did you have to run them separately and then try to find the problem?

Yes. The way we got to where we are today is that all the legacy players come from their individual positions of strength. Splunk started as a logging-first company, and indeed, they were innovative when they introduced the idea that you don't need to SSH into individual hosts and grep through the logs; Splunk will collect them and centrally index them. That was revolutionary at the time, and since then, they have kept themselves relevant by saying they are now on board with this observability thing, contributing to open-source projects and developing an APM tool through a combination of building and buying. They acquired Omnition.

Basically, Splunk came from the logs world, Datadog came from the metrics world, and then you have providers like Dynatrace or AppDynamics coming from the APM world – all different, interesting routes towards developing an observability solution for modern systems as we understand them today.

Does it matter which product, or where you come from, in determining who has an advantage today?

My view, and certainly our view at Honeycomb – we have our biases – is that starting with a clean slate, and treating all the various signal types as special cases of wide events – key-value pairs that you index as events – is the universal approach. It doesn't require a bias towards one or other of the signal types; they all end up being wide events.

In contrast, I feel that if your company's bread and butter is logs, or your company's bread and butter is metrics, it can hinder you in some respects, because there is a danger of cannibalizing your previous business. If the new modern thing is not high-volume log indexing, has a heritage of high-volume log indexing biased the way your company thinks about the problem?

I don’t know. I think it's an interesting set of circumstances, and an interesting thing for people to explore and look at. At the end of the day, people should choose vendors based on what capabilities they offer and how their software developers can interact with the system. So, in circumstances where you're used to a logging solution and that logging solution pivots to observability, maybe that's the best thing for you.

Going back to that earlier question about approaches, I think there is lots of interesting stuff to learn from companies, like my employer Honeycomb, or from our competitors like Lightstep, like Aspecto, these newer generations of tools that are not biased by having a legacy install base of a very traditional product.

For us non-techies, can you explain how a DevOps engineer, or someone else, receives telemetry from a cloud application?

When you have application-level telemetry, it comes from inserting a set of libraries into that application. Whether that's done at compile time or via an agent that installs itself, the mechanism is that you hook into the application to collect information on executions and on incoming requests.

Let’s suppose you have a service that hypothetically serves a request to Twitter.com; you’re browsing Twitter, you hit Twitter.com, and it asks for a list of the top 20 recent posts. That will go to one of the back-end services, and in that back-end service, it basically boils down to starting and stopping a stopwatch. It will literally measure the time the request came in and some properties of the request – which user was logged in, IP address, which browser, etc. – and start the timer. When the request is finished, it stops the timer and says, ‘Okay, here’s all the information I know about this request, now I’m going to forward it onward.’ That’s where the data comes from. Then there are some additional layers to this: if that initial service can’t answer the question, it might call a bunch of other services. This is where we get the micro-service design patterns, where each service does one thing, and you have fleets of micro-services that all need to work together.

You might have not just one start-stop timer, but five, or a dozen, or maybe even 100 that are started and stopped to tell you where the time goes while evaluating that little flick of your thumb as you browse through Twitter. Those services, either through having an agent automatically measure that, or by having integrated a software library to do so, will send the information on when the request started and stopped, with all those properties, to a collection back-end. It will then process that and make it available for a developer at Twitter, for instance, to understand what's going on with this collection of requests, what's going on with the behavior, to see if there are any commonalities, and otherwise start trying to dive into some of those unknown questions we might have about the performance of the software.
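To make that concrete – this is an editorial illustration rather than the interviewee's code, and the service, span, and attribute names are hypothetical – here is roughly what that start-stop stopwatch looks like with the OpenTelemetry Python API:

```python
# Sketch of the "stopwatch" idea with the OpenTelemetry Python API.
# Service, span, and attribute names are hypothetical.
from opentelemetry import trace

tracer = trace.get_tracer("timeline-service")

def handle_request(user_id: str, browser: str):
    # Entering the span starts the timer and records request properties.
    with tracer.start_as_current_span("GET /timeline") as span:
        span.set_attribute("user.id", user_id)
        span.set_attribute("http.user_agent", browser)
        return fetch_recent_posts(limit=20)

def fetch_recent_posts(limit: int):
    # A nested timer: this span appears as a child of the request span.
    with tracer.start_as_current_span("fetch_recent_posts") as span:
        span.set_attribute("posts.limit", limit)
        return ["..."] * limit
```

Exiting each `with` block stops the corresponding timer, and the finished spans, with all their key-value properties, are what get forwarded to the collection back-end.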

It must have been so much more complicated, going from physical hardware to cloud instances to now, containers and serverless micro-services.

Yes. As you said, the containers and micro-services piece is the most interesting part; I think that's where the complexity really exploded. When we went from on-prem to individual large, beefy virtual machines running the monolith, we could still, to some extent, use the previous generation of tooling – logging, or single-host APM – to resolve questions about what was happening. However, when we as software developers made an industry-wide design choice to split our services, to allow teams to move independently and to ship and deliver software on independent schedules and individual components, that helped individual teams move more quickly, but it added complexity: the software is no longer located in one place, and it is no longer one process. Instead of sitting on-prem or on a single virtual machine, it is now scattered across multiple containers, across container orchestration, and potentially across many hundreds of hosts.

When looking at the market size and growth, what percentage of the application or infrastructure stack is monitored today? What would your estimate be for the biggest enterprises?

I think here is where we get into the challenge of terminology. As I said at the beginning, the definition of observability is contested in some ways. I adhere to the view that it's about the squishy notion of: can a team evaluate unknown unknowns? However, that doesn't map as neatly to technology choices as it does to the practices engineering teams follow, supported by their tooling. So, if you ask what percentage of applications are monitored by enterprises, the answer is 100%. 100% of applications are monitored with some degree of either modern observability or metrics or logs.

When you start asking more specific questions – like what percentage of companies are using distributed tracing effectively – then I would say that is in its early days. We’re talking about something like 15% to 20% of companies effectively leveraging distributed tracing. It's not nearly the percentage we hope the market will grow to. I think the 80% of companies not yet leveraging distributed tracing fall into two groups: either they’re not mature enough – not necessarily at the point where they could benefit from distributed tracing, and maybe logging and metrics can resolve their problems – or they are already starting to see some of these challenges and struggling with them, but they haven't yet adopted the methodology and tooling they need.

That's where we see and anticipate growth in the observability market over the next five years, moving from 20% of the market using modern observability tooling, to 40% or 60%.

What, in your mind, are the biggest limitations to market growth?

I think the main limitation to growth in the observability space is that it has historically been challenging to get traction with observability efforts; it has historically been all or nothing. There was this idea that to benefit from observability tooling, you had to roll out distributed tracing everywhere, to all your services. What we're starting to see is that individual companies can be successful at the team level by adopting, not necessarily distributed tracing, but single-request, single-service tracing – being able to drill down into attributes of that service, or into the execution of a request within that individual service – without necessarily needing to boil the entire ocean.

I think that five years ago, distributed tracing was hard to get started with, and it was hard to get value from. Many people were burned by the earlier implementations of distributed tracing. Today, it's possible to realize value earlier, and one of the efforts I'm sure we'll talk about in a little bit is the convergence on vendor-neutral solutions instead of vendor lock-in. Those are barriers that we’re in the process of dismantling, to ensure that people feel confident in investing in observability through tracing.

When you say, ‘distributed tracing,’ how would you define that for us non-engineers?

I think it goes back to what I was just describing: swiping your thumb on the Twitter app dispatches a request to Twitter, which could cause 100 different services, scattered across 1,000 machines at Twitter, to do something on your behalf. Distributed tracing aims to put together the picture of what caused what to happen – that causal tree.
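The "distributed" part is stitching spans from different services into one tree, which works by passing trace context across service boundaries. As an illustration only – the service names and URL are hypothetical – this is roughly how that looks with OpenTelemetry's propagation API in Python:

```python
# Sketch of trace-context propagation between two services, so their spans
# join into one causal tree. Service names and the URL are hypothetical.
import requests
from opentelemetry import trace
from opentelemetry.propagate import inject, extract

tracer = trace.get_tracer("frontend")

def call_downstream():
    with tracer.start_as_current_span("GET /timeline"):
        headers = {}
        inject(headers)  # adds a W3C traceparent header carrying the trace ID
        requests.get("http://posts-service/recent", headers=headers)

# In the downstream service, the handler resumes the same trace:
def handle_recent(request_headers: dict):
    ctx = extract(request_headers)
    with tracer.start_as_current_span("list_recent_posts", context=ctx):
        ...  # this span becomes a child node in the same causal tree
```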

All those players offer that? Datadog, Splunk, New Relic – they offer some sort of tracing service?

That's correct. As we said, it boils down to someone's perspective, where they're coming at it from. You'll see that players like Dynatrace and AppDynamics have adapted existing single-host APM solutions to now be distributed. You'll see players like Splunk try to connect log lines with trace context IDs and marry that with technology they have acquired. Basically, all these major players have tracing and observability offerings today, while also having legacy offerings, which in some cases are distinct from observability and in other cases are part of their observability suite.

Would you say Datadog has any legacy offerings, given they were built on the cloud?

Datadog was built on the cloud, but Datadog was built as a metrics-first product. They were oriented solely around infrastructure metrics collection. Then they pivoted and said, you can have application metrics too. The problem for Datadog’s users – and the way Datadog profits – is that you input application metrics with the same APIs Datadog uses for infrastructure metrics, and Datadog charges customers per combination of tags: per combination of dimensions and unique values, or the cardinality per tag.
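To illustrate why per-tag cardinality pricing gets expensive – these are hypothetical numbers for illustration, not Datadog's actual billing formula – the number of billable time series grows with the product of the tag cardinalities, not their sum:

```python
# Hypothetical illustration of tag-cardinality fan-out for one metric name.
endpoints = 50       # distinct values of an "endpoint" tag
status_codes = 10    # distinct values of a "status_code" tag
customer_tiers = 4   # distinct values of a "tier" tag

# Emitting one metric with all three tags can create up to
# 50 * 10 * 4 = 2,000 distinct billable time series.
max_series = endpoints * status_codes * customer_tiers
print(max_series)  # 2000
```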

That’s how Datadog has tried to leverage the shift towards people wanting more granular insight into their applications through its metrics product. Also, in the past three to four years, they have started doing work around tracing as part of their product suite.

They’re effectively double charging customers? If they want APM, do they have to have the metrics as well?

That’s essentially correct, yes. Datadog charges a per-host fee for collecting infrastructure metrics, and then a per-host fee for collecting APM and tracing data. It effectively boils down to the question I raised earlier: should we be approaching observability with a clean slate, or should we be trying to build observability as a multi-product strategy, or as an evolution of one of the existing pillars that a vendor already does well?

In the case of providers like Aspecto or Honeycomb or Lightstep, there is no notion of buying two separate products and paying to store the data twice; it's all one product. The vision is that if you can store the data once and materialize it in a time-series plot, like metrics, or visualize it in a trace waterfall graph, you shouldn't get charged twice for that.

In these services, like you said, it’s the classic ‘land grab’. They enter with one product, up-sell, market their 130% net revenue retention number, and all the investors love it.

It’s particularly interesting, because when we see a client already using both Datadog and New Relic – or Datadog, New Relic and Splunk – all three of those players say, why don't you kick off the other two and let us have all your business? It yields tool sprawl; you are trying to resolve problems across three contradictory systems, each of which you're paying to do the full job.
Even if you adopt only one or two of those tools, you end up with this problem, as they’re still distinct product suites. They have different billing models and different data collection models, so it's hard to reconcile them. That’s where the value proposition of a next-generation tool comes in: let's ditch all these assumptions about how you need to do it, from the start.

I want to dig into that in detail in a moment. On the market more broadly, one thing I’ve never understood is why there are so many of these players. Like you said, there are five to seven players, all coming from a different angle or legacy advantage. Why is that the case?

I think it's because we are still figuring out what the best solutions are. I think this is a golden moment for innovation, and for people to figure out what the future of software delivery looks like, and what the future of observability looks like. I welcome the fact that people can try out different competitors and see what works best for them.

I think the reason the market can support so many competing players is the size of the addressable market and the rate at which it’s expanding. As I mentioned, if only 20% of enterprises today use modern observability, the market has the potential to grow by a factor of five. I think that is an opportunity for the legacy players – let's call them ‘New Splunk Dog’: New Relic, Splunk, Datadog – to branch out from their legacy offerings. You have the offerings started three to five years ago, folks like Honeycomb and Lightstep. You’ve got the brand-new players on the block, like Aspecto and SigNoz. There’s an opportunity for all of us to innovate and compete, and also cooperate in some ways, to ensure that people have that freedom and flexibility.

They all seem to have the same offering, as you said. We use Datadog, for example; I look at the invoice every month, and I have no clue how they’re charging us – similar to AWS. I go on the platform and see so many different products. The twenty-odd products they offer and what Splunk offers are similar. It’s an interesting market structure, where they all have a similar strategy and offering, trying to build different products on the stack and be that all-encompassing platform.

I think every company inevitably does that. It’s hard once you reach a certain scale to get growth out of a single product, and then you must start selling multiple SKUs. I think that's what that boils down to.

What about the hyperscalers?

This is guided more by my experience as a former Google Cloud engineer; I worked on Stackdriver at Google Cloud. The way Google approached Stackdriver was as a table-stakes thing; it was not meant to be a profit center. It was just a value-added service that you got with Google Cloud. Now it’s known as Google Cloud Monitoring – or Google Cloud Operations Suite, I think, is the formal name. The idea is that customers expect to see basic data about what their load balancers are doing. They expect to be able to see records of the load balancer logs.

Every cloud provider has to provide that as an element of their stack, to help you understand the bits that you otherwise would not be able to get data out of. They have to come up with a mechanism to egress those logs, or a mechanism to surface request rates, error rates, and so forth. They have built products around that, which give access to this proprietary data stream and say, okay, you can plug in data from your own app if you want to. However, it's not necessarily a place where they're trying to reap significant revenue; it’s just the bare minimum they need to keep their customers happy.

From what I've seen out of Amazon, it’s very similar. The CloudWatch product is not trying to compete head-on with Datadog, Honeycomb or Splunk. I think it boils down to the fact that people doing a cloud migration will not want to ingest their on-prem logs into CloudWatch; that seems unnecessarily silly. An observability solution should be cloud-neutral. When you tie your observability to a specific cloud, you're locked into that cloud forever.

It is technically true that before Google acquired Stackdriver, Stackdriver had a solution for AWS. I think you can still ingest your AWS logs into Google Cloud Operations, but again, why would you want to do that? It seems a little weird. What we've seen from Google Cloud and Amazon, with their willingness to collaborate on open data formats, is that they recognize that the best way to expand their total market base is not to lock you into their proprietary data platform. Instead, it is to make the data from that proprietary platform available via open means, so that you can ingest it into a common tool; that is what's going to provide them with the greatest growth on their core hyperscaler product.

Growth comes when developers feel they can understand what's happening under the hood and analyze it – and it doesn't have to be in a tool the hyperscaler itself is selling. You don't necessarily have to use CloudWatch and X-Ray to make Amazon happy; you just have to understand what’s happening in your app. You might use the Insights service from CloudWatch alongside another tool like Honeycomb, with the application data being fed into Honeycomb. I think Amazon can have a baseline offering while also being willing to wait and see, and to let ISVs do the work of innovating on observability.

ISVs like Honeycomb spend a lot of money on AWS anyway; data gravity means AWS will be selling their services to the ISVs at a markup, while the ISVs provide the observability. Inevitably – not necessarily today, but at some point – we will have enough customers on Google Cloud that those customers will demand Honeycomb set up a copy of Honeycomb in Google Cloud, so they don't have to pay egress fees.

I think the argument is that hyperscaler observability solutions work okay for table-stakes use cases, and ISVs have to build their platforms on the hyperscaler’s cloud anyway. As long as the hyperscaler's customers can get adequate observability with one of many tools, those tools will all contribute to spending on the hyperscaler, both directly and indirectly.

How do you look at Amazon’s, or the hyperscalers’, incentives as ISVs – Datadog, or even Snowflake, or any ISV – get big and start dominating a certain segment?

Fundamentally, the house always wins; in Las Vegas, the house always wins. I think that's what it boils down to. You can buy Datadog, Splunk or Honeycomb via an Amazon Marketplace contract, so Amazon wins no matter what. They were always going to get their cut, whether from Honeycomb, Datadog and all these providers spending money on AWS itself, or from their percentage cut of AWS Marketplace sales. They make their buck no matter how this goes; it doesn't really matter how dominant any one player is. In fact, it's in Amazon's interest to make sure that no one player becomes dominant; that way they can play us off each other and not have to make as deep concessions on their percentage take from the Marketplace.

What do they think about OpenTelemetry, for example?

We first need to define OpenTelemetry, and then we can talk about the incentives. OpenTelemetry is a vendor-neutral standard, developed under the aegis of the Cloud Native Computing Foundation, for how you generate and transmit telemetry data. When I talked earlier about distributed tracing – these start-stop timers recording all these key-value attribute pairs – OpenTelemetry provides the libraries, the SDKs, that enable you to generate that data, and it specifies the format for that data so that multiple vendors can ingest it. You can even transmit the data in a way that lets you A/B test multiple vendors against each other.

It is a mechanism to ensure that if you have invested in adding OpenTelemetry to your application or stack, you are not locked into a vendor; it is portable instrumentation that you can take to any provider. We’ve seen interest and contributions from Google, Amazon and Microsoft in the OpenTelemetry ecosystem, because they want to offer this kind of takeout ability – so you can, for example, take out Amazon CloudWatch metrics in OpenTelemetry format.

The advantage to them is that they now have only one integration to write and maintain. They can say CloudWatch supports OTLP – the OpenTelemetry Protocol – for egress. Then anyone who supports ingesting OTLP can just suck down CloudWatch Logs and integrate them into their product.

When you say portable, how portable is the data?

The data is portable to the point where you can change a single line in a YAML file to change where the data is sent. Or you can add five lines and send the data to two different sinks simultaneously. I think this is awesome. We're starting to see competitive proofs of concept, where someone is interested in trying out Lightstep and Honeycomb side by side; they can write the instrumentation once, set up a collector fleet, configure the collectors to tee the data to both Honeycomb and Lightstep simultaneously, and do a bake-off. I think it's great for competition.
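For a sense of what that YAML looks like – an illustrative sketch of an OpenTelemetry Collector configuration, not taken from the interview; the endpoints, header names, and environment variables are placeholders to verify against each vendor's docs – teeing one trace pipeline to two sinks is a matter of listing two exporters:

```yaml
# Hypothetical collector config: one OTLP trace pipeline, two sinks.
receivers:
  otlp:
    protocols:
      grpc:

exporters:
  otlp/honeycomb:
    endpoint: api.honeycomb.io:443
    headers:
      x-honeycomb-team: ${HONEYCOMB_API_KEY}
  otlp/lightstep:
    endpoint: ingest.lightstep.com:443
    headers:
      lightstep-access-token: ${LIGHTSTEP_TOKEN}

service:
  pipelines:
    traces:
      receivers: [otlp]
      # Swapping vendors is a one-line change; listing both tees the data.
      exporters: [otlp/honeycomb, otlp/lightstep]
```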

The reason I ask how portable it is, is that we talked before about Kubernetes, which was supposed to promise some kind of portability of workloads across clouds. But when you actually get into the details, maybe it's not as portable as it seems, given that there are lots of APIs linked to the cloud. How would you compare something like Kubernetes and that open-source ecosystem with OpenTelemetry?

I think it goes to the extent of the features you want to use. With Kubernetes, the challenge is that if you want to use the deep Amazon load balancer integration, you'll have to set some custom attributes. Similarly, if you want to use the deep Amazon EBS integration, you'll have to set some attributes and configure your disk volumes. The OpenTelemetry Collector works the same way. Baseline sending of data in OTLP is super easy and efficient. If you want to configure additional options, like tail sampling, the OpenTelemetry Collector supports that only at a basic level. You might need something like Honeycomb’s Refinery product, which handles tail sampling, or Lightstep’s Satellites, which similarly handle tail sampling.
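For reference – again an illustrative sketch extending the earlier config, not from the interview, with hypothetical policy names and values – the basic tail-sampling support mentioned here is the `tail_sampling` processor in the Collector's contrib distribution, which buffers each trace briefly and then decides what to keep:

```yaml
# Hypothetical sketch of basic tail sampling in the collector:
# keep every trace containing an error, plus 10% of the rest.
processors:
  tail_sampling:
    decision_wait: 10s          # wait for a trace's spans before deciding
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: sample-ten-percent
        type: probabilistic
        probabilistic:
          sampling_percentage: 10

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling]
      exporters: [otlp/honeycomb]
```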

I think the trade-off is that there are some things OpenTelemetry does well, such as collecting the data from your applications easily. However, managing the data volume at large scale is where you start to see some degree of vendor differentiation: instead of sending the whole firehose of OTLP data directly to the vendor, the vendor might ask you to install their agent to filter and sift through the data first.

On the other hand, that's not changing the core of how OpenTelemetry works, and it's not changing any instrumentation code; it's just changing the routing behavior. I think that level of decoupling makes it a lot friendlier than Kubernetes manifests, which you must hand-tailor to the cloud you're running on.

Can you take a step back there and share – given how we discussed Datadog’s and Splunk’s business models evolving – how the new modern DevOps or observability platforms, whether Honeycomb, Aspecto or Lightstep, have evolved relative to how Datadog evolved?

I think the evolution of the newer generation has been around market maturity. For instance, Honeycomb did not have a metrics product until a year and a half ago. That's an interesting evolution: we are all in on tracing, but there are customers that have metrics, and metrics are useful to look at alongside your traces. It boils down to: what use cases do we address? How do we integrate up and down the stack? How flexible are we, versus being dogmatic, about meeting people where they are?

Then you have certain things that relate to enterprise readiness. I think those are the areas that any David challenging a Goliath goes through. You must bulk up at a certain point and take on bigger and bigger fights. I don't think we've necessarily changed our core message about why our theory of change is different, but I think we are maturing our go-to-market approach as time goes on.

What is the core message these new modern platforms are bringing to market?

I think the message is to empower developers first, rather than have central IT teams decide. The message is about looking at things clean-sheet: instead of using what worked five or 10 years ago, how do we help you address today's problems and get answers quickly? Anyone who has used Splunk for logs will tell you that while going from 10 minutes of grepping logs to a five- or three-minute query is a revolution, it’s not five seconds. I think speed is a key area of differentiation for the newer providers. It's also about getting people to rethink the data model: do we have to have a separate logging provider, tracing provider and metrics provider? That, I think, is how we’re different.

Lightstep and Aspecto – are they built on top of OpenTelemetry, that protocol?

That’s correct. Lightstep was one of the founding companies that contributed a lot of the initial effort behind OpenTelemetry. Prior to that, they were one of the main innovators behind OpenTracing, one of the predecessors of OpenTelemetry. Aspecto was built on OpenTelemetry from the start; for the entire lifetime of their product, it's only been OpenTelemetry, which I think is cool. Honeycomb joined OpenTelemetry and started making contributions the same year it was first announced, in 2019.

In the case of Honeycomb, we had a legacy set of APIs and SDKs that we developed in house, because there wasn't yet a clear winner between OpenCensus and OpenTracing in 2016 and 2017, when we announced our tracing product. We had to retrofit: we put the Honeycomb Beelines in maintenance mode and went all in on OpenTelemetry for all net new business. I think that’s a generational thing, where Lightstep had to migrate their customers off OpenTracing, and we had to migrate our customers off the Beelines. It is clear now that virtually every new observability client, regardless of company, wants to use OpenTelemetry.

What percentage of the industry would you estimate is using OpenTelemetry today?

Out of the set of people that are using tracing or metrics at all? That is a complicated question, because it depends on the signal type and on what solution someone is using. As I said earlier, if I estimate something like 20% of the industry is using any kind of modern observability product, then I would estimate something like 10% of the industry today is probably using OpenTelemetry.

So half of those customers?

Yes. I think today it stands at roughly half. Basically all the new customers to modern observability added in the past two years are using O-Tel, I would say.

What are they doing exactly? Let’s say I’m a big enterprise customer and I’m just exploring observability. Obviously, there’s Datadog out there. Do I use Datadog and then OpenTelemetry for different things? How do I use it?

I think it depends on who got to you first. Datadog has their proprietary dd-trace solution, which is integrated with their agents. You might have already locked yourself into dd-trace – in which case, I’m sorry, that’s painful and annoying. I spent a lot of time over the past two years talking to the New York Times about that: being locked into dd-trace, and what do we do now?

How sticky is that?

It’s unfortunately sticky; that’s the problem. When you have non-portable instrumentation and you must rip out all these annotations, all this infrastructure, and all these lines of code you've written, it's disruptive. However, I think that's why there is such a common interest in pursuing OpenTelemetry. None of the folks involved in OpenTelemetry want to put people through another rip and replace; we know it would destroy all our markets if everyone had to deal with all these incompatibilities again. Datadog is coming around.

Datadog recently hired a director of OpenTelemetry strategy, who's a friend of mine, and they’re starting to make more of their products OpenTelemetry-compatible, at least to receive OpenTelemetry data. My hope is that they will liberate dd-trace to enable people using it to use any O-Tel-compatible provider, but that's not there yet.

Then that effectively commoditizes them?

That’s the goal. The goal is to commoditize the integrations and SDKs – to make it so that providers differentiate on the data analysis capability they provide, not on whoever your last tracing provider was.

That’s a big move for Datadog?

It is, but I think customer pressure over the past three years has really forced Datadog to reconsider.

How do you see the major players like Datadog evolving and opening up?

I think this is a difference in strategy. Datadog started closed and is now trying to open up, while Splunk and New Relic have supported OpenTelemetry – although New Relic recently pulled back and laid off a bunch of their OpenTelemetry engineers. Splunk has steadfastly expressed support for OpenTelemetry. They realized that going it alone against Datadog was not a winning option; they would rather partner with Honeycomb and Lightstep, and honestly, with end users.
I think that's the other beautiful thing about OpenTelemetry: we have a growing number of end users contributing to the project. People value and appreciate having a community project where contributions are owned by the community upstream, not by a vendor. If you make a patch to dd-trace, it only improves the experience of Datadog users; if you make a patch to OpenTelemetry, it impacts everyone.

Is it like everyone versus Datadog? That’s what it sounds like.

I don't think I can say that; you said it, not me. I think Datadog has changed their strategy recently, considering the pressure from their customers to support OpenTelemetry, and I think that says something. It makes it harder to characterize it as us versus them; it's more that we had disagreeing strategies about the right thing to do for consumers, and they've come around.

For me as a customer of Datadog, what is the main advantage of using OpenTelemetry versus proprietary Datadog?

The advantage is that you have the flexibility and freedom to benefit from improvements to the ecosystem made by other O-Tel users, and you are not necessarily tied to Datadog forever if you decide that another provider meets your needs better. You don't have to rip out all these lines of instrumentation code, because the instrumentation can move with you. Those are the two real advantages: a larger developer base working on the tooling, and portability to avoid lock-in.

Is it improving though, given the scale of these companies? I’d guess they improve their services quickly.

I think that’s the challenge all of us saw at the time: Datadog had 50 or 100 integration engineers working full time on integrations, while Honeycomb, at the time, could only afford to have one or two engineers working on integrations. It was not very tenable for us to compete on integrations, but we wondered if we could pool resources to develop community-maintained integrations for all these projects. Better yet, the magical thing about OpenTelemetry is that it is designed so that library authors – for instance, the authors of Express, the default HTTP web server library for Node.js users – can feel comfortable adding the OpenTelemetry APIs directly to their code, so that every Express user can benefit from OpenTelemetry out of the box. You don't have to install a monkey-patching integration; it just works out of the box.

Whereas the Express authors almost certainly did not want Datadog’s proprietary code added to their open-source tool that’s meant to work with everything. OpenTelemetry is sufficiently neutral that library authors can add it, and it’s zero overhead if you’re not already using O-Tel. Basically, the idea is that you don't even need an integrations team if the library authors add O-Tel by default.
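A sketch of that zero-overhead design choice, in Python for consistency with the earlier examples (the library name is hypothetical, and this is an editorial illustration, not OpenTelemetry project code): a library can depend on the `opentelemetry-api` package alone, whose tracer is a no-op until an application installs an SDK.

```python
# Hypothetical library code that depends only on opentelemetry-api.
# Without an SDK configured, get_tracer() hands back a no-op tracer,
# so this instrumentation costs essentially nothing for non-O-Tel users.
from opentelemetry import trace

tracer = trace.get_tracer("tinyhttp")  # hypothetical web library

def dispatch(route: str, handler):
    with tracer.start_as_current_span(f"tinyhttp {route}") as span:
        span.set_attribute("http.route", route)
        return handler()

# An application opts in by wiring up a real SDK; the library is unchanged:
#   from opentelemetry.sdk.trace import TracerProvider
#   trace.set_tracer_provider(TracerProvider())
```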

How do you see customers today using a proprietary solution like Honeycomb or Lightstep?

I think it's an evolution thing. People never want to rewrite their code. For that reason, you're never going to rip out something like Datadog that has wormed its way into your infrastructure. I think it's about the net new use cases built with OpenTelemetry from the start – built to send data to Honeycomb or Lightstep from the start. That's what we see: the growth of those Datadog custom metrics, which they charge you an arm and a leg for, slows or stops, and then maybe starts to reverse as people start ripping them out.
Once they discover there is a better solution for tracking high-cardinality, high-dimensionality data, that starts shifting into OpenTelemetry data instead. It takes time to make the decision to cut Datadog loose; that doesn't happen overnight. Instead, the question is how to coexist for a period, and, as a platform team, how to enable your software developers to use OpenTelemetry and modern observability.

If we used Datadog’s dd-trace or their metrics or infrastructure monitoring service, is that typically monitoring an enterprise’s whole stack, or is it by workload? Can I migrate those workloads over time, or is it that once I’m with Datadog, I’m stuck with them?

I think, in the case of custom application metrics, it’s a matter of replacing them with OpenTelemetry attributes. In the case of infrastructure metrics, it's a matter of running the right replacement agents to replace Datadog's, or starting to ingest data from Amazon CloudWatch via OpenTelemetry into your observability solution.

So it’s sticky then, what Datadog has?

I think their infrastructure and application metrics are sticky, and it's hard to remove them once they're there. However, again, this is a growing market, which is not necessarily a problem.

How do you see the pricing changing? Datadog is notoriously expensive, and how do you see that changing with the growth of OpenTelemetry?

The argument makes itself. Think about it: your vendor holds you hostage and raises prices year on year. I have spoken to an unnamed financial services company that hires two or three full-time engineers just to manage the Datadog bill and submit pull requests to remove expensive tags. I understand the ROI of spending $400,000 a year worth of engineering resources to cut your bill by $1,000,000 per year, but it is such a waste of engineering talent; we can do so much better. Those arguments make themselves: if a product is notoriously sticky, difficult to remove, and expensive, you are breeding resentment against it, and people will be looking for the next better thing.

How do you compare Datadog, or that kind of stickiness, versus AWS storage and compute stickiness?

Yes. They’re two different forms of gravity. With AWS, the gravity is around your data; that’s why they charge you nothing to ingest it, because they want you to keep your data with them forever. In the case of Datadog, the stickiness is in your code base: all these lines of code and library calls out to Datadog that are hard to remove afterwards.

Datadog is less sticky than EC2?

Either way, you're going to spend engineering resources doing a full-scale migration. But yes, I think you're right; it’s less sticky in that it's not all or nothing. With EC2 it is all or nothing: even if you're trying to execute a multi-cloud strategy, you’re still going to access resources in Amazon, and you’re still going to be charged for accessing them from somewhere else.

Given the shift we’re seeing in the industry, in OpenTelemetry and these new platforms, how do you think you get a competitive advantage as an observability platform today?

I think the competitive advantage is what you can do to solve software developer pain. The Stripe Developer Coefficient survey, from several years back, says software developers spend about 40% of their time doing break-fix work; that's unplanned work. That’s a huge waste across our industry – spending 40% of your time on unplanned work, chasing down bugs. If we can reduce that percentage from 40% to 30% or 20%, that’s where you generate the value. The way you differentiate is by giving people faster answers and faster resolutions to their problems.

What’s limiting the adoption of OpenTelemetry?

Two years ago, I would have answered maturity: it was not mature enough, or it was not generally available. We could scream until our faces turned blue saying it’s unstable only in the sense that APIs are subject to change; it’s not unstable in the sense that it will crash your app. That’s now put to bed: OpenTelemetry is here, it’s here to stay, and it’s stable for any signals that are marked stable. I think now we have the incongruous challenge of complexity. OpenTelemetry is designed to be super flexible and super vendor-neutral, to support everything out of the box, which means configuration can be a little bit annoying. It's unopinionated about so much that to get a working implementation, you might have to paste 100 lines of code.
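To give a flavor of that unopinionated setup – an illustrative sketch, not from the interview; the service name and endpoint are hypothetical – even a minimal manual configuration of the OpenTelemetry Python SDK makes you wire several pieces together explicitly before the first span is exported:

```python
# Minimal manual wiring of the OpenTelemetry Python SDK. Every piece
# (resource, provider, processor, exporter) is an explicit choice.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

resource = Resource.create({"service.name": "checkout"})  # hypothetical name
provider = TracerProvider(resource=resource)
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="collector.example.com:4317"))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("startup-check"):
    pass  # spans now flow to whichever OTLP endpoint is configured
```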

Now there's an opportunity for quick starts: a vendor like Honeycomb or Lightstep can present their distribution, with opinionated defaults that work best with them. However, the customer's custom attributes and instrumentation will work regardless of which of our opinionated SDKs they plug in. I think that's the challenge, and it’s similar to Kubernetes. Kubernetes started off as immature and would break your application; now Kubernetes is complex. Honestly, complexity is not necessarily a bad problem if there are quick starts, and it means it will support and grow with any use cases you have.

For example, with Kubernetes, it seems like the biggest players are EKS and GKE – the hyperscalers. What makes you think that new independent players, like Honeycomb or Lightstep, will be the players for OpenTelemetry?

I think Amazon understands that OpenTelemetry is important, because it saves them from having to write a bunch of custom integrations with every APM provider that comes down the road. That’s their incentive to play ball. As far as why they're not going to take over the observability market, it's simple: it is not a profit center for them. It is astonishingly hard to change the mind of a business about what its profit centers are versus what its cost centers or cost-neutral centers are.

For the sake of innovation, I wish I could say Amazon is investing in making CloudWatch and X-Ray a profit center. I think the recent departures of certain prominent employees say something about whether they felt they could do their most innovative work in a cost center. The same thing happened with Google Cloud, which I can speak to more personally: Melody Meckfessel left and founded a startup – Observable – and I went to Honeycomb. When you see a brain drain from a company with an observability product line, it suggests you're going to see them do maintenance, but not necessarily develop it as a primary focus.

When we look at the potential risks – the major players, and the growth of Honeycomb, Aspecto and Lightstep – how do you see Datadog and New Relic combating the threat of the newer modern observability platforms?

It comes down to discounting. That's the way we've seen them fight dirty – although it’s not even fighting dirty; it’s fighting using the advantage they have. If they already have a contract with you for your log spend and your security spend, they're going to offer you tracing for free. It's challenging to compete with free, even if we know that our solution will save far more developer time. People are still looking at this in terms of their budget, rather than spending some of that budget to help their developers do more. I think that consolidation of vendors is a powerful market force right now, which makes it uphill for anyone trying to break into a company for the first time as a new vendor.

How do you see the market structure changing, with consolidation and, like I say, so many players out there?

I think it's fascinating. I think any player offering a single feature will get snapped up; it's challenging to offer an incomplete set of features rather than something that can address all, or most, of someone's needs. That’s the primary thing. The secondary thing is that we'll continue to see people wanting the flexibility to compete on price and features, and that's the value proposition of OpenTelemetry: you start to see more movement between different OpenTelemetry-supporting vendors. I think that's a win for the consumer; it is unequivocally good. I'm happy to make it easy for someone to switch away from Honeycomb, because if our product is the best, we're going to gain far more new customers than we bleed, and that’s okay.

You mentioned roughly 50% of modern observability users are on OpenTelemetry, or at least using it in some way; how do you see that change over the next five years?

If we’re looking at 80% to 90% of new adopters of observability using OpenTelemetry, it basically says that when we're at 30% to 40% of enterprises using modern observability, something like 35% out of that 40% will be using OpenTelemetry. People will move off legacy solutions to OpenTelemetry, and essentially 100% of the net new business will be on OpenTelemetry.

Do customers do that? Do you see big enterprises only going with OpenTelemetry today?

Yes, we do see that. It's fascinating: a major bank in APAC developed their platform entirely OpenTelemetry-first, and then they started looking for a vendor. I think that's cool; people see the momentum behind O-Tel. People see that, just as you should be going with Kubernetes today, you should be building with O-Tel today – that it’s going to be there for you and it's going to be supported.

Is it difficult though? Because I understand that if you're going to run Kubernetes orchestration, you need good engineers to manage it. How does OpenTelemetry compare in terms of building it yourself?

I think it's going to be similar, at least for now. To effectively leverage OpenTelemetry, you will have to have good engineers. I think it is on all of us to lower the complexity and offer support to the people who are onboarding with OpenTelemetry.

So a big challenge in terms of adoption is having that expertise internally, to build on O-Tel from scratch.

Yes, for sure, but I think that is part of the natural life cycle of every project.

Yes. Looking forward five or 10 years, with OpenTelemetry growing, how do you look at the industry structure: what do the legacy players, the Datadogs and New Relics, who are bolting on products such as infrastructure monitoring, APM, logs and user experience, end up offering? What does the customer journey look like in terms of building their stack, and where does that leave the big players?

My aspiration is that people will build first around this observability workflow – around wide events and traces first – and that they will have a choice of which vendors they use. Some may choose legacy providers that have modernized their observability solutions, and some will use new players – even some that we haven't heard of yet, that haven't been invented yet. I think there is always room for innovation.

What do you think of security?

Here’s the challenge: security is a fundamentally different market. This is where I think Splunk struggles. Splunk is pivoting much more towards the security market in its messaging – maybe that's the right message for their logging product, which is fine. How should I put this? Observability is about empowering your developers to get to the bottom of a problem as quickly as possible, even if using sampled or incomplete data. The idea is that fast and close to right is better than perfect.

Security, by nature, must be exhaustive; you must have 100% of the login attempts to your servers, and it might be okay for a query there to take 10 minutes or 30 minutes. I think of them as completely orthogonal problems, and that, I think, is why OpenTelemetry is not necessarily designed for security purposes. Logs are going to be the bread and butter of security, so, in that sense, Splunk is doing well to emphasize security for their legacy product.

And Datadog is trying to get into security as well.

Indeed, Datadog is trying to get into the security market too. I’m happy to let Datadog and Splunk fight over the security market, because that is not where my attention is.

There could be single products, like you said, that remain with the legacy vendors, but it could be that customers in the future are building OpenTelemetry-first.

The difference is OpenTelemetry-first as far as developer experience and ease of debugging, while people can use whatever logging tool they want for security monitoring. For instance, Honeycomb has adopted Lacework. We think Lacework is great; they're not competing with us. They’re great because they solve a problem for us that we otherwise would not be able to solve on our own.

Effectively, OpenTelemetry commoditizes the core solution of these bigger DevOps players – debugging problems as quickly as possible for developers – and if customers do start from scratch, they might bolt on other individual products like security or logs in the future, but you’ve effectively commoditized the platforms?

Right, exactly. The idea is to give people a default entry point for producing this data, and to give them choice in where they send it. One thing I think is interesting – maybe this is because of your perspective as an analyst – is that I'm really surprised you haven't mentioned ‘do it yourself’: people who are building things with Jaeger themselves. It's worth mentioning that some people do that, either because they feel they must keep all their data inside their own company and won't use SaaS, or because they would rather customize and tailor something they build in-house to their own needs.

This is what we said before: the risk to public cloud is maybe private cloud, which nobody seems to be talking about much.

There are going to be private clouds, and there are going to be people building their own custom solutions. Maybe you’re Uber and you need to build Jaeger, and that’s fine, but there is a cost associated with doing that. Is this really your core business? Especially as we’re seeing headcount budgets shrink, it's a very risky move to tell your manager you want to build a team of 10 or 100 people to build out your own internal observability.

What do you think investors or people in the market typically misunderstand about observability and where the market’s going?

I think the number one incorrect assumption is that it's about ‘collect them all’ – that it’s about logs, traces, metrics, and product suites. I don't think that's what it's about; it's about what capabilities you are unlocking for developers. When you measure this as a percentage of cloud spend, or as data volume, you are mistaking the value prop for how it might be billed. The number one thing to do is interview users and ask what percentage of their organization is using the tool proficiently, and whether they can attribute time saved to it. For me, that’s what it boils down to: the smart money is on whoever can help organizations realize the most value.