
I got OpenTelemetry to work. But why was it so complicated?

hinkley

The whole time I was learning/porting to Otel I felt like I was back in the Java world again. Every time I stepped through the code it felt like EnterpriseFizzBuzz. No discoverability. At all. And their own jargon that looks like it was made by people high on something.

And in NodeJS, about four times the CPU usage of StatsD. We ended up doing our own aggregation to tamp this down and to reduce tag proliferation (StatsD is fine having multiple processes reporting the same tags, OTEL clobbers). At peak load we had 1 CPU running at 60-80% utilization. Until something changes we couldn’t vertically scale. Other factors on that project mean that’s now unlikely to happen but it grates.
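hinkley mentions doing their own aggregation before export to tamp down CPU usage and tag proliferation. A minimal sketch of that kind of client-side pre-aggregation (all names hypothetical; this is not the OTel SDK, just the general technique):

```python
# Hypothetical sketch of client-side pre-aggregation before export:
# counters are summed in-process and flushed once per interval, so the
# exporter sees one point per (name, tags) instead of one per increment.
from collections import defaultdict
from threading import Lock


class MetricAggregator:
    def __init__(self):
        self._lock = Lock()
        self._counters = defaultdict(float)  # (name, sorted tags) -> sum

    def increment(self, name, value=1.0, **tags):
        key = (name, tuple(sorted(tags.items())))
        with self._lock:
            self._counters[key] += value

    def flush(self):
        """Return and reset the aggregated counters; call this on a
        timer and hand the snapshot to whatever exporter you use."""
        with self._lock:
            snapshot = dict(self._counters)
            self._counters.clear()
        return snapshot


agg = MetricAggregator()
for _ in range(1000):
    agg.increment("http.requests", route="/checkout", status="200")
flushed = agg.flush()  # one entry, value 1000.0
```

A thousand increments collapse into a single data point per tag set, which is the shape of the CPU savings described above.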

OTEL is actively hostile to any language that uses one process per core. What a joke.

Just go with Prometheus. It’s not like there are other contenders out there.

sethops1

This matches my conclusion as well. Just use Prometheus and whatever client library for your language of choice, it's 1000x simpler than the OTEL story.

whalesalad

Can you even achieve this with prometheus? Afaik it operates by exposing metrics that are scraped at some interval. High level stuff, not per-trace stuff.

How would you build the "holy grail" map that shows a trace of every sub-component in a transaction broken down by start/stop time, etc.? For instance: show the load balancer seeing a request, the request getting handled by middlewares, then going on to some kind of handler/controller, and the sub-queries inside of that like database calls or cache calls. I don't think that is possible with Prometheus?

niftaystory

Code traces are metrics: run times per function call are metrics, counts of specific function calls are metrics.

Otel is an attempt to package such arithmetic.

Web apps have added so many layers of syntax sugar and semantic wank, we’ve lost sight that it’s all just the same old math operations relative to different math objects. Sets are not triangles but both are tested, quantified, and compared with the same old mathematical ops we learn by middle school.

baby_souffle

> Can you even achieve this with prometheus? Afaik it operates by exposing metrics that are scraped at some interval. High level stuff, not per-trace stuff.

Correct. Prometheus is just metrics.

The main argument for oTel is that instead of one proprietary vendor SDK or importing prometheus and jaeger and whatever you want to use for logging, just import oTel and all that will be done with a common / open data format.

I still believe in that dream but it's clear that the whole project needs some time/resources to mature a bit more.

If anybody remembers the Terraform/ToFu drama, it's been really wild to see how much support everybody pledged for ToFu but all the traditional observability providers have just kinda tolerated oTel :/

bushbaba

Simpler near-term, but more painful long term when you want to switch vendors/stacks.

hinkley

And switching log implementations can be a pain in the butt. Ask me how I know.

But I’d rather do that three more times before I want to see OpenTelemetry again.

Also Prometheus is getting OTEL interop.

kemitche

Nine times out of ten, I've got more valuable problems to solve than a theoretical future change of our vendor/stack for telemetry. I'll gladly borrow from my future self's time if it means I can focus on something more important right now.

pphysch

Is this the same scam as "standard SQL"? Switching database products is never straightforward in practice, despite any marketing copy or wishful thinking.

Prometheus ecosystem is very interoperable, by the way.

Xeago

I wonder what your experience is with Sentry? Not just for error reporting but especially also their support for traces.

Also open-source & self-hostable.

mkeedlinger

This matches my experience. Very difficult to understand what I needed to get the effect I wanted.

paulddraper

There are a lot of Java programmers working on it.

(And some Go tbf.)

hinkley

Yeah and a blind man can see this, it’s so loud.

rtuin

Otel seems complicated because different observability vendors make implementing observability super easy with their proprietary SDK’s, agents and API’s. This is what Otel wants to solve and I think the people behind it are doing a great job. Also kudos to grafana for adopting OpenTelemetry as a first class citizen of their ecosystem.

I’ve been pushing the use of Datadog for years but their pricing is out of control for anyone between mid size company and large enterprises. So as years passed and OpenTelemetry API’s and SDK’s stabilized it became our standard for application observability.

To be honest the documentation could be better overall and the onboarding docs differ per programming language, which is not ideal.

My current team is on a NodeJS/Typescript stack and we’ve created a set of packages and an example Grafana stack to get started with OpenTelemetry real quick. Maybe it’s useful to anyone here: https://github.com/zonneplan/open-telemetry-js

Groxx

[delayed]

dimitar

It is as complicated as you want or need it to be. You can avoid any magic and stick to a subset that is easy to reason about and brings the most value in your context.

For our team, it is very simple:

* we use a library to send traces and traces only[0]. They bring the most value for observing applications and can contain all the data the other types can contain. Basically hash-maps vs strings and floats.

* we use manual instrumentation as opposed to automatic - we are deliberate in what we observe and have a great understanding of what emits the spans. We have naming conventions that match our code organization.

* we use two different backends - an affordable 3rd party service and an all-in-one Jaeger install (just run 1 executable or docker container) that doesn't save the spans to disk, for local development. The second is mostly for peace of mind of team members that they are not going to flood the third-party service.

[0] We have a previous setup to monitor infrastructure and in our case we don't see a lot of value of ingesting all the infrastructure logs and metrics. I think it is early days for OTEL metrics and logs, but the vendors don't tell you this.

Groxx

[delayed]

madeofpalk

It's as complicated as you want, but it's not as easy as I want. The floor is pretty high.

I'm still looking for an endpoint just to send simple one-off metrics to from parts of infrastructure that's not scrapable.

pat2man

You can just send metrics via JSON to any otlphttp collector: https://github.com/open-telemetry/opentelemetry-proto/blob/v...
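A sketch of what that one-off POST can look like using only the stdlib. The payload shape follows the OTLP JSON encoding (resourceMetrics → scopeMetrics → metrics); the metric name, service name, and endpoint here are illustrative, and 4318 is the conventional OTLP/HTTP port:

```python
# Sketch of a one-off OTLP/HTTP JSON metric POST with no SDK at all.
import json
import time
import urllib.request


def otlp_gauge_payload(name, value, unit=""):
    """Build a minimal OTLP JSON body carrying a single gauge point."""
    now_ns = str(time.time_ns())
    return {
        "resourceMetrics": [{
            "resource": {"attributes": [{
                "key": "service.name",
                "value": {"stringValue": "cron-job"},  # hypothetical service
            }]},
            "scopeMetrics": [{
                "scope": {"name": "manual"},
                "metrics": [{
                    "name": name,
                    "unit": unit,
                    "gauge": {"dataPoints": [{
                        "timeUnixNano": now_ns,
                        "asDouble": float(value),
                    }]},
                }],
            }],
        }],
    }


def send(payload, endpoint="http://localhost:4318/v1/metrics"):
    """POST the payload to a collector's OTLP/HTTP metrics endpoint."""
    req = urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    return urllib.request.urlopen(req)


payload = otlp_gauge_payload("backup.duration", 12.5, unit="s")
# send(payload)  # uncomment when a collector is actually listening
```

This is roughly the "send data in a few lines" experience the sibling comment asks for, minus any client library.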

madeofpalk

Shame none of this comes up whenever I search for it!

Top google result, for me, for 'send metrics to otel' is https://opentelemetry.io/docs/specs/otel/metrics/. If I go through the Language APIs & SDKs there's a whole bunch more useless junk https://opentelemetry.io/docs/languages/js/

Compare to the InfluxDB "send data" getting started https://docs.influxdata.com/influxdb/cloud/api-guide/client-... which gives you exactly it in a few lines.

hinkley

I did not find that manual instrumentation made things simpler. You’re trading a learning curve that now starts way before you can demonstrate results for a clearer understanding of the performance penalties of using this Rube Goldberg machine.

Otel may be okay for a green field project but turning this thing on in a production service that already had telemetry felt like replacing a tire on a moving vehicle.

buzzdenver

Mind sharing that that affordable 3rd party service is?

nimish

It's complicated because it's designed for the companies selling OTel-compatible software, not the engineers implementing it.

convolvatron

this is going to come off as being fussy, but 'implement' used to refer to the former activity, not the latter. which is fine, meanings change, it's just amusing that we no longer have a word we can use for 'sitting down and writing software to match a specification' and only 'taking existing software and deploying it on servers'

skrebbel

This has been the case for ages. Sysadmins use "implement" to mean "install software on servers and keep it running", coders use "implement" to mean "code stuff that matches a spec/interface". It's just two worlds accidentally using the same term for a different thing. No meanings are changing. Two MS certified sysadmins in 1999 could talk about how they were "Implementing Exchange across the whole company".

stronglikedan

It's still implementing. Someone has taken the specifications and implemented the software, and then someone else has taken the software and implemented a solution with it.

hinkley

Operational versus builder jargon.

BugsJustFindMe

If you get to the end you find that the pain was all self-inflicted. I found it to be very easy in Python with standard stacks (mysql, flask, redis, requests, etc), because you literally just do a few imports at the top of your service and it automatically hooks itself up to track everything without any fuss.

verall

So recently I needed to set this up for a very simple flask app. We're running otel-collector-contrib, jaeger-all-in-one, and prometheus on a single server with docker compose (has to be all within the corpo intranet for reasons..)

Traces work, and I have the spanmetrics exporter set up, and I can actually see the spanmetrics in prometheus if I query directly, but they won't show up in the jaeger "monitor" tab, no matter what I do.
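For what it's worth, the Jaeger "Monitor" tab does not read the collector's output directly — it queries Prometheus itself, so jaeger-query has to be told where Prometheus lives. A hedged docker-compose fragment with the env vars that usually matter (names as I recall them from Jaeger's SPM docs; verify against your Jaeger version, since the spanmetrics connector emits different metric names than the old processor did):

```yaml
# Sketch only: jaeger-all-in-one needs to know where Prometheus is
# before the Monitor tab can render anything.
services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    environment:
      - METRICS_STORAGE_TYPE=prometheus
      - PROMETHEUS_SERVER_URL=http://prometheus:9090
      # the newer spanmetrics *connector* uses different metric
      # names/labels than the deprecated processor; this flag tells
      # jaeger-query to expect the connector's naming:
      - PROMETHEUS_QUERY_SUPPORT_SPANMETRICS_CONNECTOR=true
```

If the metrics are visible in Prometheus but the tab stays empty, a mismatch between these settings and the collector's spanmetrics naming is the usual culprit.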

I spent 3 days on this before my boss is like "why don't we just manually instrument and send everything to the SQL server and create a grafana dashboard from that" and agh I don't want to do that either.

Any advice? It's literally the simplest usecase but I can't get it to work. Should I just add grafana to the pile?

BugsJustFindMe

Yeah the biggest trouble really is on the dashboarding side of things, not the sending side, and is why there are popular SaaS products like datadog. If you're amenable to saas, datadog is probably the best way. Otherwise, look into SigNoz for a one-stop solution with minimal effort even if there are some rough edges still.

baby_souffle

> I found it to be very easy in Python with standard stacks (mysql, flask, redis, requests, etc), because you literally just do a few imports at the top of your service and it automatically hooks itself up to track everything without any fuss.

Yes, but only if everything in your stack is supported by their auto instrumentation. Take `aiohttp` for example. The latest version is 3.11.X and ... their auto instrumentation claims to support `3.X` [0] but results vary depending on how new your `aiohttp` is versus the auto instrumentation.

It's _magical_ when it all just works, but that ends up being a pretty narrow needle to thread!

[0]: https://github.com/open-telemetry/opentelemetry-python-contr...

etimberg

Until you run your server behind something like gunicorn and all of the auto imports stop working and you have to do it all yourself.

BugsJustFindMe

It works with uwsgi just fine though.

pat2man

A lot of web frameworks etc do most of the instrumentation for you these days. For instance using opentelemetry-js and self hosting something like https://signoz.io should take less than an hour to get spun up and you get a ton of data without writing any custom code.

hocuspocus

Context propagation isn't trivial on a multi-threaded async runtime. There are several ways to do it, but JVM agents that instrument bytecode are popular because they work transparently.

hinkley

While that’s true, if you’ve already solved punching correlation-IDs and A/B testing (feature flags per request) through then you can use the same solution for all three. In fact you really should.

Ours was old so based on domain <dry heaving sounds>, but by the time I left the project there were just a few places left where anyone touched raw domains directly and you could switch to AsyncLocalStorage in a reasonable amount of time.

The simplest thing that could work is to pass the original request or response context everywhere but that… has its own struggles. It’s hell on your function signatures (so I sympathize with my predecessors not doing that but goddamn) and you really don’t want an entire sequence diagram being able to fire the response. That’s equivalent to having a function with 100 return statements in it.
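The comment above is about Node's AsyncLocalStorage; the Python analogue is stdlib `contextvars` (which OTel's Python context propagation is itself built on), and it shows the same payoff — request-scoped state without threading it through every function signature:

```python
# Sketch of request-scoped correlation IDs using stdlib contextvars,
# the Python analogue of Node's AsyncLocalStorage.
import contextvars
import uuid

correlation_id = contextvars.ContextVar("correlation_id", default=None)


def handle_request():
    """Entry point sets the ID once; everything downstream can read it
    without it appearing in any function signature."""
    correlation_id.set(str(uuid.uuid4()))
    return do_work()


def do_work():
    # deep in the call stack: no threading of args required
    return correlation_id.get()


# each simulated request runs in its own Context, so IDs don't leak
a = contextvars.copy_context().run(handle_request)
b = contextvars.copy_context().run(handle_request)
```

The same mechanism carries trace context, correlation IDs, and per-request feature flags, which is the "use the same solution for all three" point above.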

pranay01

Agree. Here's the repo for SigNoz if you want to check it out - https://github.com/signoz/signoz

deepsun

Same thing. OpenTelemetry grew up from Traces, but Metrics and Logs are much better left to specialized solutions.

Feels like a "leaky abstraction" (or "leaky framework") issue. If we wanted to put everything under one umbrella, then well, an SQL database can also do all these things at the same time! Doesn't mean it should.

incangold

I think giving metrics and logging a location in a trace is really useful.

But I still dislike OTel every time I have to deal with it.

hinkley

You can’t do fine grained tracing in OTEL because if you hit 500 spans in a single trace it starts dropping the trace. Basically a toy solution for brownfield work.

pranay01

As mentioned by philip below, 500 spans is a very small amount. I have seen customers send 1000s of spans in a trace very easily

IneffablePigeon

This is just not true. We have traces with hundreds of thousands of spans. Those are not very readable but that’s another problem.

phillipcarter

...huh? I work with customers who (through a mistake) have created literally multi-million span traces using OTel. Are you referring to a particular backend?

cedws

I still don’t understand what OTEL is. What problem is it solving? If it’s a standard what is the change for the end user? Is it not just a matter of continuing to use whatever (Prometheus, Grafana, etc) with the option to swap components out?

hangonhn

For the tracing part of Otel, neither Prometheus nor Grafana are capable of doing that. Tracing is the most mature part of Otel and the most compelling use case for it. For metrics, we've stayed with Prometheus and AWS Cloudwatch Metrics. The metrics part feels very under developed at the moment.

hinkley

When I last looked 9 months ago, there were libraries on the metrics side of the tree still marked as experimental that you couldn’t successfully send metrics without using. And a huge memory leak in the JS implementation that was only fixed 15 months ago: https://github.com/open-telemetry/opentelemetry-js/issues/41...

Things, especially crosscutting concerns, you want to use in production should have stopped experiencing basic growing pains like this long before you touch them. It’s not baked yet. Come back in a year. Or two.

barake

Everything is either in development or stable. There aren't statuses like alpha, beta, release candidate, etc. except for individual library releases. Metric clients will be marked as "development" until it goes "stable" [0]. Consequently it can be hard to determine the actual maturity level of any given implementation.

Tracing is very mature, with metric and logging implementations stable for a number of popular languages [1].

the "experimental" status was renamed "development"

[0] https://opentelemetry.io/docs/specs/otel/versioning-and-stab...

[1] https://opentelemetry.io/docs/languages/#status-and-releases

dionian

i can report the same traces to jaeger if i want open source, or i switch out the provider and it can go to aws x-ray (paid), without any code or config changes. pretty useful. yes, a tad clumsy to set up the first time.

paulddraper

The point of OTel is interoperability.

For example the author of the software instruments it with OTel -- either language interface or wire protocol -- and the operator of the software uses the backend of choice.

Otherwise, you have a combinatorial matrix of supported options.

(Naturally, this problem is moot if the author and operator are the same.)

hinkley

Interoperability with what?

Where are the three existing, successful solutions it is trying to abstract over?

It doesn’t know what it is because it’s violating the Rule of Three.

arccy

interoperability between vendors, so your business isn't stuck with a vendor who can raise prices because their SDKs are deeply embedded in your codebase, so open source libraries / products have a common point to hook into without needing to integrate with each vendor.

jiggawatts

Application Insights, Data Dog, New Relic, etc…

APM products in general.

GauntletWizard

Interoperability with the other things your Otel Vendor is selling you. No two implementations are even remotely compatible, but they can all mostly scrape data from your Prometheus endpoints, so it's easy to migrate from useful software to their walled garden.

edenfed

Definitely can relate, this is why I started an open-source project that focus on making OpenTelemetry adoption as easy as running a single command line: https://github.com/odigos-io/odigos

6r17

I have implemented OTEL over numerous projects to retrieve traces. It's just a total pain and I'd 500% skip it for anything else.

cglan

I agree. I tried to get it to work recently with datadog, but there were so many hiccups. I ended up having to use datadog's solution mostly. The documentation across everything is also kind of confusing.

SomaticPirate

imo Datadog is pretty hostile to OTel too. Ever since https://github.com/open-telemetry/opentelemetry-collector-co... was nearly killed by them I never felt like they fully supported the standard (perhaps for good reasons)

OTel is a bear though. I think the biggest advantage it gives you is the ability to move across tracing providers

rikthevik

> the ability to move across tracing providers

It's a nice dream. At Google Cloud Next last year, the vendors kinda came in two buckets: Datadog, and everyone trying to replace Datadog's outrageous bills.

bebop

I worry that vision is not going to become reality if the large observability vendors don't want to support the standard.

phillipcarter

FWIW the "datadog doesn't like otel" thing is kind of old hat, and the story was a little more complicated at the time too.

Nowadays they're contributing more to the project directly and have built some support to embed the collector into their DD agent. Other vendors (splunk, dynatrace, new relic, grafana, honeycomb, sumo logic, etc.) contribute to the project a bunch and typically recommend using OTel to start instead of some custom stuff from before.

hangonhn

Yeah, their agent will accept traces from the standard Otel SDK, but there was no way to make their SDK send traces to anyone other than Datadog when I last checked a couple(?) of years ago.

I mean I understand why they did that but it really removes one of the most compelling parts about Otel. We ended up doing the hard work of using the standard Otel libraries. I had to contribute a PR or two to get it all to work with our services but am glad that's the route we went, because now we can switch vendors if needed (which is likely in the not too distant future in our case).

ljm

The biggest barrier to setting up oTel for me is the development experience. Having a single open specification is fantastic, especially for portability, but the SDKs are almost overwhelmingly abstract and therefore difficult to intuit.

I used to really like Datadog for being a one-stop observability shop and even though the experience of integrating with it is still quite simple, I think product and pricing wise they've jumped the shark.

I'm much happier these days using a collection of small time services and self-hosting other things, and the only part of that which isn't joyful is the boilerplate and not really understanding when and why you should, say, use gRPC over HTTP, and stuff like that.

pranay01

part of the reason for that experience is also because DataDog is not OpenTelemetry-native and all their docs and instructions encourage use of their own agents. Using DataDog with Otel is like trying to touch your nose by reaching around the back of your head.

You should try Otel native observability platforms like SigNoz, Honeycomb, etc. your life will be much simpler

Disclaimer: I am one of the maintainers at SigNoz