
I got OpenTelemetry to work. But why was it so complicated?

hinkley

The whole time I was learning/porting to Otel I felt like I was back in the Java world again. Every time I stepped through the code it felt like EnterpriseFizzBuzz. No discoverability. At all. And their own jargon that looks like it was made by people high on something.

And in NodeJS, about four times the CPU usage of StatsD. We ended up doing our own aggregation to tamp this down and to reduce tag proliferation (StatsD is fine with multiple processes reporting the same tags; OTEL clobbers them). At peak load we had 1 CPU running at 60-80% utilization. Until something changes we can't vertically scale. Other factors on that project mean that's now unlikely to happen, but it grates.

OTEL is actively hostile to any language that uses one process per core. What a joke.

Just go with Prometheus. It’s not like there are other contenders out there.

to11mtm

I'm fairly convinced that OTEL is a form of 'vendor capture': the only way to get a standard was to compromise with various bigcorps and sloppy startups and glue-gun it all together.

I tried doing a simple OTel setup in .NET, and after a few hours of trying to grok the documentation of the vendor my org has chosen, I hopped into a Discord run by a colleague whose business model is partly 'pay for the good OTel on the OSS product', and immediately concluded that whatever it cost, it was worth the money.

I'd rather build another reliable event/pubsub library without prior experience than try to implement OTEL.

KronisLV

> It’s not like there are other contenders out there.

Apache Skywalking might be worth a look in some circumstances: it doesn't eat too many resources, is fairly straightforward to set up and run, and while admittedly somewhat janky (not the most polished UI or docs), it works okay: https://skywalking.apache.org/

Also I quite liked that a minimal setup is indeed pretty minimal: a web UI, a server instance and a DB that you already know: https://skywalking.apache.org/docs/main/latest/en/setup/back...

In some ways, it's a lot like Zabbix in the monitoring space - neither will necessarily impress anyone, but both have a nice amount of utility.

sethops1

This matches my conclusion as well. Just use Prometheus and whatever client library for your language of choice, it's 1000x simpler than the OTEL story.

whalesalad

Can you even achieve this with prometheus? Afaik it operates by exposing metrics that are scraped at some interval. High level stuff, not per-trace stuff.

How would you build the "holy grail" map that shows a trace of every sub-component in a transaction broken down by start/stop time, etc.? For instance: show the load balancer seeing a request, the request getting handled by middlewares, then going on to some kind of handler/controller, and the sub-queries inside of that like database calls or cache calls. I don't think that is possible with Prometheus?

baby_souffle

> Can you even achieve this with prometheus? Afaik it operates by exposing metrics that are scraped at some interval. High level stuff, not per-trace stuff.

Correct. Prometheus is just metrics.

The main argument for oTel is that instead of one proprietary vendor SDK or importing prometheus and jaeger and whatever you want to use for logging, just import oTel and all that will be done with a common / open data format.

I still believe in that dream but it's clear that the whole project needs some time/resources to mature a bit more.

If anybody remembers the Terraform/ToFu drama, it's been really wild to see how much support everybody pledged for ToFu but all the traditional observability providers have just kinda tolerated oTel :/

hinkley

Yeah part of the problem is it’s called Opentelemetry and half of you are only talking about tracing, not metrics. Telemetry is metrics. It’s been metrics since at least the Mercury Program.

Metrics in OTEL is about three years old and it’s garbage for something that’s been in development for three years.

tonyhart7

It looks like a hassle to implement, ngl.

niftaystory

Code traces are metrics. Run times per function call are metrics; counts of specific function calls are metrics.

Otel is an attempt to package such arithmetic.

Web apps have added so many layers of syntax sugar and semantic wank that we've lost sight that it's all just the same old math operations relative to different math objects. Sets are not triangles, but both are tested, quantified, and compared with the same old mathematical ops we learn by middle school.

paulddraper

Prometheus is good, but let's be clear...you don't get tracing.

PeterCorless

For tracing FOSS: Grafana Tempo.

https://grafana.com/oss/tempo/

Thaxll

You probably don't understand what Otel is if you think that Prometheus is an alternative.

MathMonkeyMan

You'd do better to point out which distinction you think the parent poster is missing.

My guess is that Prometheus cannot do distributed tracing, while OpenTelemetry can. Is that what you meant?

seadan83

Why Otel compared to prometheus+syslog+(favorite way to do request tagging, eg: MDC in slf4j)+grep?

Syslog is kinda a pain, but it's an hour of work and log aggregation is set up. Is the difference the pain of doing simple things with elastic compute and kubernetes?

bushbaba

Simpler near-term, but more painful long term when you want to switch vendors/stacks.

kemitche

Nine times out of ten, I've got more valuable problems to solve than a theoretical future change of our vendor/stack for telemetry. I'll gladly borrow from my future self's time if it means I can focus on something more important right now.

hinkley

And switching log implementations can be a pain in the butt. Ask me how I know.

But I’d rather do that three more times before I want to see OpenTelemetry again.

Also Prometheus is getting OTEL interop.

pphysch

Is this the same scam as "standard SQL"? Switching database products is never straightforward in practice, despite any marketing copy or wishful thinking.

Prometheus ecosystem is very interoperable, by the way.

malkia

Using otel from the C++ side... To have cumulative metrics from multiple applications (e.g. not "statsd/delta") I create a relatively low-cardinality process.vpid integer (and somehow coordinate this number to be unique as long as the app emitting it is still alive) - you can use some global object to coordinate it.

Then you can have something that sums, and removes the attribute.

With statsd/delta, if you lose a signal then all data gets skewed; with cumulative metrics you only lose precision.

edit... forgot to say - my use case is "push based" metrics as these are coming from "batch" tools, not long running processes that can be scraped.
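
For illustration, the same idea sketched with the Python SDK (attribute name and the vpid coordination are up to you; a collector can then sum across processes and drop the attribute):

    from opentelemetry import metrics
    from opentelemetry.sdk.metrics import MeterProvider
    from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
    from opentelemetry.sdk.resources import Resource
    from opentelemetry.exporter.otlp.proto.http.metric_exporter import OTLPMetricExporter

    # "process.vpid" is a made-up, low-cardinality slot id, coordinated
    # externally so it stays unique while this process is alive.
    resource = Resource.create({"service.name": "batch-tool", "process.vpid": 3})

    provider = MeterProvider(
        resource=resource,
        metric_readers=[PeriodicExportingMetricReader(OTLPMetricExporter())],
    )
    metrics.set_meter_provider(provider)

    counter = metrics.get_meter("batch").create_counter("items.processed")
    counter.add(10)

    provider.force_flush()  # push-based batch tool: flush before exiting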

mkeedlinger

This matches my experience. Very difficult to understand what I needed to get the effect I wanted.

Xeago

I wonder what your experience is with Sentry? Not just for error reporting but especially also their support for traces.

Also open-source & self-hostable.

mdaniel

Likely only a handful of people care, but Sentry hasn't been open source in quite a while https://github.com/getsentry/sentry/blob/24.12.1/LICENSE.md (I'd have to do tag-spelunking to find the last Apache 2 version)

Glitchtip is the Sentry compatible open source (MIT) one https://gitlab.com/glitchtip/glitchtip-backend/-/blob/v4.2.2... with the extra advantage that it doesn't require like 12 containers to deploy (e.g. https://github.com/getsentry/self-hosted/blob/24.12.1/docker... )

mathfailure

Sentry is not horizontally scalable, thus ~ not-scalable at all, if your company is big.

Fidelix

That's a fair point, but scaling it vertically can take you very far in my experience.

malkia

Quota/pricing.

silisili

Same. I implemented Otel once and exactly once. I wouldn't wish it on any company.

Otel is a design-by-committee garbage pile of half-baked ideas.

paulddraper

There are a lot of Java programmers working on it.

(And some Go tbf.)

hinkley

Yeah and a blind man can see this, it’s so loud.

rtuin

Otel seems complicated because different observability vendors make implementing observability super easy with their proprietary SDKs, agents and APIs. This is what Otel wants to solve, and I think the people behind it are doing a great job. Also, kudos to Grafana for adopting OpenTelemetry as a first-class citizen of their ecosystem.

I've been pushing the use of Datadog for years, but their pricing is out of control for anyone between a mid-size company and a large enterprise. So as years passed and the OpenTelemetry APIs and SDKs stabilized, it became our standard for application observability.

To be honest the documentation could be better overall and the onboarding docs differ per programming language, which is not ideal.

My current team is on a NodeJS/Typescript stack and we’ve created a set of packages and an example Grafana stack to get started with OpenTelemetry real quick. Maybe it’s useful to anyone here: https://github.com/zonneplan/open-telemetry-js

saurik

> Otel seems complicated because different observability vendors make implementing observability super easy with their proprietary SDKs, agents and APIs. This is what Otel wants to solve, and I think the people behind it are doing a great job.

Wait... so, the problem is that everyone makes it super easy, and so this product solves that by being complicated? ;P

to11mtm

The problem is that they make it super easy in very hacky ways and it becomes painful to improve things without startup money.

Also, per the hackiness, it tends to have visible perf impact. I know with the Dynatrace agent we had 0-1ms metrics pop up to 5-10ms (this service had a lot of traffic so it added up), and I'm pretty sure on the .NET side there are issues around the general performance of OTEL. I also know some of the work/'fun' colleagues have had to endure to make OTEL performant for their libs, in spite of the fact it was a message-passing framework where that should be fairly simple...

laichzeit0

Well, let's be fair. You can't get the type of telemetry Dynatrace provides "for free". You have to pay for it somewhere. Pretty sure you can exclude the agent from instrumenting performance-critical parts of the code, if that is your concern.

to11mtm

> I’ve been pushing the use of Datadog for years but their pricing is out of control for anyone between mid size company and large enterprises

Not a fan of Datadog vs just good metric collection. OTOH, I do see the value of OTEL vs what I prefer to do... in theory.

My biggest problem with all of the APM vendors: once you have kernel hooks via your magical agent, all sorts of fun things come up that developers can't explain.

My favorite example: At another shop we eventually adopted Dynatrace. Thankfully our app already had enough built-in metrics that a lead SRE considered it a 'model' for how to do instrumentation... I say that because, as soon as Dynatrace agents got installed on the app hosts, we started having various 'heisenbugs' requiring node restarts as well as a directly measured drop in performance. [0]

Ironically, the metrics saved us from grief, yet nobody had an idea how to fix it. ;_;

[0] - Curiously, the 'worst' one was MSSQL failovers on update somehow polluting our ADO.NET connection pools in a bad way...

richbell

> I say that because, as soon as Dynatrace agents got installed on the app hosts, we started having various 'heisenbugs' requiring node restarts

Our containers regularly fail due to vague LD_PRELOAD errors. Nobody has invested the time to figure out what the issue is because it usually goes away after restarting; the issue is intermittent and non-blocking, yet constant.

It's miserable.

a012

We do at least one rolling restart a day because it’s the best way to GC. And we’re not using any APM yet

EdwardDiego

Thank you! I'm very interested in that.

dimitar

It is as complicated as you want or need it to be. You can avoid any magic and stick to a subset that is easy to reason about and brings the most value in your context.

For our team, it is very simple:

* we use a library to send traces, and traces only[0]. They bring the most value for observing applications and can contain all the data the other types can contain. Basically hash-maps vs strings and floats.

* we use manual instrumentation as opposed to automatic - we are deliberate in what we observe and have a great understanding of what emits the spans. We have naming conventions that match our code organization (see the sketch after this list).

* we use two different backends - an affordable 3rd-party service and an all-in-one Jaeger install (just run 1 executable or docker container) that doesn't save the spans to disk, for local development. The second is mostly for team members' peace of mind that they are not going to flood the third-party service.

[0] We have a previous setup to monitor infrastructure, and in our case we don't see a lot of value in ingesting all the infrastructure logs and metrics. I think it is early days for OTEL metrics and logs, but the vendors don't tell you this.
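
As a rough illustration of what this looks like (Python here; the tracer and span names are made up, just to show the naming-convention idea):

    from opentelemetry import trace

    # Tracer name mirrors the code organization, e.g. "<service>.<module>".
    tracer = trace.get_tracer("billing.invoices")

    def generate_invoice(order_id: str) -> None:
        # Span names follow a "<module>.<operation>" convention so traces
        # read like the code layout.
        with tracer.start_as_current_span("invoices.generate") as span:
            span.set_attribute("order.id", order_id)
            ...  # actual work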

madeofpalk

It's as complicated as you want, but it's not as easy as I want. The floor is pretty high.

I'm still looking for an endpoint to just send simple one-off metrics to, from parts of the infrastructure that aren't scrapable.

pat2man

You can just send metrics via JSON to any otlphttp collector: https://github.com/open-telemetry/opentelemetry-proto/blob/v...
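
For example, a minimal Python sketch (assuming a collector with the default OTLP/HTTP receiver on port 4318; the service and metric names are invented):

    import time
    import requests

    # /v1/metrics is the standard OTLP/HTTP path for the metrics signal.
    OTLP_METRICS_URL = "http://localhost:4318/v1/metrics"

    payload = {
        "resourceMetrics": [{
            "resource": {"attributes": [
                {"key": "service.name", "value": {"stringValue": "nightly-batch"}}
            ]},
            "scopeMetrics": [{
                "scope": {"name": "manual"},
                "metrics": [{
                    "name": "batch.items_processed",
                    "unit": "1",
                    "sum": {
                        "aggregationTemporality": 2,  # cumulative
                        "isMonotonic": True,
                        "dataPoints": [{
                            "asInt": "42",
                            "timeUnixNano": str(time.time_ns()),
                        }],
                    },
                }],
            }],
        }]
    }

    requests.post(OTLP_METRICS_URL, json=payload, timeout=5).raise_for_status()

The same payload works from curl or anything else that can POST JSON.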

madeofpalk

Shame none of this comes up whenever I search for it!

Top google result, for me, for 'send metrics to otel' is https://opentelemetry.io/docs/specs/otel/metrics/. If I go through the Language APIs & SDKs section, there's a whole bunch more useless junk: https://opentelemetry.io/docs/languages/js/

Compare to the InfluxDB "send data" getting started https://docs.influxdata.com/influxdb/cloud/api-guide/client-... which gives you exactly it in a few lines.

pdimitar

https://github.com/openobserve/openobserve, more or less.

First time it takes 5 minutes to setup locally, from then on you just run the command in a separate terminal tab (or Docker container, they have an image too).

hinkley

I did not find that manual instrumentation made things simpler. You’re trading a learning curve that now starts way before you can demonstrate results for a clearer understanding of the performance penalties of using this Rube Goldberg machine.

Otel may be okay for a green field project but turning this thing on in a production service that already had telemetry felt like replacing a tire on a moving vehicle.

dmoy

I've not used otel for anything not greenfield, but I just wanted to say

> felt like replacing a tire on a moving vehicle.

Some people do this as a joke / dare. I mean literally replacing a car tire on a moving vehicle.

You Saudi drift up onto one side, and have people climb out of the side in the air, and then swap the tire while the car is driving on two wheels.

It's pretty insane stuff: https://youtu.be/Str7m8xV7W8?si=KkjBh6OvFoD0HGoh

hinkley

That was the image I had in my head.

My whole career I’ve been watching people on greenfield projects looking down on devs on already successful products for not using some tool they’ve discovered, missing the fact that their tool only functions if you build your whole product around the exact mental model of the tool (green field).

Wisdom is learning to watch for people obviously working on brownfield projects espousing a tool. Like moving from VMs to Docker. Ansible to Kubernetes (maybe not the best example). They can have a faster adoption cycle and more staying power.

PeterCorless

SAS Institute used that exact same analogy & even this video in their talk about implementing ScyllaDB back in 2020 (check out 0:35 in the video):

https://www.scylladb.com/2020/05/28/sas-institute-changing-a...

Seems like moving to OTel might even be a bit more complex for some brownfield folks.

buzzdenver

Mind sharing that that affordable 3rd party service is?

dimitar

honeycomb

mikestorrent

Very sane advice. Most folks will already have something for metrics and logs and unless there's ROI on changing it out, why bother?

Groxx

>You can avoid any magic and stick to a subset...

... if (and only if) all the libraries you use also stick to that subset, yea. That is overwhelmingly not true in my experience. And the article shows a nice concrete example of why.

For green-field projects which use nothing but otel and no non-otel frameworks, yea. I can believe it's nice. But I definitely do not live in that world yet.

junto

One of my biggest problems was the local development story. I wanted logs, traces and metrics support locally but didn't want to spin up a multitude of Docker images just to get that to work. I wanted to be able to check what my metrics, traces, baggage and activity spans look like before I deploy.

Recently, the .NET team launched .NET Aspire and it’s awesome. Super easy to visualize everything in one place in my local development stack and it acts as an orchestrator as code.

Then when we deploy to k8s we just point the OTEL endpoint at the DataDog Agent and everything just works.

We just avoid the DataDog custom trace libraries and SDK and stick with OTEL.

Now it’s a really nice development experience.

https://learn.microsoft.com/en-us/dotnet/aspire/fundamentals...

https://docs.datadoghq.com/opentelemetry/#overview

rochacon

> I wanted logs, traces and metrics support locally but didn’t want to spin up a multitude of Docker images just to get that to work.

This project is really nice for that https://github.com/grafana/docker-otel-lgtm

pdimitar

Just use https://github.com/openobserve/openobserve.

Takes 5 minutes to set it up locally on your dev machine the first time, from then on you can just have a separate terminal tab where you simply run `/path/to/openobserve` and that's it. They also offer a Docker image for local and remote running as well, if you don't want to have the grand complexity of a single statically-linked binary. :P

It's an all-in-one fully compliant OpenTelemetry backend with pretty graphs. I love it for my projects, hasn't failed me in any detectable way yet.

WuxiFingerHold

I'm not convinced by .NET Aspire. It solves a small problem (service discovery and orchestration for local development of multi service projects). But it solves this by making service discovery and orchestration an application level concern. With Aspire you needlessly add complexity at the app level and get locked into a narrow ecosystem. There are many proven alternatives like docker compose for local development. Aspire is not even that much if at all easier than using docker compose and env vars.

Thaxll

There are official all-in-one docker image that have everything.

BiteCode_dev

If you are doing otel with python, use Logfire's client... even if you don't use their offering.

It's FOSS, and you can point it to any OTel-compatible endpoint. Plus the client that the Pydantic team made is 10 times better and simpler than the official otel lib.

Samuel Colvin has a cool interview where he explains how he got there: https://www.bitecode.dev/p/samuel-colvin-on-logfire-mixing-p...
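
A minimal sketch of that (assuming the send_to_logfire flag and standard OTLP env-var routing behave as the Logfire docs describe):

    import logfire

    # Keep data out of the Logfire SaaS and ship it to your own OTel
    # endpoint instead, e.g. via OTEL_EXPORTER_OTLP_ENDPOINT.
    logfire.configure(send_to_logfire=False)

    with logfire.span("nightly-import"):
        logfire.info("processed {n} rows", n=1234)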

edenfed

Definitely can relate; this is why I started an open-source project that focuses on making OpenTelemetry adoption as easy as running a single command line: https://github.com/odigos-io/odigos

pat2man

A lot of web frameworks etc do most of the instrumentation for you these days. For instance using opentelemetry-js and self hosting something like https://signoz.io should take less than an hour to get spun up and you get a ton of data without writing any custom code.

pranay01

Agree. Here's the repo for SigNoz if you want to check it out - https://github.com/signoz/signoz

hocuspocus

Context propagation isn't trivial on a multi-threaded async runtime. There are several ways to do it, but JVM agents that instrument bytecode are popular because they work transparently.

hinkley

While that’s true, if you’ve already solved punching correlation-IDs and A/B testing (feature flags per request) through then you can use the same solution for all three. In fact you really should.

Ours was old so it was based on domains <dry heaving sounds>, but by the time I left the project there were just a few places left where anyone touched raw domains directly, and you could switch to AsyncLocalStorage in a reasonable amount of time.

The simplest thing that could work is to pass the original request or response context everywhere but that… has its own struggles. It’s hell on your function signatures (so I sympathize with my predecessors not doing that but goddamn) and you really don’t want an entire sequence diagram being able to fire the response. That’s equivalent to having a function with 100 return statements in it.

deepsun

Same thing. OpenTelemetry grew up from Traces, but Metrics and Logs are much better left to specialized solutions.

Feels like a "leaky abstraction" (or "leaky framework") issue. If we wanted to put everything under one umbrella, then well, an SQL database can also do all these things at the same time! Doesn't mean it should.

PeterCorless

Cramer wants to get traces out of OTel. Which is ironic because he's one of the creators of OpenTracing.

https://cra.mr/the-problem-with-otel/

deepsun

He also started Sentry, so must know a thing or two on the topic.

incangold

I think giving metrics and logging a location in a trace is really useful.

But I still dislike OTel every time I have to deal with it.

hinkley

You can’t do fine grained tracing in OTEL because if you hit 500 spans in a single trace it starts dropping the trace. Basically a toy solution for brownfield work.

IneffablePigeon

This is just not true. We have traces with hundreds of thousands of spans. Those are not very readable but that’s another problem.

pranay01

As mentioned by philip below, 500 spans is a very small amount. I have seen customers send 1000s of spans in a trace very easily

phillipcarter

...huh? I work with customers who (through a mistake) have created literally multi-million span traces using OTel. Are you referring to a particular backend?

BugsJustFindMe

If you get to the end you find that the pain was all self-inflicted. I found it to be very easy in Python with standard stacks (mysql, flask, redis, requests, etc), because you literally just do a few imports at the top of your service and it automatically hooks itself up to track everything without any fuss.
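
For a Flask service it's roughly this (a sketch assuming the opentelemetry-instrumentation-flask and -requests packages are installed; the OTLP exporter defaults to localhost:4318 and honors OTEL_EXPORTER_OTLP_ENDPOINT):

    from flask import Flask
    from opentelemetry import trace
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor
    from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
    from opentelemetry.instrumentation.flask import FlaskInstrumentor
    from opentelemetry.instrumentation.requests import RequestsInstrumentor

    provider = TracerProvider()
    provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
    trace.set_tracer_provider(provider)

    app = Flask(__name__)
    FlaskInstrumentor().instrument_app(app)   # spans for incoming requests
    RequestsInstrumentor().instrument()       # spans for outgoing HTTP calls

    @app.route("/")
    def index():
        return "ok"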

baby_souffle

> I found it to be very easy in Python with standard stacks (mysql, flask, redis, requests, etc), because you literally just do a few imports at the top of your service and it automatically hooks itself up to track everything without any fuss.

Yes, but only if everything in your stack is supported by their auto instrumentation. Take `aiohttp` for example. The latest version is 3.11.X and ... their auto instrumentation claims to support `3.X` [0] but results vary depending on how new your `aiohttp` is versus the auto instrumentation.

It's _magical_ when it all just works, but that ends up being a pretty narrow needle to thread!

[0]: https://github.com/open-telemetry/opentelemetry-python-contr...

BugsJustFindMe

> their auto instrumentation claims to support `3.X`

Semver should never be treated as anything more than some tired programmer's shrug and prayer that nobody else notices the breakages they didn't notice themselves. Pin strict dependencies instead of loose ones, and upgrade only after integration testing.

There are only two kinds of updates, ones that intend to break something and ones that don't intend to break something, and neither one guarantees that the intent matches the outcome.

baby_souffle

> Semver should never be treated as anything more than some tired programmer's shrug and prayer that nobody else notices the breakages they didn't notice themselves.

That's precisely my point, but you said it better :).

I have had _mixed_ results getting auto instrumentation working reliably with packages that are - technically - supported.

verall

So recently I needed to set this up for a very simple flask app. We're running otel-collector-contrib, jaeger-all-in-one, and prometheus on a single server with docker compose (it all has to be within the corpo intranet for reasons..)

Traces work, and I have the spanmetrics exporter set up, and I can actually see the spanmetrics in prometheus if I query directly, but they won't show up in the jaeger "monitor" tab, no matter what I do.

I spent 3 days on this before my boss is like "why don't we just manually instrument and send everything to the SQL server and create a grafana dashboard from that" and agh I don't want to do that either.

Any advice? It's literally the simplest usecase but I can't get it to work. Should I just add grafana to the pile?

pdimitar

Try https://github.com/openobserve/openobserve, it's extremely easy to self-host and it's an all-in-one solution, dashboards included (though admittedly I've seen prettier ones).

BugsJustFindMe

Yeah the biggest trouble really is on the dashboarding side of things, not the sending side, and is why there are popular SaaS products like datadog. If you're amenable to saas, datadog is probably the best way. Otherwise, look into SigNoz for a one-stop solution with minimal effort even if there are some rough edges still.

verall

We absolutely have to run it ourselves (...corporate reasons...), it's a lightweight service with only a few hundred users so we haven't had to worry much about perf (yet).

SigNoz does look interesting, I may give this a shot, thank you. I'm a bit concerned about it conflicting with other things going on in our docker-compose but it doesn't look too bad..

etimberg

Until you run your server behind something like gunicorn and all of the auto imports stop working and you have to do it all yourself.

jdsleppy

...but with manually running autoinstrumentation in the post fork hook.

I guess there is a lot of undocumented magic in OTel...
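
The post-fork pattern looks roughly like this (a sketch of the documented gunicorn recipe; the service name and exporter choice are assumptions):

    # gunicorn.conf.py -- create the SDK per worker, because the
    # BatchSpanProcessor's exporter thread does not survive the fork.
    from opentelemetry import trace
    from opentelemetry.sdk.resources import Resource
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor
    from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

    def post_fork(server, worker):
        server.log.info("Worker spawned (pid: %s)", worker.pid)
        provider = TracerProvider(
            resource=Resource.create({"service.name": "my-flask-app"})
        )
        provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
        trace.set_tracer_provider(provider)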

BugsJustFindMe

It works with uwsgi just fine though.

nimish

It's complicated because it's designed for the companies selling Otel compatible software, not the engineers implementing it

andy800

Not sure about this, I think the vendors were happy with their own proprietary code, agents and backends because the lock-in ensures that switching costs (in terms of writing all new code) are very high.

paulddraper

That hasn't been what I've seen from the contributors.

If anything I think the backends were kinda slow to adopt.

convolvatron

this is going to come off as being fussy, but 'implement' used to refer to the former activity, not the latter. which is fine, meanings change; it's just amusing that we no longer have a word we can use for 'sitting down and writing software to match a specification', only 'taking existing software and deploying it on servers'

skrebbel

This has been the case for ages. Sysadmins use "implement" to mean "install software on servers and keep it running", coders use "implement" to mean "code stuff that matches a spec/interface". It's just two worlds accidentally using the same term for a different thing. No meanings are changing. Two MS certified sysadmins in 1999 could talk about how they were "Implementing Exchange across the whole company".

hinkley

Operational versus builder jargon.

stronglikedan

It's still implementing. Someone has taken the specifications and implemented the software, and then someone else has taken the software and implemented a solution with it.

dboreham

Author is trying to do something difficult with a non-batteries-included open source (free to them) product. Seems quite uncomplicated given the circumstances. The whole point of OTel is to not get bent over backwards by one of the SaaS "logging/tracing/telemetry" companies, and as such it's going to incur some cost/pain of its own, but typically the bargain is worth taking.

6r17

I have implemented OTEL over numerous projects to retrieve traces. It's just a total pain and I'd 500% skip it for anything else.