
OpenTelemetry for Go: Measuring overhead costs

sa46

Funny timing—I tried optimizing the Otel Go SDK a few weeks ago (https://github.com/open-telemetry/opentelemetry-go/issues/67...).

I suspect you could make the tracing SDK 2x faster with some cleverness. The main tricks are:

- Use a faster time.Now(). Go does a fair bit of work to convert to the Go epoch. (A rough sketch of one cheap-clock idea is below.)

- Use atomics instead of a mutex. I sent a PR, but the reviewer caught correctness issues. Atomics are subtle and tricky.

- Marshal protos directly, either with a hand-rolled encoder or with https://github.com/VictoriaMetrics/easyproto, instead of going through reflection.

The gold standard is how TiDB implemented tracing (https://www.pingcap.com/blog/how-we-trace-a-kv-database-with...). Since Go purposefully (and reasonably) doesn't currently provide a comparable abstraction for thread-local storage, we can't implement similar tricks like special-casing when a trace is modified on a single thread.
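
Not the approach from that PR, just a minimal sketch of the "faster time.Now()" idea under the assumption that ~1ms of timestamp error is acceptable: cache the wall clock in an atomic and refresh it from a background goroutine, so hot paths pay a single atomic load instead of a full time.Now() call. All names here are illustrative, not from the SDK.

    package fasttime

    import (
        "sync/atomic"
        "time"
    )

    // nowNanos holds the most recently sampled wall-clock time in Unix nanos.
    var nowNanos atomic.Int64

    func init() {
        nowNanos.Store(time.Now().UnixNano())
        go func() {
            // Refresh roughly every millisecond; readers never block.
            for t := range time.Tick(time.Millisecond) {
                nowNanos.Store(t.UnixNano())
            }
        }()
    }

    // Now returns the cached time, accurate to about one refresh interval.
    // The result carries no monotonic reading, so durations computed from it
    // are exposed to wall-clock steps.
    func Now() time.Time {
        return time.Unix(0, nowNanos.Load())
    }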

rastignack

Would the sync.Pool trick mentioned here help: https://hypermode.com/blog/introducing-ristretto-high-perf-g...? It’s lossy but might be a good compromise.
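
For reference, a rough sketch of the lossy sync.Pool batching pattern that post describes, with a placeholder payload type and flush callback rather than anything from the OTel SDK: writers append into pooled stripes and flush full ones, and whatever is parked in a stripe when the GC clears the pool is simply dropped.

    package lossy

    import "sync"

    // stripe is a small batch of pending items. Pool entries are pointers so
    // sync.Pool doesn't allocate an interface value on every Put.
    type stripe struct{ items []int }

    // Buffer batches writes through sync.Pool. Items parked in a pooled stripe
    // when the GC clears the pool are lost -- that's the lossy trade-off.
    type Buffer struct {
        pool  sync.Pool
        cap   int
        flush func([]int) // downstream sink; must not retain the slice
    }

    func New(capacity int, flush func([]int)) *Buffer {
        b := &Buffer{cap: capacity, flush: flush}
        b.pool.New = func() any {
            return &stripe{items: make([]int, 0, capacity)}
        }
        return b
    }

    func (b *Buffer) Add(v int) {
        s := b.pool.Get().(*stripe)
        s.items = append(s.items, v)
        if len(s.items) >= b.cap {
            b.flush(s.items) // synchronous flush; the backing array is reused
            s.items = s.items[:0]
        }
        b.pool.Put(s)
    }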

malkia

There is an effort to use the Arrow format for metrics too - https://github.com/open-telemetry/otel-arrow - but no client exports directly to it yet.

reactordev

Mmmmmmm, the last 8 months of my life wrapped into a blog post but with an ad on the end. Excellent. Basically the same findings as me, my team, and everyone else in the space.

Not being sarcastic at all, it’s tricky. I like that the article called out eBPF and why you would want to disable it for speed, but recommends caution. I kept hearing “single pane of glass” marketing speak from executives, and I kept my mouth shut about how that isn’t feasible across the entire organization. Needless to say, they didn’t like that non-answer and so I was canned. What engineers care about is different from organization/business metrics, and the two were often confused.

I wrote a lot of great OTel receivers though. VMware, Veracode, HashiCorp Vault, GitLab, Jenkins, Jira, and the platform itself.

phillipcarter

> I kept hearing from executives a “single pane of glass” marketing speak

It's really unfortunate that observability vendors lean into this and reinforce it too. What the execs usually care about is consolidating engineering workflows and letting teams all "speak the same language" in terms of data, analysis workflows, visualizations, runbooks, etc.

This goal is admirable, but nearly impossible to achieve because it's the exact same problem as solving "we are aligned organizationally", which no organization ever is.

That doesn't mean progress can't be made, but it's always far more complicated than they would like.

reactordev

For sure, it’s the ultimate nirvana. Let me know when an organization gets there. :)

Thaxll

Logging, metrics and traces are not free, especially if you turn them on for every request.

Tracing every HTTP 200 at 10k req/sec is not something you should be doing; at that rate you should sample the 200s (1% or so) and trace all the errors.
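
One way to approximate that inside the Go SDK is to filter at export time with a wrapping SpanProcessor. A hedged sketch (the type names and the 1% ratio are placeholders): spans must still be recording for OnEnd to see their status, so this only cuts export and storage volume, not span-creation overhead.

    package tracing

    import (
        "context"
        "math/rand/v2"

        "go.opentelemetry.io/otel/codes"
        sdktrace "go.opentelemetry.io/otel/sdk/trace"
    )

    // errorBiased forwards every span whose status is Error to the wrapped
    // processor, and only a small fraction of everything else.
    type errorBiased struct {
        next  sdktrace.SpanProcessor
        ratio float64 // fraction of non-error spans to keep, e.g. 0.01
    }

    func (p errorBiased) OnStart(ctx context.Context, s sdktrace.ReadWriteSpan) {
        p.next.OnStart(ctx, s)
    }

    func (p errorBiased) OnEnd(s sdktrace.ReadOnlySpan) {
        if s.Status().Code == codes.Error || rand.Float64() < p.ratio {
            p.next.OnEnd(s)
        }
    }

    func (p errorBiased) Shutdown(ctx context.Context) error   { return p.next.Shutdown(ctx) }
    func (p errorBiased) ForceFlush(ctx context.Context) error { return p.next.ForceFlush(ctx) }

    // NewProvider wires the filter in front of a batching exporter.
    func NewProvider(exp sdktrace.SpanExporter) *sdktrace.TracerProvider {
        return sdktrace.NewTracerProvider(
            sdktrace.WithSampler(sdktrace.AlwaysSample()),
            sdktrace.WithSpanProcessor(errorBiased{
                next:  sdktrace.NewBatchSpanProcessor(exp),
                ratio: 0.01,
            }),
        )
    }

A per-span coin flip like this isn't trace-consistent; a production setup would hash the trace ID or push the decision to a tail sampler in the Collector.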

anonzzzies

A very small percentage of startups gets anywhere near that traffic, so why give them angst? Most people can just do this without any issues and learn from it; only a tiny fraction shouldn't.

cogman10

Having high req/s isn't as big a negative as it once was. Especially if you are using http2 or http3.

Designing APIs that make a high number of requests, each returning a small amount of data, can be quite legitimate. It allows for better scaling and capacity planning than single calls that take a long time and return large amounts of data.

In the old HTTP/1 days, it was a bad thing because a single connection could only service one request at a time. Getting any sort of concurrency or high request rate required many connections (which carried a lot of overhead due to the way TCP works).

We've moved past that.

williamdclt

10k/s across multiple services is reached quickly even at startup scale.

In my previous company (a startup), we used OTel everywhere and we definitely needed sampling for cost reasons (1/30 IIRC). And that was with a much cheaper provider than Datadog.

orochimaaru

Metrics are usually minimal overhead. Traces need to be sampled. Logs need to be sampled at error/critical levels. You also need to be able to dynamically change sampling and log levels.
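
On the "dynamically change log levels" point, the standard library covers that much: a slog.LevelVar can be flipped at runtime, for example via a hypothetical internal endpoint (the path and port below are made up).

    package main

    import (
        "log/slog"
        "net/http"
        "os"
    )

    func main() {
        // LevelVar lets the minimum level be changed at runtime without
        // rebuilding the logger.
        var level slog.LevelVar
        level.Set(slog.LevelError) // only errors by default

        slog.SetDefault(slog.New(
            slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{Level: &level}),
        ))

        // Hypothetical internal endpoint: /debug/loglevel?level=DEBUG
        http.HandleFunc("/debug/loglevel", func(w http.ResponseWriter, r *http.Request) {
            if err := level.UnmarshalText([]byte(r.URL.Query().Get("level"))); err != nil {
                http.Error(w, err.Error(), http.StatusBadRequest)
            }
        })
        _ = http.ListenAndServe("localhost:6061", nil)
    }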

100% traces are a mess. I didn’t see where he set up sampling.

phillipcarter

The post didn't cover sampling, which indeed significantly reduces overhead in OTel: when you head-sample at the SDK level, the spans that aren't sampled are never created. Overhead is more of a concern when doing tail-based sampling only, where you want to trace every request and offload export to a sidecar so those concerns are handled outside your app, which then routes to a sampler elsewhere in your infrastructure.
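
For concreteness, a minimal head-sampling setup in the Go SDK looks roughly like this; the 10% ratio is a placeholder and the exporter endpoint comes from the standard OTLP environment variables.

    package main

    import (
        "context"
        "log"

        "go.opentelemetry.io/otel"
        "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
        sdktrace "go.opentelemetry.io/otel/sdk/trace"
    )

    func main() {
        ctx := context.Background()

        // OTLP/gRPC exporter; endpoint taken from OTEL_EXPORTER_OTLP_ENDPOINT.
        exp, err := otlptracegrpc.New(ctx)
        if err != nil {
            log.Fatal(err)
        }

        tp := sdktrace.NewTracerProvider(
            // Keep ~10% of root traces; children follow the parent's decision.
            // Dropped spans are non-recording, skipping attribute storage and export.
            sdktrace.WithSampler(sdktrace.ParentBased(sdktrace.TraceIDRatioBased(0.1))),
            sdktrace.WithBatcher(exp),
        )
        defer func() { _ = tp.Shutdown(ctx) }()
        otel.SetTracerProvider(tp)
    }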

FWIW at my former employer we had some fairly loose guidelines for folks around sampling: https://docs.honeycomb.io/manage-data-volume/sample/guidelin...

There are outliers, but the general idea is that there's also a high cost to implementing sampling (especially for nontrivial stuff), and if your volume isn't terribly high, you'll probably spend more in engineering time than you would pay for the extra data you may not necessarily need.

jhoechtl

I am relatively new to the topic. In the OP's sample code there is no logging, right? It's metrics and traces but no logging.

How is logging in OTel?

vanschelven

The article never really explains what eBPF is -- AFAIU, it’s a kernel feature that lets you trace syscalls and network events without touching your app code. Low overhead, good for metrics, but not exactly transparent.

It’s the umpteenth OTel-critical article on the front page of HN this month alone... I have to say I share the sentiment, but probably for different reasons. My take is quite the opposite: most of the value is precisely at the application (code) level, so you definitely should instrument... and then focus on Errors over "general observability"[0]

[0] https://www.bugsink.com/blog/track-errors-first/

nikolay_sivko

I'm the author. I wouldn’t say the post is critical of OTEL. I just wanted to measure the overhead, that’s all. Benchmarks shouldn’t be seen as critique. Quite the opposite, we can only improve things if we’ve measured them first.

politician

I don't want to take away from your point, and yet... if anyone lacks background knowledge these days the relevant context is just an LLM prompt away.

vanschelven

It was always "a search away" but on the _web_ one might as well use... A hyperlink

coxley

The OTel SDK has always been much worse to use than Prometheus for metrics — including higher overhead. I prefer to only use it for tracing for that reason.

otterley

Out of curiosity, does Go's built-in pprof yield different results?

The nice thing about Go is that you don't need an eBPF module to get decent profiling.

Also, CPU and memory instrumentation is built into the Linux kernel already.
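
For anyone who hasn't wired it up, exposing the built-in profiler is one blank import and one listener:

    package main

    import (
        "log"
        "net/http"
        _ "net/http/pprof" // registers /debug/pprof/* on http.DefaultServeMux
    )

    func main() {
        // Serve the profiler on an internal-only port, separate from app traffic.
        log.Println(http.ListenAndServe("localhost:6060", nil))
    }

Then, e.g., "go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30" pulls a 30-second CPU profile.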

dmoy

Not on original topic, but:

I definitely prefer having graphs put the unit at least on the axis, if not in the individual axis labels directly.

I.e. instead of having a graph titled "latency, seconds" at the top and then way over on the left have an unlabeled axis with "5m, 10m, 15m, 20m" ticks...

I'd rather have title "latency" and either "seconds" on the left, or, given the confusion between "5m = 5 minutes" or "5m = 5 milli[seconds]", just have it explicitly labeled on each tick: 5ms, 10ms, ...

Way, way less likely to confuse someone when the units are right on the number, instead of floating way over in a different section of the graph

jeffbee

I feel like this is a lesson that unfortunately did not escape Google, even though a lot of these open systems came from Google or ex-Googlers. The overhead of tracing, logs, and metrics needs to be ultra-low. But the (mis)feature whereby a trace span can be sampled post hoc means that you cannot have a nil tracer that does nothing on unsampled traces, because it could become sampled later. And the idea that if a metric exists it must be centrally collected is totally preposterous, makes everything far too expensive when all a developer wants is a metric that costs nothing in the steady state but can be collected when needed.

mamidon

How would you handle the case where you want to trace 100% of errors? Presumably you don't know a trace is an error until after you've executed the thing and paid the price.

phillipcarter

This is correct. It's a seemingly simple desire -- "always capture whenever there's a request with an error!" -- but the overhead needed to set that up gets complex. And then you start heading down the path of "well THESE business conditions are more important than THOSE business conditions!" and before you know it, you've got a nice little tower of sampling cards assembled. It's still worth it, just a hefty tax at times, and often the right solution is to just pay for more compute and data so that your engineers are spending less time on these meta-level concerns.

jeffbee

I wouldn't. "Trace contains an error" is a hideously bad criterion for sampling. If you have some storage subsystem where you always hedge/race reads to two replicas then cancel the request of the losing replica, then all of your traces will contain an error. It is a genuinely terrible feature.

Local logging of error conditions is the way to go. And I mean local, not to a central, indexed log search engine; that's also way too expensive.

phillipcarter

I disagree that it's a bad criterion. The case you describe is what sounds difficult: treating one error as part of normal operations and another as not. That should be considered its own kind of error, or another form of response, and sampling decisions could take that into consideration (or not).