
CI/CD Observability with OpenTelemetry Step by Step Guide

hrpnk

Has anyone seen OTel being used well for long-running batch/async processes? Wonder how the suggestions stack up to monolith builds for Apps that take about an hour.

makeavish

You can use SpanLinks to analyse your async processes. This guide might be a helpful introduction: https://dev.to/clericcoder/mastering-trace-analysis-with-spa...

Also, SigNoz supports rendering a practically unlimited number of spans in the trace detail UI and allows filtering them as well, which has been really useful in analyzing batch processes: https://signoz.io/blog/traces-without-limits/

You can further run aggregation on spans to monitor failures and latency.

PS: I am SigNoz maintainer
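
To illustrate the span-link idea mentioned above, here is a minimal data-model sketch using only the Python standard library. It is not the OpenTelemetry SDK: `SpanContext`, `Span`, `start_span`, and `demo` are hypothetical names made up for this example. The point is only that a batch span can live in its own trace and still point back at the spans that enqueued its inputs.

```python
import secrets
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SpanContext:
    """Identifies one span: which trace it belongs to and its own ID."""
    trace_id: str
    span_id: str


@dataclass
class Span:
    name: str
    context: SpanContext
    # Links point at spans that are related but are not this span's parent.
    links: list = field(default_factory=list)


def start_span(name, trace_id=None, links=()):
    """Create a span; a missing trace_id means 'start a new trace'."""
    context = SpanContext(trace_id or secrets.token_hex(16), secrets.token_hex(8))
    return Span(name, context, list(links))


def demo():
    # Three independent requests each enqueue one item, each in its own trace.
    producers = [start_span(f"enqueue item {i}") for i in range(3)]
    # The batch job runs later in a separate trace and links back to all of them.
    batch = start_span("process batch", links=[p.context for p in producers])
    return producers, batch
```

A parent/child relationship could only name one of the three producers; links let the batch span reference all of them without merging the traces.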

zdc1

I've tried and failed at tracing transactions that span multiple queues (with different backends). In the end I just published some custom metrics for the transaction's success count / failure count / duration and moved on with my life.
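
A rough sketch of the custom-metrics fallback described above, using only the Python standard library. The in-process `counts` and `durations` stores and the `track_transaction` helper are hypothetical; a real setup would publish these values to whatever metrics backend is in use.

```python
import time
from collections import Counter
from contextlib import contextmanager

# Hypothetical in-process metric store, standing in for a metrics backend.
counts = Counter()
durations = []  # seconds, one entry per finished transaction


@contextmanager
def track_transaction(name):
    """Record a success or failure count plus the duration of one transaction."""
    start = time.monotonic()
    try:
        yield
    except Exception:
        counts[f"{name}.failure"] += 1
        raise
    else:
        counts[f"{name}.success"] += 1
    finally:
        durations.append(time.monotonic() - start)


def run_demo():
    with track_transaction("order"):
        pass  # work that succeeds
    try:
        with track_transaction("order"):
            raise RuntimeError("queue backend unavailable")
    except RuntimeError:
        pass  # the failure has already been counted
    return dict(counts), len(durations)
```

This gives up per-transaction causality across the queues, but the three numbers are cheap to collect and do not depend on context surviving every hop.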

dboreham

It doesn't matter how long things take. The best way to understand this is to realize that OTel tracing (and all other similar things) are really "fancy logging systems". Some agent code emits a log message every time something happens (e.g. batch job begins, batch job ends). Something aggregates those log messages into some place they can be coherently scanned. Then something scans those messages, generating some visualization you view. Everything could be done with text messages in text files and some awk script. A tracing system is just that with batteries included and a pretty UI. Understood this way, it should now be clear why the duration of a monitored task is not relevant: once the "begin task" message has been generated, all that has to happen is the sampling agent remembers the span ID. Then when the "end task" message is emitted it has the same span ID. That way the two can be correlated and rendered as a task with some duration. There's always a way to propagate the span ID from place to place (e.g. in an HTTP header, so correlation can be done between processes/machines). This explains sibling comments about not being able to track tasks between workflows: the span ID wasn't propagated.
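
The propagation step described above can be sketched with the standard library alone. This assumes the W3C Trace Context `traceparent` header layout (`version-traceid-parentid-flags`); the helper names `new_traceparent`, `child_traceparent`, `trace_id_of`, and `demo`, and the message dict, are hypothetical, and a real service would normally let an OTel propagator do this.

```python
import re
import secrets

# W3C Trace Context "traceparent" layout: version-traceid-parentid-flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent():
    """Start a new trace: random 16-byte trace ID and 8-byte span ID."""
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-01"


def child_traceparent(parent_header):
    """Keep the trace ID and mint a new span ID for the next hop."""
    match = TRACEPARENT_RE.match(parent_header or "")
    if match is None:
        # Context was lost or mangled, so this hop starts a fresh trace.
        return new_traceparent()
    trace_id, _parent_span_id, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


def trace_id_of(header):
    return header.split("-")[1]


def demo():
    # Producer attaches its context to the message it enqueues.
    message = {"headers": {"traceparent": new_traceparent()}, "body": "build #1"}
    # Consumer (possibly hours later, on another machine) continues the trace.
    consumer = child_traceparent(message["headers"].get("traceparent"))
    return message["headers"]["traceparent"], consumer
```

If any hop drops the header, `child_traceparent` silently starts a new trace, which is the "span ID wasn't propagated" failure: both halves are recorded, but nothing ties them together.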

imiric

That's a good way of looking at it, but it assumes that both start and end events will be emitted and will successfully reach the backend. What happens if one of them doesn't?

candiddevmike

AIUI, there aren't really start or end messages, they're spans. A span is technically an "end" message and will have parent or child spans.

lijok

Depends on the visualization system. It can either not display the entire trace or communicate to the user that the start of the trace hasn’t been received or the trace hasn’t yet concluded. It really is just a bunch of structured log lines with a common attribute to tie them together.
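
As a sketch of the "structured log lines with a common attribute" view, the snippet below groups span records by trace ID and flags traces where a referenced parent span never arrived. The record shape and the `RECEIVED` sample data are hypothetical, not any backend's actual storage format.

```python
from collections import defaultdict

# Hypothetical span records, shaped like structured log lines.
RECEIVED = [
    {"trace_id": "t1", "span_id": "a", "parent_id": None, "name": "pipeline"},
    {"trace_id": "t1", "span_id": "b", "parent_id": "a", "name": "build"},
    # The parent span "c" of this record was never received.
    {"trace_id": "t2", "span_id": "d", "parent_id": "c", "name": "test"},
]


def group_by_trace(records):
    """The trace ID is the common attribute that ties the lines together."""
    traces = defaultdict(list)
    for record in records:
        traces[record["trace_id"]].append(record)
    return dict(traces)


def missing_parents(trace_spans):
    """Return parent IDs that are referenced but were never received."""
    seen = {span["span_id"] for span in trace_spans}
    return {
        span["parent_id"]
        for span in trace_spans
        if span["parent_id"] is not None and span["parent_id"] not in seen
    }


def report(records):
    return {tid: missing_parents(spans) for tid, spans in group_by_trace(records).items()}
```

A UI can use the non-empty sets to show a "parent not received yet" placeholder instead of hiding the whole trace.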

madduci

I run OTel in a GKE cluster to track Jenkins jobs, and its spans/traces handle long-running jobs pretty well.

totetsu

I spent some time working on this. First I tried to make a GitHub action that was triggered on completion of your other actions and passed along the context of the triggering action in the environment. It then used the GitHub API to pull extra details of the steps, tasks, and logs, turned all of that into a process trace, and sent it over an OTel connection to something like Jaeger or Grafana to get flame chart views of step performance. I thought maybe it would be better to do this directly from the runner hosts by watching log files, but the API has more detailed information.
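
A minimal sketch of the API-to-trace step described above, using only the standard library. `SAMPLE_JOB` is made-up data shaped like a job object from the GitHub REST API's workflow-jobs listing (fields such as `steps`, `started_at`, `completed_at`); the span dict layout and the `job_to_spans` helper are assumptions for illustration, and exporting to a tracing backend is left out.

```python
from datetime import datetime

# Made-up job payload, shaped like a GitHub Actions workflow job object.
SAMPLE_JOB = {
    "id": 1,
    "name": "build",
    "started_at": "2024-01-01T10:00:00Z",
    "completed_at": "2024-01-01T10:05:30Z",
    "steps": [
        {"number": 1, "name": "checkout", "conclusion": "success",
         "started_at": "2024-01-01T10:00:00Z", "completed_at": "2024-01-01T10:00:10Z"},
        {"number": 2, "name": "compile", "conclusion": "success",
         "started_at": "2024-01-01T10:00:10Z", "completed_at": "2024-01-01T10:05:30Z"},
    ],
}


def parse_ts(value):
    # fromisoformat() rejects a trailing "Z" on older Python versions.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def to_span(trace_id, span_id, parent_id, name, started_at, completed_at):
    start, end = parse_ts(started_at), parse_ts(completed_at)
    return {
        "trace_id": trace_id, "span_id": span_id, "parent_id": parent_id,
        "name": name, "start": start, "end": end,
        "duration_s": (end - start).total_seconds(),
    }


def job_to_spans(job, trace_id):
    """One root span for the job, one child span per step."""
    root_id = f"job-{job['id']}"
    spans = [to_span(trace_id, root_id, None, job["name"],
                     job["started_at"], job["completed_at"])]
    for step in job["steps"]:
        spans.append(to_span(trace_id, f"{root_id}-step-{step['number']}", root_id,
                             step["name"], step["started_at"], step["completed_at"]))
    return spans
```

Because the timestamps come from the API after the run has finished, the spans are reconstructed after the fact rather than emitted live, which matches the "triggered on completion" design.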

reactordev

As someone who has some experience in observability at scale, the issue with SigNoz, Prom, etc. is that they can only operate on the data that is exposed by the underlying infrastructure, whereas the IaaS has all the information to provide a better experience. Hence CloudWatch.

That said, if you own your infrastructure, I’d build out a signoz cluster in a heartbeat. Otel is awesome but once you set down a path for your org, it’s going to be extremely painful to switch. Choose otel if you’re a hybrid cloud or you have on premises stuff. If you’re on AWS, CloudWatch is a better option simply because they have the data. Dead simple tracing.

FunnyLookinHat

I think you're looking at OTel from a strictly infrastructure perspective, which CloudWatch does effectively solve without any added effort. But OTel really begins to shine when you instrument your backends. Some languages (Node.js) have a whole slew of auto-instrumentation, giving you rich traces with spans detailing each step of the HTTP request, every SQL query, and even usage of AWS services. Making those traces even more valuable is that they're linked across services.

We've frequently seen a slowdown or error at the top of our stack, and the teams are able to immediately pinpoint the problem as a downstream service. Not only that, they can see the specific issue in the downstream service almost immediately!

Once you get to that level of detail, having your infrastructure metrics pulled into your Otel provider does start to make some sense. If you observe a slowdown in a service, being able to see that the DB CPU is pegged at the same time is meaningful, etc.

[Edit - Typo!]

makeavish

Agree with you on this. OTel agents allow exporting all host/k8s metrics correlated with your logs and traces. Exporting AWS service-specific metrics with OTel is not easy, though. To solve this, SigNoz has 1-Click AWS Integrations: https://signoz.io/blog/native-aws-integrations-with-autodisc...

Also SigNoz has native correlation between different signals out of the box.

PS: I am SigNoz Maintainer

elza_1111

FYI for anyone reading, OTel does have great auto-instrumentation for Python, Java and .NET also

elza_1111

There are integrations that let you monitor your AWS resources on SigNoz as well. That said, I personally think CloudWatch is painful in so many other ways too.

Check this out: https://signoz.io/blog/6-silent-traps-inside-cloudWatch-that...

6r17

I did have some bad experiences with OTel, and I have a lot of freedom on deployment. I had never read about SigNoz and will definitely check it out. SigNoz works with OTel, I suppose?

I wonder if there are any other adapters for trace ingest instead of OTel?

bbkane

There are a few: I've played with https://uptrace.dev and https://openobserve.ai/ . OpenObserve is a single binary, so easy to set up

darkstar_16

Jaeger collector perhaps but then you'd have to use the Jaeger UI. Signoz has a much nicer UI that feels more integrated but last I checked had annoying bugs in the UI like not keeping the time selection when I navigated between screens.

6r17

I definitely should look up the tech more. I commented lazily, as SigNoz clearly states that it ingests more than 50 different sources.

elza_1111

Yep, SigNoz is OpenTelemetry native. You can instrument your application with OpenTelemetry and send telemetry data directly to SigNoz.

candiddevmike

How does SigNoz compare to the other "all-in-one" OTel platforms? What part of the open-core bit is behind a paywall?

sali0

noob question, i'm currently adding telemetry to my backend.

I was at first implementing otel throughout my api, but ran into some minor headaches and a lot of boilerplate. I shopped a bit around and saw that Sentry has a lot of nice integrations everywhere, and seems to have all the same features (metrics, traces, error reporting). I'm considering just using Sentry for both backend and frontend and other pieces as well.

Curious if anyone has thoughts on this. Assuming Sentry can fulfill our requirements, the only thing that really concerns me is vendor lock-in. But I'm wondering about other people's thoughts.

vrosas

Think of otel as just a standard data format for the logs/traces/metrics that your backend(s) emit, plus some open source libraries for dealing with that data. You can pipe it straight to an observability vendor that accepts these formats (pretty much everyone does: Datadog, Stackdriver, etc.) or you can simply write the data to a database and wire up your own dashboards on top of it (e.g. Grafana).

Otel can take a little while to understand because, like many standards, it's designed by committee and the code/documentation will reflect that. LLMs can help but the last time I was asking them about otel they constantly gave me code that was out of date with the latest otel libraries.

srikanthccv

>I was at first implementing otel throughout my api, but ran into some minor headaches and a lot of boilerplate

OTel also has numerous integrations: https://opentelemetry.io/ecosystem/registry/. In contrast, Sentry lacks traditional metrics and other capabilities that OTel offers. IIRC, Sentry experimented with "DDM" (Delightful Developer Metrics), but this feature was deprecated and removed while still in alpha/beta.

Sentry excels at error tracking and provides excellent browser integration. This might be sufficient for your needs, but if you're looking for the comprehensive observability features that OpenTelemetry provides, you'd likely need a full observability platform.

whatevermom

Sentry isn’t really a full on observability platform. It’s for error reporting only (that is annotated with traces and logs). It turns out that for most projects, this is sufficient. Can’t comment on the vendor lock-in part.

dboreham

You can run your own sentry server (or at least last time I worked with it you could). But as others have noted sentry is not going to provide the same functionality as OTel.

bravesoul2

That's a genius idea. So obvious in retrospect.