OTEL as a set of standards is admirable and ambitious, though in my experience actual implementations differ significantly between vendors, and they all seem to overcomplicate it.
Plus the tens of terabytes of data you have to store for a week's worth of traces.
That's why you sample just enough instead of storing everything
That sounds great until you have a massive issue that costs the company real money and leadership asks why you weren't logging everything in full fidelity?
We run with Debug logging on in prod for that reason too. We also ingest insane amounts of data but it does seem to be worth it for a sufficiently complex and important enough system to really have it all.
> That sounds great until you have a massive issue that costs the company real money and leadership asks why you weren't logging everything in full fidelity?
You should have an answer, right? Like, in your case, you run a lot of logging, and you know why. So if it's off, you say "because it would cost $X million a year and we decided not to do it."
Of course, if you're the one who set it up, you should have the receipts on when that decision was made. This can be tricky sometimes because a lot of software dev ICs are strangely insulated from direct budgets, but if you're presented with an option that would be helpful but would cost a ton of money, it's generally a good idea to at least quickly run it by someone higher up to confirm the desired direction.
I’ve used feature flags to manage logging verbosity and sample rate. It’s really nice to be able to go from logging very little to incrementally pump up the volume when there’s an incident.
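The feature-flag approach can be sketched in plain Python. The flag store, names, and sampling helper below are hypothetical stand-ins; a real setup would read from your feature-flag service and hand the sampling decision to the tracing SDK:

```python
import logging
import random

# Hypothetical in-process flag store; in practice this would be backed
# by your feature-flag service and refreshed at runtime.
FLAGS = {"log_level": "WARNING", "trace_sample_rate": 0.01}

class FlagControlledFilter(logging.Filter):
    """Drop log records below the level currently set in the flag store."""
    def filter(self, record):
        threshold = logging.getLevelName(FLAGS["log_level"])
        return record.levelno >= threshold

def should_sample_trace():
    """Head-sampling decision driven by the same flag store."""
    return random.random() < FLAGS["trace_sample_rate"]

# During an incident, flip the flags to capture everything:
FLAGS["log_level"] = "DEBUG"
FLAGS["trace_sample_rate"] = 1.0
```

Because the filter consults the flag store on every record, turning the volume up or down takes effect immediately, with no redeploy.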
> and leadership asks why you weren't logging everything in full fidelity?
I haven't been asked this question ever. In a way, I wish I was. I wish leadership was engaged in the details of the capabilities of the systems they lead.
But I don't see anyone asking me this question any time soon either.
Have you ever been asked “why didn’t we catch this sooner?”. I feel like it’s the same question worded differently
It's really two questions:
1. Why didn't we catch this sooner?
2. Why did it take so long to mitigate?
Without the debug logging, #2 can be really tricky sometimes as well, since you can be flying blind to some deep internal conditional branch firing off.
Sampling unconditionally at the start of the request is worth less than sampling at the end (so that you sample 1% of successful traces but 100% of traces with issues).
We do. 0.5%
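The end-of-request (tail-based) sampling decision can be sketched in plain Python; the trace shape here is illustrative, not a real SDK type:

```python
import random

def keep_trace(trace):
    """Tail-sampling decision made once the trace is complete:
    keep every trace containing an error, and ~1% of the rest."""
    has_error = any(span.get("status") == "ERROR" for span in trace["spans"])
    if has_error:
        return True
    return random.random() < 0.01
```

The tradeoff is that tail sampling requires buffering the whole trace somewhere (e.g. a collector) until it finishes, which is why head sampling is the cheaper default.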
Has anyone used OpenTelemetry for long-running batch jobs? OTel seems designed for web apps where spans last seconds/minutes, but batch jobs run for hours or days. Since spans are only submitted after completion, there's no way to track progress during execution, making OTel nearly unusable for batch workloads.
I have a similar issue with Prometheus -- not great for batch job metrics either. It's frustrating how many otherwise excellent OSS tools are optimized for web applications but fall short for batch processing use cases.
You could use span links for this. The idea is you have a bunch of discrete traces that indicate they are downstream or upstream of some other trace. You'd just have to bend it a bit to work in your probably-single-process batch executor!
I’ve implemented OTEL for background jobs, so async jobs that get picked up from the DB, where I store the trace context in the DB and pass it along to multiple async jobs. Some jobs that fail and retry with a backoff strategy can take many hours, and we can see the traces fine in Grafana. Each job creates its own span, but they are all within the same trace.
Works well for us, I’m not sure I understand the issue you’re facing?
Ok, after re-reading I think your issue is with long-running spans. I think you should break your spans down into smaller chunks. But a trace can take many hours or days, and be analysed even when it's not finished.
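Breaking a long job into short spans under one trace can be sketched as follows. This is a plain-Python stand-in, not the OTel SDK; a real version would export each span as soon as it finishes, which is what makes progress visible mid-job:

```python
import time
import uuid

def new_trace_id():
    return uuid.uuid4().hex

def run_batch_job(items, trace_id, parent_span_id=None, chunk_size=100):
    """Process a long job as many short spans under one trace.
    Each chunk's span is finished (and exportable) immediately, so
    progress is visible long before the whole job completes."""
    spans = []
    for start in range(0, len(items), chunk_size):
        chunk = items[start:start + chunk_size]
        span = {
            "trace_id": trace_id,
            "span_id": uuid.uuid4().hex[:16],
            "parent_span_id": parent_span_id,
            "name": f"process_chunk[{start}:{start + len(chunk)}]",
            "start": time.time(),
        }
        # ... do the real work on `chunk` here ...
        span["end"] = time.time()
        spans.append(span)  # in practice: export the span now, not at job end
    return spans
```

Storing `(trace_id, parent_span_id)` alongside the job row in the database lets retries and follow-up jobs join the same trace, which is essentially the DB-propagation scheme described above.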
Hm from what I’ve seen it emits metrics at a regular interval just like Prometheus. Maybe I’m thinking of something else though.
This is sort of all just a reframing of existing technologies.
Span = an event (which is basically just a log with an associated trace), plus some data fields. Trace = a log for a request with a unique ID.
A useful thing about opentelemetry is that there's auto-instrumentation so you can get this all out-of-the-box for most JVM apps. Of course you could probably log your queries instead, so it's not necessarily a game-changer but a nice-to-have.
Also the standardization is nice.
Span has a beginning and an end time. Event typically just has a time when it happened.
yeah, but spans can have events!
So, events are recursive?
I always preach the isomorphism between traces and logs, but you left out the key thing. A span is a log entry associated with a trace, but the other key attributes of the span are its own unique identifier and a reference to the event that caused it. With those three attributes you can interpret the trace as a causal graph.
True. I think I’m emphasizing their similarities because what I’m seeing is companies treating them as unrelated (eg splunk and signalfx making entirely different query languages and visualization tools for logs vs spans)
Imo spans and logs should be understood as the same and displayed and queried the same (it’s trivial to add span id to each log), it almost feels like people are trying to make something trivially simple seem more substantial or complex
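Stamping the active span onto every log record really is a small amount of code. A minimal sketch in plain Python, where the context dict is a hypothetical stand-in for the tracing SDK's current-span lookup:

```python
import logging

# Hypothetical per-request context; a real app would read the current
# span from the tracing SDK (e.g. via a contextvar).
current_context = {"trace_id": "-", "span_id": "-"}

class SpanContextFilter(logging.Filter):
    """Stamp every log record with the active trace and span ids,
    so logs and spans can be joined in one query."""
    def filter(self, record):
        record.trace_id = current_context["trace_id"]
        record.span_id = current_context["span_id"]
        return True

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter(
    "%(asctime)s %(levelname)s trace=%(trace_id)s span=%(span_id)s %(message)s"))
logger = logging.getLogger("app")
logger.addHandler(handler)
logger.addFilter(SpanContextFilter())
```

With the ids on every line, the same query language can pivot between a log search and the trace view.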
Traces and spans can be extended from or added to existing logging, but they aren't the same.
Logs are point in time, spans are a duration. Logs are flat, spans have a hierarchy.
It's the difference between logging a message in a function, and logging the beginning and end of a function while noting the specific instance of the fn caller.
If you have many threads or callers to the same function that difference is critical in tracing causality of failures or any other type of action of note.
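The begin/end-with-a-parent idea can be sketched with a tiny context manager. The span dicts and the `finished` list are illustrative stand-ins for a real exporter, and a real implementation would keep one stack per thread or task:

```python
import time
import uuid
from contextlib import contextmanager

_stack = []    # current span stack (per-thread in a real implementation)
finished = []  # stand-in for an exporter

@contextmanager
def span(name):
    """Record a duration with an explicit parent, unlike a flat log line."""
    s = {"name": name, "span_id": uuid.uuid4().hex[:16],
         "parent": _stack[-1]["span_id"] if _stack else None,
         "start": time.time()}
    _stack.append(s)
    try:
        yield s
    finally:
        s["end"] = time.time()
        _stack.pop()
        finished.append(s)

def handle_request():
    with span("handle_request"):
        with span("query_db"):
            pass  # each call gets its own span, so concurrent callers stay distinct

handle_request()
```

Because every span carries its parent's id, two threads calling `query_db` at once produce two distinct spans under two distinct parents, which is exactly the causality a flat log line loses.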
I've been tasked with adding telemetry to an AWS based service at work:
CLI -> Web API Gateway -> Lambda returning a signed S3 URL
S3 upload -> SQS -> Lambda which writes to S3 and updates a Dynamo record -> CLI polls for changes
This flow isn't only over HTTP and relies on AWS to fire events. I worked around this by embedding the trace ID into the signed URL metadata. It doesn't look like this is possible with all AWS services.
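The metadata-propagation workaround can be sketched generically. The carrier dict below stands in for S3 object metadata or SQS message attributes, and the `traceparent` value follows the W3C Trace Context format; the helper functions are illustrative, not real AWS or OTel APIs:

```python
def inject(carrier, trace_id, span_id):
    """Write the current context into any string->string carrier
    (S3 object metadata, SQS message attributes, signed-URL params)."""
    carrier["traceparent"] = f"00-{trace_id}-{span_id}-01"

def extract(carrier):
    """Recover the upstream context on the consumer side."""
    _, trace_id, span_id, _ = carrier["traceparent"].split("-")
    return trace_id, span_id
```

The OTel SDKs expose this same inject/extract pattern for arbitrary carriers, so the hard part is less the API and more finding an AWS service hop that actually lets you attach metadata.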
I wonder if X-Ray can help here?
It can also be tedious to initialize spans everywhere. Aspects could help a lot here and orchestrion [0] is a good example of how it could be done in Go. I haven't found an OTEL equivalent yet (though haven't looked hard).
[0] - https://datadoghq.dev/orchestrion/docs/architecture/#code-in...
There’s an OTel SIG working on something similar, based on orchestrion and some other prior art, so it's just a matter of time!
> Metrics tell you what changed. Logs tell you why something happened. Traces tell you where time was spent and how a request moved across your system.
Maybe the first time I've read a crystal-clear difference between metrics, logs, and traces.
nice post.
The amount of additional code that it needs is horrible. We will now have to spend more brain juice on telemetry when working on a feature.
It’s really not that bad; integrating it with dashboards is where I found most of the difficulty (due to bad documentation). I spent 4 days implementing observability for this new backend project I’m working on. OTEL logging, tracing, and metric emission took less than a day to implement; instrumentation was very well documented. When I tried to integrate with Grafana dashboards, that’s when things started getting pretty frustrating…
I work for Pydantic. We make Logfire, a commercial OTEL backend. But we’ve made wrappers around the OTEL SDKs in various languages that simplify configuration and usage. They can be used with any OTEL compatible backend (although we’d love if you try our SaaS offering):
- JavaScript / Typescript: https://github.com/pydantic/logfire-js
- Rust: https://github.com/pydantic/logfire-rust
- Python: https://github.com/pydantic/logfire
Thanks for your comment! It has given me an idea for a project: a simple library that provides a Python decorator that can be used to include basic telemetry for functions in Python code (think spans with the input parameters as attributes): https://github.com/diegojromerolopez/otelize
Feedback welcome!
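A decorator like that can be sketched without any SDK at all. This is a hypothetical minimal version, not the linked project's implementation; a real one would open an OTel span instead of building a dict:

```python
import functools
import time

def traced(fn):
    """Wrap a function in a minimal 'span': name, arguments, duration.
    A real implementation would hand this to the tracing SDK rather
    than stashing it on the wrapper."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.time()
        try:
            return fn(*args, **kwargs)
        finally:
            wrapper.last_span = {
                "name": fn.__qualname__,
                "attributes": {"args": repr(args), "kwargs": repr(kwargs)},
                "duration_s": time.time() - start,
            }
    return wrapper

@traced
def add(a, b):
    return a + b
```

Recording arguments as attributes is convenient but worth hedging in production: parameters can contain secrets or PII, so a real decorator should allow redaction.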
There’s certainly some overhead, nothing is free. But the tradeoff is better insight into your system and better tools to validate issues when they arise. It can be very powerful in those scenarios.
Ive spent countless hours on issues where customers complain about performance or a bug and it just can’t be reproduced. Telemetry allows us to get more information to locate and fix these issues.
I don't really agree. It's mostly setup done once, like configuring it and, for example, attaching some span generator to the library you use to talk to the database. Then future queries get it "for free". And it's just a single line if you want something custom, using an annotation in Java or a `with` block in Python, for instance.
Nah, if you have an important application this is very low cost for adding tons of insight into how your app is running.
What clicked for me is:
A span is a key-value attribute about some point in time event
A trace is a DAG of spans that tells you a story about some related events
What do you mean exactly by "point in time event"?
As I understand it, a metric is information at a point in time.
A span however has a start timestamp and end timestamp, and is about a single operation that happens across that time.
https://opentelemetry.io/docs/specs/otel/metrics/
vs
https://opentelemetry.io/docs/specs/otel/trace/api/#span
Trying to use OTel in any scenario outside of web backends, such as desktop, is a frustrating exercise in trying to find exactly which small subset you should use. I wish they had more examples for other types of software.
I agree. An anecdote:
A while ago I was working on some CUDA kernels for n-body physics simulations. It wasn’t too complicated and the end result was generative art. The problem was that it was quite slow and I didn’t know why. Well the core of the application was written in Clojure so I wrote a simple macro to wrap every function in a ns with a span and then ship all the data to jaeger. This ended up being exactly what I needed - I found out that the two slowest functions were data transfer between the GPU memory and writing out a frame (image) to my disk.
In many other places I see the usefulness of this approach but OTel is too often too geared towards HTTP services. Even simple async/queue processing is not as simple. Though, there have been improvements (like span links and trace links).
Nice summary at the start.
Is there anything that wraps multiple requests?
I doubt "wraps" but almost certainly what you're shopping for is a correlation identifier on the (logs, traces, metrics) that would enable you to group the related requests. Sometimes just the session id can get you where you want to go, but in more complicated setups you may have to annotate from the client side to indicate "I'm doing these 5 things as part of this one logical operation"
Good article, thanks for sharing.
While I do like the comprehensive writeup, there's something about the style which triggers my "it's AI-generated" reflex...