Ask HN: Combine central monitoring with platform embedded monitoring

2 points by user568439 a year ago

My company is using a central monitoring tool (Datadog) and the rule is that everything should be monitored there.

However, the landscape for which I'm responsible has its own monitoring tool, which comes with a lot of out-of-the-box settings and advanced options, including dedicated applications for different kinds of monitoring (health status, performance, end-to-end, etc.). It also offers an OpenTelemetry API that can be called by the central monitoring tool to fetch all the data.

If I use the dedicated monitoring tool, I benefit from a lot of options like filtering, analytics, drill-down, seamless navigation, almost-ready E2E monitoring, etc. I only have to configure some thresholds and decide whether they trigger an alert, create a ticket or start some automation. But I'm told that everything should be visible in the central tool, and that only the central tool can create incident tickets and alerts for 1st line support.

If the central monitoring is to be used, then I basically have to manually replicate configuration/code to process the OpenTelemetry data. I also lose a lot of flexibility because I'm not the owner of the tool, and the team responsible for it doesn't understand my landscape and is slow to react to any change I request.

All in all, I would still be using the dedicated tool to investigate issues, because it provides much more detailed info with near-zero effort. Therefore the only benefit of the central tool is that 1st line support would see the status in their dashboard and would have a bit more understanding of the tickets they get, since they link ticket history and resolution outcomes to their monitors.

I don't want to go rogue in monitoring my landscape, and I also benefit from 1st line support having a bit of awareness of the landscape. But apart from that, I would like to use the dedicated tool.

Do you have any ideas on how to better combine both options? My first idea is to aggregate the monitoring for the central tool so that it is much less granular and just detects something like "There is an issue with Health Monitoring for the system XXX", while the dedicated tool would provide the details, like "Certificate YYY of system XXX is going to expire in 15 days".

However, I must still be granular enough to control priorities and ensure the alert is sent to the correct support team. This already forces me to start reworking things that I have readily available when setting the thresholds in my dedicated tool.
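
To make the aggregation idea concrete, here is a minimal sketch of the roll-up I have in mind. It is only an illustration under assumptions: the `Finding` fields, the severity scale, the team names and the extra example findings are all hypothetical, and it is not tied to any real Datadog or vendor API. Detailed findings from the dedicated tool are grouped per system, monitoring category and support team, and each group becomes one coarse alert that keeps the highest severity, so priority and routing survive the aggregation.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical severity scale (an assumption); higher means more urgent.
SEVERITY_ORDER = {"info": 0, "warning": 1, "critical": 2}


@dataclass(frozen=True)
class Finding:
    """One detailed finding as the dedicated tool might report it (hypothetical shape)."""
    system: str
    category: str
    severity: str
    team: str
    detail: str


def roll_up(findings):
    """Group findings by (system, category, team) and emit one coarse alert per group."""
    groups = defaultdict(list)
    for finding in findings:
        groups[(finding.system, finding.category, finding.team)].append(finding)

    alerts = []
    for (system, category, team), items in sorted(groups.items()):
        # The coarse alert inherits the worst severity found in its group.
        worst = max(items, key=lambda f: SEVERITY_ORDER[f.severity])
        alerts.append({
            "title": f"There is an issue with {category} for the system {system}",
            "priority": worst.severity,
            "team": team,
            "detail_count": len(items),
        })
    return alerts


# Made-up example data; only the certificate message comes from the post above.
findings = [
    Finding("XXX", "Health Monitoring", "warning", "platform-team",
            "Certificate YYY of system XXX is going to expire in 15 days"),
    Finding("XXX", "Health Monitoring", "critical", "platform-team",
            "Hypothetical: a health check is failing"),
    Finding("XXX", "Performance", "warning", "db-team",
            "Hypothetical: response time above threshold"),
]
alerts = roll_up(findings)
for alert in alerts:
    print(alert)
```

Including the team in the grouping key is what keeps routing intact: two findings in the same category that belong to different teams produce two separate central alerts, while the details stay in the dedicated tool.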

fhwang a year ago

How is the data getting into these tools in the first place? If your applications are instrumented with OpenTelemetry, you can use the OpenTelemetry Collector as the first hop, and then you might be able to send the data to both tools at the same time.

https://opentelemetry.io/docs/collector/

user568439 a year ago

There is a proprietary agent in my system collecting the data and sending it to the monitoring tool in a proprietary format. It's still quite simple and easy to parse though.
However, getting the data into Datadog is not the problem. The problem is losing ready-to-use monitoring tools and apps designed to work for my specific software, and having to configure a much more flexible tool like Datadog to do things I already have available.