Jedd a day ago

> So I have two dead man switch’s.

I am reminded of an aphorism about having a problem and deciding to use regex.

> Historical data: I’m not chasing down grand mysteries that require fleet-wide aggregate metrics.

Everyone believes this... until it isn't true, and then you find yourself needing logs from the last two weeks.

For home labs, log aggregation is an easy problem to deal with these days, and a secure sink to send all your logs to has (potentially) more than one benefit.

Anecdote: I've just been tracking some unpleasant FLUSH CACHE EXT errors on my aging pre-owned Xeon box. Having an understanding of the frequency and distribution of those errors on the hypervisor, and of their correlation with different but related errors presenting in the VMs, was a) very useful, and b) not something I'd have predicted I'd need beforehand.

danesparza a day ago

Uptime Kuma (https://github.com/louislam/uptime-kuma). With email notifications. So much simpler, and free.

  • szszrk 11 hours ago

    Last time I checked, it lacked a real API or a way to be configured via config files.

    I find Gatus much better thought through.

    - https://github.com/TwiN/gatus

    • danesparza 10 hours ago

      Not that I think an API is fundamental to a monitoring tool, but it happens to have one: https://github.com/louislam/uptime-kuma/wiki/API-Documentati...

      • szszrk 10 hours ago

        It matters if you have a lot of services and want to store config in git, or do auto-discovery. Even on a medium homelab it's important.

        Not necessarily an API, but a config file would be nice.

        Also... there is a big disclaimer at the very top of the page.

    • Havoc 9 hours ago

      There is a Python library to load stuff into Kuma. You just need a password and URL.

      • szszrk 6 hours ago

        That would be a workaround.

        Do you recall the library name?

  • seriocomic a day ago

    Love this - tried it. The problem, as I see it, is that these still require hosting - and ideally (again, as I see it) a self-hosted script that monitors internal/homelab things also requires its own monitoring.

    Short of paying for a service (which somewhat goes against the grain of trying to host all your own stuff), the closest I can come up with is relying on a service outside your network that has access to your network (via a tunnel/vpn).

    Given that a lot of my own networking set-up (DNS/domains/tunnels, etc.) is already managed via Cloudflare, I'm thinking of using some compute at that layer to provide a monitoring service. Probably something to throw next at my new LLM developer...

    • hammyhavoc 18 hours ago

      UptimeFlare looks promising—runs in a Cloudflare Worker: https://github.com/lyc8503/UptimeFlare

      If anybody wants to be a clever clogs, combining this and Uptime Kuma would be genius. What I want is redundancy. E.g., if something can't be reached, check from the other; likewise, if one service takes a crap, continue monitoring via the other and sync up the histories once they're both back online.

      This "local or cloud" false dichotomy makes no sense to me—a hybrid approach would be brilliant.

      If anyone manages this, email me: me@hammyhavoc.com. I would love to hear about it.

imiric 19 hours ago

This is interesting. I appreciate the simplicity and DIY aspects. Is it available as a repo somewhere?

I recently had to troubleshoot a hanging issue on one of my servers, so I needed something that could ship logs. The modern observability stack is a deep pit of complexity, but OpenTelemetry is a standard, and there are reasonably simple tools in the ecosystem. I knew I didn't want a behemoth like Grafana, and I was aware of SigNoz, though it seems janky. Then I stumbled upon OpenObserve, and it looked promising. Setting it up on a spare mini PC, with opentelemetry-collector on the server, was pretty straightforward. Getting the collector configuration right took some trial and error, though.

I have to say, I'm quite satisfied with this setup. I ended up installing the collector on other machines, so it's almost like a proper observability system now :)

The graphs are nice. I can expand it to monitor anything else I would need. I haven't set up alerts yet, but it's possible.

I'm not really concerned about monitoring the monitor. It's not a big deal for my use case if it goes down. Metrics and logs will be submitted when it's back up, since they're cached on the servers. Besides, I'm only running OpenObserve on the machine, so there aren't many moving parts.

Anyway, all this is to say that sometimes there's more to be gained from using off-the-shelf tooling instead of rolling your own, even if it involves more complexity. Server monitoring is an old tradition, and there are many robust solutions out there. OTLP isn't that bad, especially at smaller scales, and it opens the door to a large ecosystem. It would be foolish not to take advantage of that.
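
To give a sense of how little code the push side involves at small scale, here's a rough Python sketch, assuming the opentelemetry-sdk and opentelemetry-exporter-otlp packages and an OTLP/gRPC receiver on the collector (the endpoint, metric name, and labels are placeholders):

    # Rough sketch: push a custom metric over OTLP to a collector / OpenObserve
    # endpoint. The endpoint, metric name, and labels are placeholders.
    from opentelemetry import metrics
    from opentelemetry.sdk.metrics import MeterProvider
    from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
    from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter

    exporter = OTLPMetricExporter(endpoint="http://otel-collector:4317", insecure=True)
    reader = PeriodicExportingMetricReader(exporter, export_interval_millis=15000)
    metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))

    meter = metrics.get_meter("homelab")
    backup_runs = meter.create_counter("backup_runs_total")
    backup_runs.add(1, {"host": "nas", "status": "ok"})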

Scaevolus a day ago

I use Prometheus + Prometheus Alertmanager + Any Free Tier paging system (currently OpsGenie, might move to AlertOps).

Having a feature-rich TSDB backing the alerting minimizes the time spent adding alerts, and the UX of being able to write a potential alert expression and see when in the past it would have fired is amazing.
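
For what it's worth, that back-testing doesn't even need a UI; a rough Python sketch against the Prometheus range API (the expression, metric, and URL are placeholders):

    # Rough sketch: back-test a would-be alert expression against the Prometheus
    # range API. The expression, metric name, and URL are placeholders.
    import time
    import requests

    PROM = "http://localhost:9090"
    expr = 'node_filesystem_avail_bytes{mountpoint="/"} < 5e9'  # hypothetical alert expression

    end = time.time()
    resp = requests.get(f"{PROM}/api/v1/query_range", params={
        "query": expr,
        "start": end - 86400,  # look back over the last 24 hours
        "end": end,
        "step": "60s",
    })
    # A filter expression only returns samples where the condition held, i.e.
    # the moments the alert would have fired.
    for series in resp.json()["data"]["result"]:
        print(series["metric"], "matched at", len(series["values"]), "points")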

Just two processes to run, either bare or containerized, and you can throw in a Grafana instance if you want better graphs.

  • scrapheap 17 hours ago

    Prometheus fronted by Grafana is great and I use it a lot for work, but I can understand why they don't want to deal with it just for monitoring their home network - and writing your own monitoring software can certainly help you appreciate what you get from Prometheus.

Evidlo a day ago

My solution is to just be OK with HTTP status checking (run a webserver on important machines), and use a service like updown.io, which is so cheap it's almost free.

E.g., for one machine, hourly checking is ~$0.25/year.

  • remram a day ago

    Do you do regular backups? If your backup system breaks and stops making new backups, what will let you know? What if your RAID is failing, running out of space, or remounted read-only after an error?

    I have found that "machine is online" is usually not what I need monitoring for, at all. I'll notice if it's down. It's all the mission-critical-but-silently-breakables that I bother to monitor.

    • dervjd 20 hours ago

      Not OP, but https://healthchecks.io is great for monitoring automated tasks like backup scripts. It also has the option to immediately signal failure and send an alert: https://healthchecks.io/docs/signaling_failures/
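
      The pattern is just a ping at the end of the job, roughly like this in Python (the check UUID and backup command are placeholders):

          # Rough sketch: run a backup and report the outcome to healthchecks.io.
          # The check UUID and backup command are placeholders.
          import subprocess
          import urllib.request

          PING = "https://hc-ping.com/your-check-uuid"

          result = subprocess.run(["/usr/local/bin/backup.sh"])
          url = PING if result.returncode == 0 else PING + "/fail"
          urllib.request.urlopen(url, timeout=10)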

      • remram 8 hours ago

        That's what I use for cron-type things. Experience has been great. I also run it as a watchdog in my alertmanager container, so I am alerted if the alerts are broken.

    • beingflo 11 hours ago

      updown.io also has a relatively new feature called cron monitoring[0] that allows you to regularly check in to signal success. If there has been no check-in within a configured time, it will alert you. For backups you could add a simple curl somewhere into your backup process to do just that.

      [0] https://updown.io/doc/how-pulse-cron-monitoring-works

Tractor8626 a day ago

Even in a homelab you should totally monitor things like the following (a rough sketch of two of these checks is below):

- raid health

- free disk space

- whether backup jobs are running

- SSL certs expiring
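
A rough, stdlib-only Python sketch of two of those checks (the thresholds and hostnames are just examples):

    # Rough sketch of the disk-space and cert-expiry checks above.
    # Thresholds and hostnames are illustrative.
    import datetime
    import shutil
    import socket
    import ssl

    def disk_ok(path="/", min_free_gb=20):
        free_gb = shutil.disk_usage(path).free / 1e9
        return free_gb >= min_free_gb, f"{free_gb:.1f} GB free on {path}"

    def cert_ok(host, port=443, min_days=14):
        ctx = ssl.create_default_context()
        with socket.create_connection((host, port), timeout=5) as sock:
            with ctx.wrap_socket(sock, server_hostname=host) as tls:
                not_after = tls.getpeercert()["notAfter"]
        expires = datetime.datetime.fromtimestamp(
            ssl.cert_time_to_seconds(not_after), tz=datetime.timezone.utc
        )
        days_left = (expires - datetime.datetime.now(datetime.timezone.utc)).days
        return days_left >= min_days, f"{host} cert expires in {days_left} days"

    for ok, msg in (disk_ok(), cert_ok("example.org")):
        print("OK   " if ok else "ALERT", msg)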

  • ahofmann a day ago

    One could also manually look through this stuff every Sunday at 5 pm. In a homelab, this can be enough.

    • sthuck a day ago

      Look, I agree, but one can also manage with an always-on PC and an external hard drive instead of a homelab. It's part hobby, part learning experience.

      Also, if you have kids aged 0-6 you can't schedule anything reliably.

    • tough a day ago

      One could also just wait for things to not work before trying to fix them.

      • dewey a day ago

        For backups that's usually not the best strategy.

        • zamadatix 10 hours ago

          Assuming backups are the route you want to go, it all depends on how you use your "homelab". To some people, a homelab is more cattle than pet, and it's easier to just hit redeploy rather than restore. To others, a homelab means the place they store their family photos, run game servers for friends, or do their work from.

          Because of the wide breadth of what a homelab can mean, it's really hard to make universal statements about what is always good. Judging by the style of the article, the author probably wants backups and failure notifications in some form, if that's not already covered outside of their custom monitoring.

nullify88 19 hours ago

Grafana Cloud has quite a nice free tier. If you use Alloy, you don't need any persistence locally; everything just gets shipped off.

Similar to the author, I want to run a minimalist monitoring setup and currently just use Glances. But Grafana Cloud might be my first choice if I need to expand the setup.

  • sjsdaiuasgdia 11 hours ago

    I was using the Grafana Cloud free tier for a couple of years, then they made some changes and started yelling at me about alerts I had set up based on Loki log searches. A new metric appeared in the usage dashboard... there were now limits on how I could access the logs I'd stored, on top of the limits on how much log data I could store.

    I did figure out some ways to reduce the log query usage of the alerts and made the "you need to upgrade to a paid tier!" notices stop. Still, the experience was the straw that broke the camel's back. I'd already been getting somewhat frustrated by the 2-week retention and 10-dashboard limit.

    FWIW, it wasn't too difficult to stand up the Docker containers for Grafana, Loki, and Prometheus for my own usage.

compumike a day ago

I appreciate the "How to monitor the monitor?" section. Always need a meta-monitor! :)

Hope you might give us a try at https://heiioncall.com/ and let me know if that fits. (Disclosure: our team is building and operating it as a simple monitoring, alerting, on-call rotations solution.) We have the cron job heartbeats, HTTP healthchecks, SSL certificate expiration, etc etc all in one simple package. With mobile app alerts (critical and non-critical), business hours rules, etc. And a free tier for homelabbers / solo projects / etc. :)

Edit: since you mentioned silencing things in your post, we also have very flexible "silence" buttons, which can set silence at various levels of the hierarchy, and can do so with a predefined timer. So if you know you want things to be silenced because you're fixing them, you can click one button and silence that trigger or group of triggers for 24 hours -- and it'll automatically unsilence at that time -- so you don't have to remember to manually manage silence/unsilence status!

Havoc a day ago

I personally found Uptime Kuma to be the easiest because it has a Python API package to bulk-load stuff into it.

Much easier to edit a list in VS Code than to click around a bunch in an app.
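
Roughly like this, assuming the community uptime-kuma-api package (the URL, credentials, and monitor list are placeholders):

    # Rough sketch: bulk-load HTTP monitors into Uptime Kuma from a plain list.
    # Assumes the third-party uptime-kuma-api package; values are placeholders.
    from uptime_kuma_api import UptimeKumaApi, MonitorType

    SITES = [
        ("nas", "https://nas.lan"),
        ("git", "https://git.lan"),
        ("media", "https://media.lan"),
    ]

    api = UptimeKumaApi("http://kuma.lan:3001")
    api.login("admin", "password")
    for name, url in SITES:
        api.add_monitor(type=MonitorType.HTTP, name=name, url=url)
    api.disconnect()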

  • hammyhavoc 18 hours ago

    My gripe is the lack of support for multiple users. E.g., a family member who gets sick of receiving a notification can't just toggle it themselves.

valeriansaliou 19 hours ago

This resembles how I monitor all the infrastructure I run. One deployment has 150 small independent VMs, for which I had to build a custom open-source microservice monitoring tool that I still use to this day: https://github.com/valeriansaliou/vigil

There’s no certificate expiration monitoring just yet, but everything else is there: poll probes (active ICMP or TCP probes), push probes (reporting HTTP API for apps), and local probes (reporting HTTP API for sub-Vigil for firewalled infrastructure parts).

justusthane a day ago

I’ve been facing a similar search for an ultra-simple but ultra-extensible monitoring solution for my homelab. I’ve had the idea to write a Python program where the main script is just responsible for scheduling and executing the checks, logging, and alerting based on set thresholds.

All monitoring would be handled via plugins, which would be extremely easy to write.

It would ship with a few core plugins (ping, http, cert check, maybe snmp), but you could easily write a plugin to monitor anything else — for example, you could use the existing Python Minecraft library and write a plugin to monitor your Minecraft server. Or maybe even the ability to write plugins in any language, not just Python.
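
A very rough sketch of what that plugin contract and scheduler loop could look like (everything here is hypothetical, just to show the shape):

    # Hypothetical shape of the idea: each plugin is a module in ./plugins
    # exposing check() -> (ok: bool, message: str).
    import importlib
    import pkgutil
    import time

    import plugins  # a package directory holding ping.py, http.py, cert.py, ...

    def load_plugins():
        return [
            importlib.import_module(f"plugins.{mod.name}")
            for mod in pkgutil.iter_modules(plugins.__path__)
        ]

    def alert(name, message):
        print(f"ALERT {name}: {message}")  # swap in email/ntfy/Pushover here

    while True:
        for plugin in load_plugins():
            ok, message = plugin.check()
            if not ok:
                alert(plugin.__name__, message)
        time.sleep(60)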

I’m not a developer and I’m opposed to vibe coding, so it’ll be slow going :)

whatever1 17 hours ago

Until the PC becomes unresponsive and needs a hard reset. In that case you are out of luck unless you have enterprise-grade servers, or some sort of smart plug that you can remotely power cycle.

Spooky23 a day ago

I’m using node_exporter feeding Prometheus and Grafana. I also use Uptime Kuma, and send alerts via Pushover.

It’s shockingly easy to set up. I have the monitoring stack living on a GCP host that I set up for various things, and it's connected via Tailscale.

It actually paid for itself by alerting me to low voltage events via NUT. I probably would have lost some gear to poor electrical conditions.

JZL003 a day ago

I do something kinda similar. I have a Node Express server with lots of little async jobs; I throw them all into a Promise.all, and if they're all good, send 200; if not, send 500 plus the failing jobs. Then free uptime monitors check every few hours and will email me if "the site goes down" = some error. It's kinda like a multiplexer to stay within their free monitoring limit, and it's easy to add more tests.
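
The same multiplexer idea is easy to sketch in other stacks; a rough Python equivalent (the checks and port are placeholders):

    # Rough sketch: run all checks concurrently behind one endpoint and return
    # 200 only if everything passed. Check names and URLs are placeholders.
    from concurrent.futures import ThreadPoolExecutor
    from http.server import BaseHTTPRequestHandler, HTTPServer
    import urllib.request

    def check_http(url):
        urllib.request.urlopen(url, timeout=5)  # raises on errors and non-2xx

    CHECKS = {
        "nas": lambda: check_http("https://nas.lan/health"),
        "git": lambda: check_http("https://git.lan"),
    }

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            with ThreadPoolExecutor() as pool:
                futures = {name: pool.submit(fn) for name, fn in CHECKS.items()}
            failed = [name for name, fut in futures.items() if fut.exception()]
            body = ("ok" if not failed else "failed: " + ", ".join(failed)).encode()
            self.send_response(200 if not failed else 500)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    HTTPServer(("", 8080), Handler).serve_forever()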

frenchtoast8 a day ago

At work I use Datadog, but it's very expensive for a homelab: $15/mo per host (and for cost reasons I prefer using multiple cheap servers over a single large one).

New Relic and Grafana Cloud have pretty good free-plan limits, but I'm paying for that in effort, because I don't use either at work, so it's not what I'm used to.

  • SteveNuts a day ago

    The Datadog IoT agents are cheaper, but still probably more than you’d want to spend on a lab.

    You also only get system metrics, no integrations - but most metrics and checks can be done remotely with a single dedicated agent

jamesholden a day ago

OK... so your solution is using, at minimum, a $5/month service. Yikes, I'd prefer something like Pushover before that. :/

  • faster a day ago

    You can self-host ntfy.sh but then you need to find a place outside of your infra to host it.

    • tony-allan a day ago

      I have AWS based services to monitor (servers/websites/etc) and use my homelab system to monitor resources that I am interested in.

      I just use a simple script that is run every 60 seconds and a list of resources to check.

  • tough a day ago

    or a shell script

loloquwowndueo a day ago

Did he reinvent monit?

Even a quick Prometheus + alert manager setup with two docker containers is not difficult to manage - mine just works, I seldom have to touch it (mainly when I need to tweak the alert queries).

I use Pushover for easy API-driven notifications to my phone; it’s a one-time fee of $7 or so, and it was money well spent.

  • atomicnumber3 a day ago

    I have a similar setup, Prometheus and Grafana (Alertmanager is a separate thing from the normal Grafana setup, right? I'm not even using that), and I use Discord webhooks for notifications to my phone (I just @ myself or use channel notification settings).

fuzzfactor 9 hours ago

>I can’t remember why I decided that [ . . . ], but I remember thinking hard about it, and thinking I was quite the scholar for thinking hard about this. If you figure it out and I’m correct, then please let me know.

A whole chain of these can end up being the key to overcoming otherwise insurmountable obstacles.

Which can be extremely unlikely for anybody else to replicate in the future, especially if they don't even get the first step right - perhaps not even yourself :\

Something like that can be quite a moat for a technology developer ;)

KaiserPro a day ago

I understand your pain.

I used to have Sensu, but it was a pain to keep updated (and it didn't work that well on old RPis).

But what I did find to be a good alternative was Telegraf -> some sort of time-series DB (I still really like Graphite; InfluxQL is utter horse shit, and Prometheus's fucking pull model is bollocks).

Then I could create alert conditions in Grafana. At least that was simple.

However, the alerting in Grafana moved from "move a handle, adjust a threshold, get a configurable alert" to "craft a query, get loads of unfilterable metadata as an alert".

It's still good enough.

  • cyberpunk a day ago

    Why is the pull model bollocks? I’ve been building monitoring for stuff since Nagios and Zabbix were the new hot tools, and I can’t really imagine preferring the old-school ways over the pretty-much-industry-standard Prometheus stack these days…

    • mystifyingpoi a day ago

      Both models are totally fine, for their specific use cases.

    • KaiserPro a day ago

      Zabbix is bollocks, and so is Nagios. Having remote root access to all your stuff is utter shite.

      Prometheus as a time-series DB is great; I even like its QL. What I don't like is pull. Sure, there is agent mode, or Telegraf/Grafana Agent. But the idea that I need to hold my state and wait for Prometheus to collect it is utterly stupid. The biggest annoyance is that I need to have a webserver somewhere, with a single god instance (or instances) that can reach out and touch it.

      Great if you have just one network, but a bollock ache if you have any kind of network isolation.

      This means that we are using InfluxDB and its shitty Flux QL (I know we could upgrade, but that's hard).

      • nullify88 15 hours ago

        AFAIK, you can now send OpenTelemetry directly to Prometheus. So effectively it supports both push and pull models.

        • dengolius 8 hours ago

          AFAIK, OpenTelemetry has never been effective compared to Prometheus.

      • cyberpunk a day ago

        Eh, in a standard three-tier setup you're usually okay to pull up and push down, aren't you? Run it in the lower network...

        We're all Kubernetes these days, so I guess I didn’t think about it a lot in recent years.

jauntywundrkind a day ago

There's an article bias towards rejectionism, towards single-shot adventures: "I didn't grok so-and-so, and here are the shell scripts I wrote instead."

Especially for home cloud, home ops, home labs: that's great! That's awesome that you did for yourself, that you wrote up your experience.

But in general I feel like there's a huge missing middle of operations and sysadmin-ery that creates a distorted, weird narrative. There are few people out there starting their journey with Prometheus and blogging helpfully through it. There are few people midway through their k8s work talking about their challenges and victories. The tales of just muddling through, of perseverance, of looking for information and trying to find signal in the noise, are few.

What we get a lot of is "this was too much for me, so I wrote my own thing instead". Or "we have been doing such and such for years and found such and such to shave 20% compute", or "we needed this capability, so we added Z to our k8s cluster like so". The journey is so often missing; we don't have stories of trying and learning. We have stories, like this one, of making.

There's such a background of 'too complex' that I really worry it leads us spiritually astray. I'm happy for articles like this - it's awesome to see ingenuity on display - and there are so many good, amazingly robust tools out there with lots of people happily, or at least adequately, using them. But it feels like the stories of turning back from the attempt, of eschewing the battle-tested, widely adopted software, drive so much of the narrative and have so much more ink spilled over them.

Very thankful for the Flix language putting Rich Hickey's principle that Simple isn't Easy first, for helping re-orient me along the axis of Hickey's old grand guidance. I feel like there's such a loud clamor generally for easy, for scripts you throw together, for the intimacy of tiny systems. And I admire a lot of these principles! But I also think there's a horrible backwardsness that doesn't help, that drives us away from more comprehensive, capable, integrative systems that can do amazing things, and that are scalable both performance-wise (as Prometheus certainly is) and organizationally (in that other people and other experts will also lastingly use and build on them). The preselection for easy is attainable individually and quickly, but real simplicity requires vastly more - so much more thought and planning and structure. https://www.infoq.com/presentations/Simple-Made-Easy/

It's so weird to find myself such a Cathedral-but-open-source fan today. Growing up, the Bazaar model made such sense, had such virtue to it. And I still believe in the Bazaar, in the wide world teeming with different software. But I worry about which lessons are most visible, about what we pass along, about the proliferation of discontent against the really good open source software that we do collaborate on together en masse. It feels like there's a massive self-sabotage going on, that so many people are radicalized and sold a story of discontent against bigger, more robust, more popular open source software. I'd love to hear that view, but I want smaller individuals and voices also making a chorus of happy noise about how far they get, and about how magical and powerful it is that we have so many amazing, fantastic, bigger open source projects that so scalably enable so much. https://en.m.wikipedia.org/wiki/The_Cathedral_and_the_Bazaar

  • cgriswald a day ago

    This is sort of a ramble, so I apologize in advance.

    I love the idea of writing up my ultimately-successful experiences of using open source software. I'm currently working on a big (for me anyway) project for my homelab involving a bunch of stuff I've either never or rarely done before. But... if I were to write about my experiences, a lot of it would be "I'm an idiot and I spent two hours with a valid but bad config because I misunderstood what the documentation was telling me about the syntax and yeah, I learned a bit more about reading the log file for X, but that was fundamentally pointless because it didn't really answer the question." I'd also have to keep track of what I did that didn't work, which adds a lot more work than just keeping track of what did work.

    There's also a social aspect there where I don't want to necessarily reveal the precise nature of my idiocy to strangers over the internet. This might be the whole thing here for a lot of people. "Look at this awesome script I made because I'm a rugged and capable individualist" is probably an easier self-sell than "Despite my best efforts, I managed to scrounge together a system that works using pieces made by people smarter than me."

    I think I might try. My main concern is whether it will ruin the fun. When I set up Prometheus, I had a lot of fun, even through the mistakes. But, would also trying to write about it make it less fun? Would other people even be interested in a haphazard floundering equivalent to reading about someone's experience with a homework assignment? Would I learn more? Would the frustrating moments be worse or would the process of thinking through things (because I am going to write about it) lead to my mistakes becoming apparent earlier? Will my ego survive people judging my process, conclusions, and writing? I don't know. Maybe it'll be fun to find out.

rpcope1 19 hours ago

I mean, you've sort of reinvented Munin. It certainly fits the bill here, and would remove the need to do a bunch of extra work.

dheera a day ago

Honest question, what the hell is everyone monitoring in a home lab that isn't already monitored?

I have an enterprise-grade NAS, and if there's any kind of disk or RAID issue it beeps the shit out of me; I call that enough for home use.

I have a Unifi router, if there is a connection issue it fails over to LTE and I get a notification on my phone.

I have a UPS; if there is a power failure, my lights shut off, my NAS and workstation shut down via NUT, and I can restart them remotely by VPNing into my router and sending WOL packets.

Basically everything is already taken care of.

What the hell else do I need for a home? When I'm away I don't exactly have 10 million users trying to access my system, let alone 1.

  • hammyhavoc 18 hours ago

    It's not uncommon for people to care about failures in containers and scripts.

    E.g., you run a service container that also needs Postgres, Redis, a reverse proxy, a Cloudflare Tunnel and perhaps sidecar worker containers too, like Authentik. People want to know where the problem is immediately without fucking around with 80+ containers.