# TODOs
## ~~digitemp temperature logging~~
~~Done through generic sensor_script logger.~~
## Public IP address logging
Log the public IP address. Reuse `netcup-dns` Python functions.
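
A minimal sketch of what such a logger could look like. The `netcup-dns` functions are not shown in these notes, so `fetch_public_ip` is a hypothetical stand-in; the echo service URL and the `timestamp,ip` CSV layout are assumptions as well.

```python
import csv
import urllib.request
from datetime import datetime, timezone
from pathlib import Path
from typing import Callable


def fetch_public_ip(url: str = "https://api.ipify.org") -> str:
    """Return the public IP as reported by an external echo service (assumed URL)."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.read().decode("ascii").strip()


def log_public_ip(csv_path: Path, fetch: Callable[[], str] = fetch_public_ip) -> str:
    """Append one `timestamp,ip` row to csv_path and return the logged IP."""
    ip = fetch()
    timestamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    with csv_path.open("a", newline="") as f:
        csv.writer(f).writerow([timestamp, ip])
    return ip
```

The fetch function is passed in as a parameter so that it can later be swapped for the `netcup-dns` implementation without touching the CSV code.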
## ~~Rewrite~~
* ~~easier configuration~~
* ~~easier read/write from/to csv~~
* ~~use classes & objects~~
* ~~create plots?~~
* ~~Don't emit a warning again if the value has not increased since the previous log~~
  * ~~Example:~~
    * ~~log1: 30°C OK~~
    * ~~log2: 40°C Warning sent~~
    * ~~log3: 35°C Still above limit, but don't send warning again as value decreased~~
    * ~~log4: 37°C Send another warning: The value increased since last logging~~
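
The warning rule from the example can be sketched as follows. This is an illustration, not necessarily how the rewrite implements it; `should_warn` is a hypothetical name, and the limit of 32°C is an assumption (the example only implies a limit somewhere between 30°C and 35°C).

```python
from typing import Optional


def should_warn(value: float, previous: Optional[float], limit: float) -> bool:
    """Decide whether a warning should be sent for the current log entry."""
    if value <= limit:
        # Value is fine, nothing to report.
        return False
    if previous is None or previous <= limit:
        # First log entry above the limit: always warn.
        return True
    # Already above the limit before: only warn again if the value increased.
    return value > previous


readings = [30, 40, 35, 37]  # log1 .. log4 from the example
previous = None
decisions = []
for value in readings:
    decisions.append(should_warn(value, previous, limit=32))
    previous = value
print(decisions)  # [False, True, False, True]
```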
## Use Grafana to visualize metrics
One can use Prometheus + Grafana to collect and visualize server metrics.
> https://geekflare.com/best-open-source-monitoring-software/
> This list won't be complete without including two fantastic open-source solutions Prometheus and Grafana. It's a DIY solution where you use Prometheus to scrape the metrics from server, OS, applications and use Grafana to visualize them.

As we already collect logs, we should research how to import that data into Grafana.
### Time series
* https://grafana.com/docs/grafana/latest/fundamentals/timeseries/#introduction-to-time-series
E.g. CPU and memory usage, sensor data.
* https://grafana.com/docs/grafana/latest/fundamentals/timeseries/#time-series-databases
A time series database (TSDB) is a database explicitly designed for time series data.
Some supported TSDBs are:
* Graphite
* InfluxDB
* Prometheus
### Installation
* https://grafana.com/docs/grafana/latest/setup-grafana/installation/docker/#alpine-image-recommended
* https://grafana.com/docs/grafana/latest/setup-grafana/installation/docker/#install-official-and-community-grafana-plugins
* https://grafana.com/grafana/plugins/marcusolsson-csv-datasource/?tab=installation
* https://grafana.github.io/grafana-csv-datasource/
* https://grafana.com/grafana/plugins/marcusolsson-json-datasource/?tab=installation
* https://grafana.github.io/grafana-json-datasource/
```shell
sudo docker run --rm \
-p 3000:3000 \
--name=grafana \
-e "GF_INSTALL_PLUGINS=marcusolsson-json-datasource,marcusolsson-csv-datasource" \
grafana/grafana-oss
```
TODO: test csv or json data import tools
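
If the JSON data source turns out to be the easier route, the existing CSV logs would need converting first. A rough, untested sketch of that step; the column names `time` and `value` are placeholders, since the actual CSV layout of the logs is not described here, and the output shape is only one plausible input for the plugin.

```python
import csv
import io
import json


def csv_log_to_json(csv_text: str, time_field: str = "time", value_field: str = "value") -> str:
    """Convert CSV log text (with a header row) into a JSON array of data points."""
    reader = csv.DictReader(io.StringIO(csv_text))
    points = [
        {"time": row[time_field], "value": float(row[value_field])}
        for row in reader
    ]
    return json.dumps(points)
```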
## Netdata - Can be exported to Grafana
* https://github.com/netdata/netdata/blob/master/docs/getting-started/introduction.md
## Monit - An existing monitoring service
### General notes and links
* Monit is a widely used service for system monitoring.
* OPNsense uses Monit: https://docs.opnsense.org/manual/monit.html
* Short slideshow presentation: https://mmonit.com/monit/#slideshow
* https://wiki.ubuntuusers.de/Monit/
* Excellent configuration and usage summary in the Arch Linux Wiki: https://wiki.archlinux.org/title/Monit
* Examples
* https://mmonit.com/wiki/Monit/ConfigurationExamples
* One can use the return code or stdout of an executed shell script
* https://mmonit.com/wiki/Monit/ConfigurationExamples#HDDHealth
```
check program HDD_Health with path "/usr/local/etc/monit/scripts/sdahealth.sh"
every 120 cycles
if content != "PASSED" then alert
# if status > 0 then alert
group health
```
* Documentation
* Event queue - Store events (notifications) if mail server is not reachable
* https://mmonit.com/monit/documentation/monit.html#Event-queue
```
set eventqueue basedir /var/monit
```
* https://mmonit.com/monit/documentation/monit.html#SPACE-USAGE-TEST
```
check filesystem rootfs with path /
if space usage > 90% then alert
```
* https://mmonit.com/monit/documentation/monit.html#PROGRAM-STATUS-TEST
```
check program myscript with path /usr/local/bin/myscript.sh
if status != 0 then alert
```
* https://mmonit.com/monit/documentation/monit.html#PROGRAM-OUTPUT-CONTENT-TEST
* https://mmonit.com/monit/documentation/monit.html#Link-upload-and-download-bytes
```
check network eth0 with interface eth0
if upload > 500 kB/s then alert
if total downloaded > 1 GB in last 2 hours then alert
if total downloaded > 10 GB in last day then alert
```
* https://mmonit.com/monit/documentation/monit.html#MANAGE-YOUR-MONIT-INSTANCES
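
For the `HDD_Health` example above, the contents of `sdahealth.sh` are not shown. A check script of that kind could be sketched in Python as below, assuming it is fed the text output of `smartctl -H` for an ATA disk, which contains a line starting with `SMART overall-health self-assessment test result:`. The function name is hypothetical, and other device types report their health differently and are not handled.

```python
def smart_health(smartctl_output: str) -> str:
    """Extract the overall health verdict (e.g. "PASSED") from `smartctl -H` output."""
    marker = "SMART overall-health self-assessment test result:"
    for line in smartctl_output.splitlines():
        if line.startswith(marker):
            return line[len(marker):].strip()
    # No verdict found: anything other than "PASSED" makes the Monit check alert.
    return "UNKNOWN"
```

Printing the returned string to stdout would be enough for the `if content != "PASSED" then alert` test.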
### Monitoring all your monit instances
* Monit itself only monitors the current system
* Multi-server monitoring is a paid extra service called M/Monit :/
* But there are other open source services for this
* https://github.com/monmon-io/monmon#why-did-you-create-monmon
### Setup
Install and start:
```shell
sudo pacman -S --needed monit lm_sensors smartmontools
sudo systemctl start monit
sudo systemctl status monit | grep 'Active: active (running)'
```
Print default configuration:
```shell
sudo cat /etc/monitrc | grep -v '^#'
#=> set daemon 30
#=> - A cycle is 30 seconds long.
#=> set log syslog
#=> - We will overwrite this config value later on.
#=> set httpd port 2812
#=> - Only listen on localhost with username admin and pwd monit.
```
Include `monit.d`:
```shell
sudo mkdir -p /etc/monit.d/
sudo grep -q '^include' /etc/monitrc || echo 'include /etc/monit.d/*' | sudo tee -a /etc/monitrc
```
Log to file:
```shell
sudo install -m700 /dev/stdin /etc/monit.d/log <<< 'set log /var/log/monit.log'
sudo systemctl restart monit
# tail -f /var/log/monit.log
```
System:
```shell
sudo install -m700 /dev/stdin /etc/monit.d/system <<< 'check system $HOST
if filedescriptors >= 80% then alert
if loadavg (5min) > 2 for 4 cycles then alert
if memory usage > 75% for 4 cycles then alert
if swap usage > 50% for 4 cycles then alert'
sudo systemctl restart monit
```
Filesystem:
```shell
sudo install -m700 /dev/stdin /etc/monit.d/fs <<< 'check filesystem rootfs with path /
if space usage > 80% then alert'
sudo systemctl restart monit
```
SSL options:
* https://mmonit.com/monit/documentation/monit.html#SSL-OPTIONS
```shell
sudo install -m700 /dev/stdin /etc/monit.d/ssl <<< '# Enable certificate verification for all SSL connections
set ssl options {
verify: enable
}'
sudo systemctl restart monit
```
Mailserver, alerts and eventqueue:
* https://mmonit.com/monit/documentation/monit.html#Setting-a-mail-server-for-alert-delivery
* https://mmonit.com/monit/documentation/monit.html#Setting-an-error-reminder
* https://mmonit.com/monit/documentation/monit.html#Event-queue
* If no mail server is available, Monit can queue events in the local file-system for retry until the mail server recovers.
* By default, the queue is disabled and if the alert handler fails, Monit will simply drop the alert message.
```shell
sudo install -m700 /dev/stdin /etc/monit.d/mail <<< 'set mailserver smtp.mail.de
port 465
username "langbein@mail.de"
password "<smtp-password>"
using SSL
with timeout 20 seconds
set mail-format {
from: langbein@mail.de
subject: $SERVICE - $EVENT at $DATE
message: Monit $ACTION $SERVICE at $DATE on $HOST: $DESCRIPTION.
}
set alert daniel@systemli.org with reminder on 10 cycles
set eventqueue basedir /var/monit'
sudo systemctl restart monit
sudo monit -v | grep 'Mail'
```
Test alert:
* https://wiki.ubuntuusers.de/Monit/#E-Mail-Benachrichtigungen-testen
* It is enough to restart monit. It will send an email that its state has changed (stopped/started).
* But if desired, one can also create a test for a non-existing file:
```shell
sudo install -m700 /dev/stdin /etc/monit.d/alerttest <<< 'check file alerttest with path /.nonexistent.file'
sudo systemctl restart monit
```
Example script - run a speedtest:
```shell
sudo pacman -S --needed speedtest-cli
sudo install -m700 /dev/stdin /etc/monit.d/speedtest <<< 'check program speedtest with path /usr/bin/speedtest-cli
every 120 cycles
if status != 0 then alert'
sudo systemctl restart monit
```
Check config syntax:
```shell
sudo monit -t
```
### TODOs
* See Firefox bookmark folder 20230219_monit.
* Disk health
* BTRFS balance
* Save disk usage and temperatures to CSV log file
* e.g. by using `check program check-and-log-temp.sh` monit configuration
* Or: do checks by monit and every couple minutes run `check program log-system-info.sh`
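
A possible shape for the "check and log" variant, sketched in Python instead of shell. The function name, the CSV columns and the 80% default are assumptions; the percentage is computed from `shutil.disk_usage` and may differ slightly from what `df` reports.

```python
import csv
import shutil
from datetime import datetime, timezone
from pathlib import Path


def log_disk_usage(csv_path: Path, mount: str = "/", limit_percent: float = 80.0) -> int:
    """Append `timestamp,mount,used_percent` to csv_path; return 0 if OK, 1 if above the limit."""
    usage = shutil.disk_usage(mount)
    percent = 100.0 * usage.used / usage.total
    timestamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    with csv_path.open("a", newline="") as f:
        csv.writer(f).writerow([timestamp, mount, f"{percent:.1f}"])
    # Monit's `if status != 0 then alert` can act on this return value.
    return 0 if percent <= limit_percent else 1
```

A small wrapper script would pass the return value to `sys.exit()`, so that a `check program` entry can alert on a non-zero status while the CSV file keeps the history.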
### Monit behind Nginx
TODO: Nginx reverse proxy with basic authentication.