Notifications about failing systemd services
On most modern Linux distributions, services run under systemd. If nginx, postgres, your backend, or the container runtime suddenly goes down, you need to know right away, not when users start calling.
The neatest solution is to hook notifications directly into systemd via OnFailure=, with no additional watchers or cron jobs.
systemd can automatically start another unit when the main service transitions to the failed state. It is enough to create a "universal" notifier unit once and attach it to every service that needs it via a drop-in.
Universal notifier
/etc/systemd/system/notifly@.service:

[Unit]
Description=Notify about failed unit %i

[Service]
Type=oneshot
EnvironmentFile=/etc/notifly.env
# $$ passes a literal $ to bash; a single $ would be expanded by systemd itself
ExecStart=/bin/bash -c '\
  STATUS=$$(systemctl status %i --no-pager -n 20 | tail -n 20 | sed "s/\"/\\\\\\\"/g"); \
  /usr/local/bin/notifly-send \
    "🛑 Service %i failed on $$(hostname -s)" \
    "$$STATUS" 9'

Reload systemd:

sudo systemctl daemon-reload

Test sending manually:

sudo systemctl start notifly@nginx.service

Attaching to services
No need to edit the original unit files. Use a drop-in:

sudo systemctl edit nginx.service

In the editor that opens, add:

[Unit]
OnFailure=notifly@%n.service

Here %n expands to the full name of the failing unit (for example nginx.service), which becomes the instance name %i of the notifier.

Save, then:

sudo systemctl daemon-reload

Similarly, attach to all critical services:
for s in nginx postgresql redis docker your-backend; do
  sudo mkdir -p "/etc/systemd/system/${s}.service.d"
  cat <<EOF | sudo tee "/etc/systemd/system/${s}.service.d/notifly.conf" >/dev/null
[Unit]
OnFailure=notifly@%n.service
EOF
done
sudo systemctl daemon-reload

Simulate a failure:

# Check that the real nginx config is still valid (nothing is changed here)
sudo nginx -t || true
# Run a transient unit that always fails; it has no drop-in, so attach the notifier explicitly
sudo systemd-run --unit=test-fail.service \
  --property=OnFailure=notifly@test-fail.service /bin/false

Within a few seconds a message will arrive in Notifly containing the title and the last 20 lines of the failed service's status.
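Before writing into /etc/systemd/system, it can help to render the same drop-ins into a scratch directory and review them first. The sketch below is one way to do that; the function name write_notifly_dropin and its root-directory argument are illustrative assumptions, not part of systemd or Notifly.

```shell
# Hypothetical helper: writes the OnFailure drop-in for one service under a
# chosen root directory (the real target would be /etc/systemd/system).
write_notifly_dropin() {
  local root="$1" service="$2"
  local dir="${root}/${service}.service.d"
  mkdir -p "$dir" || return 1
  # %n is left for systemd to expand; printf needs %% to emit a literal %.
  printf '[Unit]\nOnFailure=notifly@%%n.service\n' > "${dir}/notifly.conf"
}

# Example: render drop-ins into a temporary directory and list them for review.
staging="$(mktemp -d)"
for s in nginx postgresql redis; do
  write_notifly_dropin "$staging" "$s"
done
find "$staging" -name notifly.conf | sort
```

Once the files look right, the same function can be pointed at /etc/systemd/system (run as root), followed by sudo systemctl daemon-reload.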
Additionally: notification for automatic restarts
If a service has Restart=on-failure enabled, systemd may silently restart it, and OnFailure= fires only once the unit actually ends up in the failed state.
To be notified when automatic recovery gives up, add OnFailure= and set a start limit. Note that the StartLimit* settings belong in the [Unit] section:

[Unit]
OnFailure=notifly@%n.service
StartLimitBurst=3
StartLimitIntervalSec=120
StartLimitAction=none

[Service]
Restart=on-failure
RestartSec=5s

Then the notification will arrive after the third failure within 2 minutes, that is, when automatic recovery fails.
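Whether the start limit is ever reached depends on how RestartSec relates to the limit window. As a rough estimate, assuming the service fails immediately on every start, the limit is hit only if StartLimitBurst restarts fit inside StartLimitIntervalSec; otherwise systemd keeps restarting the unit and OnFailure= never fires. The helper name start_limit_eta is hypothetical, and the calculation ignores the service's own startup time.

```shell
# Rough estimate, assuming the service fails instantly on every start.
# Args: burst, interval in seconds, restart delay in seconds.
# Prints the approximate seconds until the start limit is hit, or "never".
start_limit_eta() {
  local burst="$1" interval="$2" restart_sec="$3"
  local eta=$((burst * restart_sec))
  if [ "$eta" -lt "$interval" ]; then
    echo "$eta"
  else
    echo "never"
  fi
}

start_limit_eta 3 120 5    # values from the unit above; prints 15
start_limit_eta 3 120 60   # restarts too slow to reach the limit; prints never
```

With the values from the example, the unit would be marked failed roughly 15 seconds after the first crash. With RestartSec=60s and the same limit, the notification would never be sent.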
Windows: PowerShell + Service Control Manager
An equivalent for Windows servers: subscribe to Service Control Manager events in the System event log (Event IDs 7031 and 7034, logged when a service terminates unexpectedly). This uses an event trigger in Task Scheduler.
Notifier script
C:\scripts\Notifly-Service-Failed.ps1 (the script the scheduled task runs):

param([string]$ServiceName, [string]$EventId)
. C:\scripts\Notifly.ps1

$svc = Get-Service -Name $ServiceName -ErrorAction SilentlyContinue
$status = if ($svc) { $svc.Status } else { "Unknown" }

# Up to 5 of the last 20 Service Control Manager events that mention this service
$logs = Get-WinEvent -FilterHashtable @{LogName='System'; ProviderName='Service Control Manager'} `
    -MaxEvents 20 -ErrorAction SilentlyContinue |
  Where-Object { $_.Message -match [regex]::Escape($ServiceName) } |
  Select-Object -First 5 |
  ForEach-Object { "$($_.TimeCreated.ToString('HH:mm:ss')) $($_.Message)" } |
  Out-String

Send-Notifly `
  -Title "🛑 Service $ServiceName failed on $env:COMPUTERNAME" `
  -Message "Status: $status`nEvent ID: $EventId`n`n$logs" `
  -Priority 9

Subscribing to events for critical services
$services = @("Spooler", "W3SVC", "MSSQLSERVER", "nginx")

# New-ScheduledTaskTrigger has no event-based option, so the event trigger is built from its CIM class
$triggerClass = Get-CimClass -ClassName MSFT_TaskEventTrigger -Namespace Root/Microsoft/Windows/TaskScheduler

foreach ($s in $services) {
  # Note: SCM events usually record the display name (e.g. "Print Spooler"); adjust the Data filter if nothing matches
  $xml = @"
<QueryList>
  <Query Id="0" Path="System">
    <Select Path="System">
      *[System[Provider[@Name='Service Control Manager'] and (EventID=7031 or EventID=7034)]]
      and *[EventData[Data='$s']]
    </Select>
  </Query>
</QueryList>
"@
  $Trigger = New-CimInstance -CimClass $triggerClass -ClientOnly
  $Trigger.Subscription = $xml
  $Trigger.Enabled = $true
  $Action = New-ScheduledTaskAction -Execute "powershell.exe" `
    -Argument "-NoProfile -ExecutionPolicy Bypass -File C:\scripts\Notifly-Service-Failed.ps1 -ServiceName $s -EventId 7031"
  Register-ScheduledTask -TaskName "Notifly Watch $s" -Trigger $Trigger -Action $Action `
    -Principal (New-ScheduledTaskPrincipal -UserId "SYSTEM" -LogonType ServiceAccount -RunLevel Highest) `
    -Force
}

Alternative: scheduled polling
If subscribing to events doesn't work for some reason, poll the services every minute:

. C:\scripts\Notifly.ps1

$watch = @("Spooler", "W3SVC", "MSSQLSERVER")
$state = "C:\ProgramData\Notifly\service-state.json"
# Make sure the state directory exists before the first write
New-Item -ItemType Directory -Path (Split-Path $state) -Force | Out-Null
$prev = if (Test-Path $state) { Get-Content $state -Raw | ConvertFrom-Json } else { @{} }
$now = @{}

foreach ($s in $watch) {
  $svc = Get-Service -Name $s -ErrorAction SilentlyContinue
  $now[$s] = if ($svc) { $svc.Status.ToString() } else { "Missing" }
  # Alert only when the state changed since the last run and the service is not running
  if ($prev.$s -and $prev.$s -ne $now[$s] -and $now[$s] -ne "Running") {
    # ${s} keeps PowerShell from parsing "$s:" as a scoped variable reference
    Send-Notifly -Title "🛑 ${s}: $($now[$s]) on $env:COMPUTERNAME" `
      -Message "Was: $($prev.$s)" -Priority 9
  }
}

$now | ConvertTo-Json | Set-Content $state

Benefits
- Zero false alerts: the message arrives exactly when the unit actually went to failed.
- No agents: systemd is already running on the host, so there is no need for Zabbix, Prometheus, or Datadog just for a simple up/down check.
- Context in your pocket: the message contains 20 log lines, often enough to understand the cause directly from your phone.
What to improve next
- Add a "Restart" link: a separate action in Notifly with a deep link to the internal admin portal.
- Implement different priority levels: priority=4 for secondary cron units, priority=10 for prod databases.
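
The priority idea could be sketched as a small lookup that the notifier consults before sending. The unit-name patterns and the function name priority_for_unit are purely illustrative assumptions; the values 4 and 10 come from the list above, and 9 is the value the notifier already uses.

```shell
# Hypothetical mapping from unit name to notification priority.
priority_for_unit() {
  case "$1" in
    postgresql*|mysql*|mariadb*) echo 10 ;;  # prod databases (example patterns)
    *cron*|*backup*)             echo 4  ;;  # secondary scheduled jobs (example patterns)
    *)                           echo 9  ;;  # value currently hard-coded in the notifier
  esac
}

priority_for_unit postgresql.service      # prints 10
priority_for_unit logrotate-cron.service  # prints 4
priority_for_unit nginx.service           # prints 9
```

The notifier could then pass this value to notifly-send instead of the fixed 9.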