Notifications about failing systemd services

On modern Linux distributions most services run under systemd. If nginx, postgres, your backend, or the container runtime suddenly goes down, you need to know right away, not when users start calling.

The neatest solution is to build the notifications directly into systemd via OnFailure=, with no additional watchers or cron jobs.

Idea

systemd can automatically start another unit when a service enters the failed state. It is enough to create a universal notifier unit once and attach it to every service that matters via a drop-in.

Universal notifier

/etc/systemd/system/notifly@.service:

[Unit]
Description=Notify about failed unit %i

[Service]
Type=oneshot
EnvironmentFile=/etc/notifly.env
ExecStart=/bin/bash -c '\
    STATUS=$(systemctl status %i --no-pager -n 20 | tail -n 20 | sed "s/\"/\\\\\\\"/g"); \
    /usr/local/bin/notifly-send \
        "🛑 Service %i failed on $(hostname -s)" \
        "$STATUS" 9'

Reload systemd:

sudo systemctl daemon-reload
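
The unit calls /usr/local/bin/notifly-send, which is not shown here. Below is a minimal sketch of what such a helper could look like. It assumes Notifly accepts a JSON POST with title, message, and priority fields plus a bearer token. The variable names NOTIFLY_URL and NOTIFLY_TOKEN (expected in /etc/notifly.env), the payload shape, and the auth header are all assumptions; adjust them to the real API.

```shell
#!/usr/bin/env bash
# Hypothetical /usr/local/bin/notifly-send: TITLE MESSAGE [PRIORITY]
# NOTIFLY_URL and NOTIFLY_TOKEN are assumed names, loaded via EnvironmentFile=.

# Escape a string so it can be embedded in a JSON string literal.
json_escape() {
    local s=$1
    s=${s//\\/\\\\}       # backslashes first, so later escapes are not doubled
    s=${s//\"/\\\"}       # double quotes
    s=${s//$'\n'/\\n}     # newlines
    s=${s//$'\r'/\\r}     # carriage returns
    s=${s//$'\t'/\\t}     # tabs
    printf '%s' "$s"
}

# Build the JSON body (assumed shape: title, message, priority).
notifly_payload() {
    printf '{"title":"%s","message":"%s","priority":%d}' \
        "$(json_escape "$1")" "$(json_escape "$2")" "${3:-5}"
}

# POST the payload; -f makes curl exit non-zero on HTTP errors.
notifly_send() {
    curl -fsS --max-time 10 \
        -H "Authorization: Bearer ${NOTIFLY_TOKEN}" \
        -H "Content-Type: application/json" \
        -d "$(notifly_payload "$@")" \
        "${NOTIFLY_URL}"
}

# Send only when called with at least a title and a message.
if [ "$#" -ge 2 ]; then
    notifly_send "$@"
fi
```

Because the helper escapes the message itself, callers can pass raw multi-line text such as systemctl status output.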

Test sending manually:

sudo systemctl start notifly@nginx.service

Attaching to services

No need to edit the original unit files. Use a drop-in:

sudo systemctl edit nginx.service

In the editor that opens, add:

[Unit]
OnFailure=notifly@%n.service

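Save and exit. systemctl edit reloads the configuration on its own; an explicit reload is harmless if you want to be sure: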

sudo systemctl daemon-reload

Similarly, attach to all critical services:

for s in nginx postgresql redis docker your-backend; do
    sudo mkdir -p "/etc/systemd/system/${s}.service.d"
    cat <<EOF | sudo tee "/etc/systemd/system/${s}.service.d/notifly.conf" >/dev/null
[Unit]
OnFailure=notifly@%n.service
EOF
done
sudo systemctl daemon-reload
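
To confirm that the drop-ins were picked up, the effective OnFailure= value can be read back with systemctl show. The helper below is a small sketch: the function name missing_notifier and the SYSTEMCTL override (handy for dry runs) are illustrative, not part of systemd.

```shell
#!/usr/bin/env bash
# Print every unit whose effective OnFailure= list lacks a notifly@ instance.
# Usage: missing_notifier nginx.service postgresql.service redis.service
missing_notifier() {
    local unit onfailure
    for unit in "$@"; do
        # --value prints only the property value, without the "OnFailure=" prefix
        onfailure=$(${SYSTEMCTL:-systemctl} show --property=OnFailure --value "$unit")
        case " $onfailure " in
            *" notifly@"*) ;;                 # notifier is attached
            *) printf '%s\n' "$unit" ;;       # notifier is missing
        esac
    done
}
```

Empty output means every listed unit has the notifier attached.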

Test

Simulate a failure:

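# Sanity-check the real nginx config first (this does not trigger a failure)
sudo nginx -t || true
# Run a transient unit that exits non-zero, with the notifier attached explicitly
sudo systemd-run --unit=test-fail.service \
    --property=OnFailure=notifly@test-fail.service /bin/false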

Within a few seconds a message will arrive in Notifly containing the title and the last 20 lines of the failed service’s status.

Additionally: notification for automatic restarts

If a service has Restart=on-failure enabled, systemd restarts it silently, and OnFailure= is not triggered while restarts are still being attempted. To be notified once the restarts are exhausted, combine OnFailure= with a start limit:

[Unit]
OnFailure=notifly@%n.service
StartLimitBurst=3
StartLimitIntervalSec=120
StartLimitAction=none

[Service]
Restart=on-failure
RestartSec=5s

The notification then arrives after the third failed start within 2 minutes, that is, once systemd hits the start limit and gives up on automatic recovery.
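
The interplay of StartLimitBurst= and StartLimitIntervalSec= is easier to see with numbers. The function below is a simplified model of start rate limiting, not systemd's actual code: it takes a burst, an interval, and a list of start times in seconds, and reports which start would be refused. The name start_limit_hit is made up for this illustration.

```shell
#!/usr/bin/env bash
# Simplified model of start rate limiting (an approximation for illustration).
# Usage: start_limit_hit BURST INTERVAL_SEC START_TIME...
# Prints the 1-based index of the first refused start, or "none".
start_limit_hit() {
    local burst=$1 interval=$2 begin="" count=0 n=0 t
    shift 2
    for t in "$@"; do
        n=$((n + 1))
        # A new window opens on the first start or once the interval has elapsed
        if [ -z "$begin" ] || [ $((t - begin)) -gt "$interval" ]; then
            begin=$t
            count=0
        fi
        if [ "$count" -lt "$burst" ]; then
            count=$((count + 1))      # start allowed
        else
            echo "$n"                 # start refused: the unit ends up failed
            return 0
        fi
    done
    echo none
}

start_limit_hit 3 120 0 5 10 15        # crash loop with RestartSec=5s: prints 4
start_limit_hit 3 120 0 100 200 300    # occasional crashes: prints none
```

In the crash-loop case the three allowed starts all fail, the fourth is refused, and that is the point at which OnFailure= fires.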

Windows: PowerShell + Service Control Manager

The Windows equivalent: subscribe to Service Control Manager events in the System event log (Event IDs 7031 and 7034, both logged when a service terminates unexpectedly) through an event trigger in Task Scheduler.

Notifier script

C:\scripts\Notifly-Service-Failed.ps1:

param([string]$ServiceName, [string]$EventId)
. C:\scripts\Notifly.ps1

$svc = Get-Service -Name $ServiceName -ErrorAction SilentlyContinue
$status = if ($svc) { $svc.Status } else { "Unknown" }

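# Up to 5 matching entries among the last 20 Service Control Manager events.
# SCM messages usually mention the display name, so match either name literally.
$names = @($ServiceName; if ($svc) { $svc.DisplayName }) | ForEach-Object { [regex]::Escape($_) }
$logs = Get-WinEvent -FilterHashtable @{LogName='System'; ProviderName='Service Control Manager'} `
        -MaxEvents 20 -ErrorAction SilentlyContinue |
        Where-Object { $_.Message -match ($names -join '|') } |
        Select-Object -First 5 |
        ForEach-Object { "$($_.TimeCreated.ToString('HH:mm:ss'))  $($_.Message)" } |
        Out-String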

Send-Notifly `
    -Title "🛑 Service $ServiceName failed on $env:COMPUTERNAME" `
    -Message "Status: $status`nEvent ID: $EventId`n`n$logs" `
    -Priority 9

Subscribing to events for critical services

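$services = @("Spooler", "W3SVC", "MSSQLSERVER", "nginx")

# New-ScheduledTaskTrigger cannot create event triggers, so instantiate the CIM class directly
$triggerClass = Get-CimClass -ClassName MSFT_TaskEventTrigger `
    -Namespace Root/Microsoft/Windows/TaskScheduler

foreach ($s in $services) {
    $svc = Get-Service -Name $s -ErrorAction SilentlyContinue
    if (-not $svc) { Write-Warning "Service $s not found, skipping"; continue }
    # Events 7031/7034 normally carry the display name (not the short name) in EventData
    $display = $svc.DisplayName
    $xml = @"
<QueryList>
  <Query Id="0" Path="System">
    <Select Path="System">
      *[System[Provider[@Name='Service Control Manager'] and (EventID=7031 or EventID=7034)]]
      and *[EventData[Data='$display']]
    </Select>
  </Query>
</QueryList>
"@
    $Trigger = New-CimInstance -CimClass $triggerClass -ClientOnly
    $Trigger.Enabled = $true
    $Trigger.Subscription = $xml
    # The trigger fires for either event ID, so the script is told both
    $Action = New-ScheduledTaskAction -Execute "powershell.exe" `
        -Argument "-NoProfile -ExecutionPolicy Bypass -File C:\scripts\Notifly-Service-Failed.ps1 -ServiceName `"$s`" -EventId 7031/7034"
    Register-ScheduledTask -TaskName "Notifly Watch $s" -Trigger $Trigger -Action $Action `
        -Principal (New-ScheduledTaskPrincipal -UserId "SYSTEM" -LogonType ServiceAccount -RunLevel Highest) `
        -Force
}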

Alternative: scheduled polling

If event subscriptions are not an option, poll the services every minute from a scheduled task:

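. C:\scripts\Notifly.ps1
$watch = @("Spooler", "W3SVC", "MSSQLSERVER")
$state = "C:\ProgramData\Notifly\service-state.json"
# Make sure the state directory exists before the first write
New-Item -ItemType Directory -Path (Split-Path $state) -Force | Out-Null
# Previous statuses keyed by service name (empty on the first run)
$prev  = if (Test-Path $state) { Get-Content $state -Raw | ConvertFrom-Json } else { @{} }
$now   = @{}
foreach ($s in $watch) {
    $svc = Get-Service -Name $s -ErrorAction SilentlyContinue
    $now[$s] = if ($svc) { $svc.Status.ToString() } else { "Missing" }
    # Alert only on a change into a non-running state
    if ($prev.$s -and $prev.$s -ne $now[$s] -and $now[$s] -ne "Running") {
        # ${s} keeps the following colon from being parsed as part of the variable name
        Send-Notifly -Title "🛑 ${s}: $($now[$s]) on $env:COMPUTERNAME" `
                     -Message "Was: $($prev.$s)" -Priority 9
    }
}
$now | ConvertTo-Json | Set-Content $state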

Benefits

No polling-induced false alerts: the message is sent only when the unit actually enters the failed state.
No agents: systemd is already present on most modern distributions, so there is no need for Zabbix, Prometheus, or Datadog just for a simple up/down check.
Context in your pocket: the message contains 20 log lines, often enough to understand the cause directly from your phone.

What to improve next

Add a “Restart” link — a separate action in Notifly with a deep link to the internal admin portal.
Implement different priority levels: for example, priority=4 for secondary cron units and priority=10 for production databases.
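
One way to approach the priority idea is a small lookup keyed on the unit name. The sketch below is only an illustration: the function name priority_for_unit and the name patterns are made up and would need to match your own units.

```shell
#!/usr/bin/env bash
# Hypothetical mapping from unit name to notification priority.
# The patterns are examples only; adapt them to the units you actually run.
priority_for_unit() {
    case "$1" in
        postgresql*|mysql*|mariadb*) echo 10 ;;  # production databases
        *cron*|*backup*)             echo 4 ;;   # secondary scheduled jobs
        *)                           echo 9 ;;   # default used by the notifier above
    esac
}

priority_for_unit postgresql.service     # prints 10
priority_for_unit nightly-cron.service   # prints 4
priority_for_unit nginx.service          # prints 9
```

A helper like this could live inside the sending script, so that the notifier unit passes only the unit name and the script decides how loud the alert should be.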