High-load and OOM notifications

The server looks alive: SSH connects, but everything runs 30 times slower. The usual cause is a forgotten process that has eaten all the CPU or memory. If a notification arrives at the moment of the spike, the investigation takes minutes instead of hours.

What we monitor

1-minute load average > N × number of CPUs (the script uses N = 2);
RAM usage > 90%;
swap usage > 50%;
OOM killer events in dmesg/journalctl.

Script

/usr/local/bin/notifly-load-check:

#!/usr/bin/env bash
set -eu
set -a; source /etc/notifly.env; set +a

HOST=$(hostname -s)
CPUS=$(nproc)
LOAD1=$(awk '{print $1}' /proc/loadavg)
LOAD_PCT=$(awk -v l="$LOAD1" -v c="$CPUS" 'BEGIN{printf "%d", l/c*100}')

# Memory (without cache)
read -r _ MTOTAL MUSED _ <<<"$(free -m | awk '/^Mem:/{print $1, $2, $3, $4}')"
MEM_PCT=$(( MUSED * 100 / MTOTAL ))

read -r _ STOTAL SUSED _ <<<"$(free -m | awk '/^Swap:/{print $1, $2, $3, $4}')"
SWAP_PCT=0
[ "$STOTAL" -gt 0 ] && SWAP_PCT=$(( SUSED * 100 / STOTAL ))

ALERTS=()

[ "$LOAD_PCT" -ge 200 ] && ALERTS+=("CPU load ${LOAD_PCT}% (load1=${LOAD1}, ${CPUS} CPU)")
[ "$MEM_PCT"  -ge  90 ] && ALERTS+=("RAM ${MEM_PCT}% used (${MUSED}/${MTOTAL} MB)")
[ "$SWAP_PCT" -ge  50 ] && ALERTS+=("Swap ${SWAP_PCT}% (${SUSED}/${STOTAL} MB)")

if [ "${#ALERTS[@]}" -gt 0 ]; then
    TOP=$(ps -eo pid,user,pcpu,pmem,comm --sort=-pcpu | head -6)
    /usr/local/bin/notifly-send \
        "🔥 High load on $HOST" \
        "$(printf '%s\n' "${ALERTS[@]}")

Top processes:
$TOP" 8
fi

sudo chmod +x /usr/local/bin/notifly-load-check
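
The free-based calculation above depends on how the installed procps version defines the "used" column. As an alternative, here is a minimal sketch that derives the percentage from MemTotal and MemAvailable in /proc/meminfo, where MemAvailable is the kernel's own estimate of memory that can be handed out without swapping. The function name mem_used_pct is illustrative and not part of the original script.

```shell
#!/usr/bin/env bash
# Print the percentage of RAM in use, computed from meminfo-formatted input
# as (MemTotal - MemAvailable) * 100 / MemTotal. Values in /proc/meminfo are in kB.
# Assumption: the kernel exposes MemAvailable (present since Linux 3.14);
# if the field is missing, nothing is printed.
mem_used_pct() {
    awk '/^MemTotal:/     { total = $2 }
         /^MemAvailable:/ { avail = $2; seen = 1 }
         END { if (seen && total > 0) printf "%d\n", (total - avail) * 100 / total }'
}

# Possible use in the load-check script:
#   MEM_PCT=$(mem_used_pct < /proc/meminfo)
```

Reading from standard input keeps the function easy to check against a saved copy of /proc/meminfo.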

Run every 5 minutes

Add the line to /etc/crontab or to a file in /etc/cron.d/ (this format includes the user field):

*/5 * * * * root /usr/local/bin/notifly-load-check

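To avoid getting identical messages every 5 minutes, add a flag-file check inside the if block, right before the notifly-send call. Placed at the top of the script, it would refresh the flag on every run and suppress real alerts: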

FLAG=/tmp/notifly-load-flag
NOW=$(date +%s)
LAST=$(stat -c %Y "$FLAG" 2>/dev/null || echo 0)
[ $((NOW - LAST)) -lt 1800 ] && exit 0
touch "$FLAG"

Repeat notifications then arrive at most once every 30 minutes.
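
The same debounce can be wrapped in a function, so that the flag is only touched when an alert is actually about to go out. This is a sketch under assumptions: the name should_alert and its arguments are illustrative, and stat -c %Y is the GNU coreutils form.

```shell
#!/usr/bin/env bash
# Debounce helper: succeeds (exit status 0) when an alert may be sent,
# that is, when the flag file is missing or older than the interval.
# Arguments: flag file path, interval in seconds (default 1800).
should_alert() {
    local flag=$1 interval=${2:-1800}
    local now last
    now=$(date +%s)
    # Modification time in epoch seconds; 0 if the flag does not exist yet.
    last=$(stat -c %Y "$flag" 2>/dev/null || echo 0)
    if [ $(( now - last )) -lt "$interval" ]; then
        return 1        # alerted recently, stay quiet
    fi
    touch "$flag"       # remember when this alert went out
    return 0
}

# Possible use in the load-check script:
#   if [ "${#ALERTS[@]}" -gt 0 ] && should_alert /tmp/notifly-load-flag 1800; then
#       ... call notifly-send ...
#   fi
```

Because the function is called only after the thresholds have fired, a quiet run never resets the 30-minute window.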

OOM killer

When the kernel kills a process because memory has run out, it writes a message to the kernel log (dmesg). Let's run a service that watches for these messages:

/usr/local/bin/notifly-oom:

#!/usr/bin/env bash
set -eu
set -a; source /etc/notifly.env; set +a

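# Follow kernel messages only (-k), starting from now, message text only (-o cat)
journalctl -kf -o cat --since now | \
while IFS= read -r line; do
    if printf '%s\n' "$line" | grep -qiE "killed process|out of memory"; then
        # "|| true" keeps the watcher running if a single send fails
        /usr/local/bin/notifly-send \
            "💀 OOM killer on $(hostname -s)" \
            "$line" 10 || true
    fi
done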

/etc/systemd/system/notifly-oom.service:

[Unit]
Description=Notify about OOM kills
Wants=network-online.target
After=network-online.target

[Service]
ExecStart=/usr/local/bin/notifly-oom
Restart=always
RestartSec=10s

[Install]
WantedBy=multi-user.target

sudo chmod +x /usr/local/bin/notifly-oom
sudo systemctl daemon-reload
sudo systemctl enable --now notifly-oom
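
The watcher forwards the raw kernel line as the message body. If a shorter title is wanted, the PID and process name can be pulled out of it. The sketch below assumes the usual kernel wording, "Killed process PID (name)", with "Kill process" on older kernels; the function name parse_oom_line and the sample line are illustrative.

```shell
#!/usr/bin/env bash
# Extract "PID NAME" from a kernel OOM line such as
#   Out of memory: Killed process 1234 (mysqld) total-vm:...
# Prints nothing and returns non-zero when the line does not match.
parse_oom_line() {
    local out
    out=$(printf '%s\n' "$1" |
          sed -nE 's/.*[Kk]ill(ed)? process ([0-9]+) \(([^)]*)\).*/\2 \3/p')
    [ -n "$out" ] && printf '%s\n' "$out"
}

# Possible use inside the watcher loop:
#   if info=$(parse_oom_line "$line"); then
#       title="OOM killer on $(hostname -s): $info"
#   fi
```

Lines that mention "out of memory" without a killed process simply do not match, so the raw line can still be sent as a fallback.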

Windows: PowerShell + Task Scheduler

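A Windows equivalent that monitors CPU and RAM load and page file usage. It relies on the shared function Send-Notifly, dot-sourced from C:\scripts\Notifly.ps1. Save the script as C:\scripts\Notifly-Load-Check.ps1, the path used by the scheduled task.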

. C:\scripts\Notifly.ps1

# $Host is a read-only automatic variable in PowerShell, so use another name
$ComputerName = $env:COMPUTERNAME
$FlagFile = "$env:TEMP\notifly-load.flag"

# Average CPU load over ~3 seconds (three one-second samples)
$samples = Get-Counter '\Processor(_Total)\% Processor Time' -SampleInterval 1 -MaxSamples 3
$cpu = [int](($samples.CounterSamples.CookedValue | Measure-Object -Average).Average)

$os = Get-CimInstance Win32_OperatingSystem
$memUsedPct = [int]((($os.TotalVisibleMemorySize - $os.FreePhysicalMemory) / $os.TotalVisibleMemorySize) * 100)

$pageFile = Get-CimInstance Win32_PageFileUsage -ErrorAction SilentlyContinue
$pageUsedPct = if ($pageFile -and $pageFile.AllocatedBaseSize -gt 0) {
    [int](($pageFile.CurrentUsage / $pageFile.AllocatedBaseSize) * 100)
} else { 0 }

$alerts = @()
if ($cpu -ge 90)         { $alerts += "CPU: $cpu%" }
if ($memUsedPct -ge 90)  { $alerts += "RAM: $memUsedPct%" }
if ($pageUsedPct -ge 50) { $alerts += "Page file: $pageUsedPct%" }

if ($alerts.Count -gt 0) {
    # Debounce for 30 minutes
    if (Test-Path $FlagFile) {
        $age = (Get-Date) - (Get-Item $FlagFile).LastWriteTime
        if ($age.TotalMinutes -lt 30) { exit 0 }
    }
    New-Item -ItemType File -Path $FlagFile -Force | Out-Null

    $top = Get-Process | Sort-Object CPU -Descending | Select-Object -First 5 |
           Format-Table -AutoSize ProcessName, Id, @{N='CPU(s)';E={[math]::Round($_.CPU,1)}}, `
               @{N='RAM(MB)';E={[int]($_.WorkingSet64/1MB)}} | Out-String

    # Build the message first: an unparenthesized "+" after -Message
    # would be parsed as extra arguments, not as string concatenation
    $message = ($alerts -join ", ") + "`n`nTop processes:`n$top"

    Send-Notifly -Title "🔥 High load on $ComputerName" `
                 -Message $message `
                 -Priority 8
}

$Action  = New-ScheduledTaskAction -Execute "powershell.exe" `
    -Argument "-NoProfile -ExecutionPolicy Bypass -File C:\scripts\Notifly-Load-Check.ps1"
$Trigger = New-ScheduledTaskTrigger -Once -At (Get-Date) `
    -RepetitionInterval (New-TimeSpan -Minutes 5)
$Princ   = New-ScheduledTaskPrincipal -UserId "SYSTEM" -LogonType ServiceAccount -RunLevel Highest
Register-ScheduledTask -TaskName "Notifly Load Check" `
    -Action $Action -Trigger $Trigger -Principal $Princ

Windows equivalent of OOM killer

Windows has no OOM killer, but critical low-memory conditions are recorded in the System event log by the Microsoft-Windows-Resource-Exhaustion-Detector provider (Event ID 2004). New-ScheduledTaskTrigger cannot create event triggers, so the trigger is built from the MSFT_TaskEventTrigger CIM class and subscribed to that log:

# Build an event trigger; a startup trigger has no Subscription property
$TriggerClass = Get-CimClass -ClassName MSFT_TaskEventTrigger `
    -Namespace Root/Microsoft/Windows/TaskScheduler
$Trigger = New-CimInstance -CimClass $TriggerClass -ClientOnly
$Trigger.Enabled = $true
$Trigger.Subscription = @"
<QueryList>
  <Query Id="0" Path="System">
    <Select Path="System">
      *[System[Provider[@Name='Microsoft-Windows-Resource-Exhaustion-Detector'] and EventID=2004]]
    </Select>
  </Query>
</QueryList>
"@
$Action  = New-ScheduledTaskAction -Execute "powershell.exe" `
    -Argument "-NoProfile -ExecutionPolicy Bypass -Command `". C:\scripts\Notifly.ps1; Send-Notifly -Title '💀 Critical memory shortage on $env:COMPUTERNAME' -Message 'Resource-Exhaustion-Detector reported RAM exhaustion' -Priority 10`""
Register-ScheduledTask -TaskName "Notifly Memory Exhaustion" -Trigger $Trigger -Action $Action `
    -Principal (New-ScheduledTaskPrincipal -UserId "SYSTEM" -LogonType ServiceAccount -RunLevel Highest)

Benefits

You can see the moment of the spike. The top processes in the message often immediately point to the culprit.
Catching OOMs is the most valuable. These events quickly roll off dmesg, and without an alert they’re discovered days later.
A cheap "liveness sensor": if a busy production server sends no alerts for days, that is a cue to check that monitoring itself is still working.

What to improve next

Use pidstat or pressure stall information (PSI) to measure real resource contention more accurately.
Compare the current load with the weekly average and alert only on deviation.
Link these alerts to service-outage notifications: an OOM kill is usually followed by a specific service crashing.
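
As a starting point for the PSI idea, here is a minimal sketch that reads the "some avg10" value (percentage of time at least one task was stalled over the last 10 seconds) from a PSI file such as /proc/pressure/memory. PSI requires Linux 4.20 or later with PSI enabled. The function names and the threshold of 10 in the usage comment are assumptions for illustration, not tuned values.

```shell
#!/usr/bin/env bash
# Print the avg10 value of the "some" line from PSI-formatted input, e.g.
#   some avg10=1.23 avg60=0.50 avg300=0.10 total=12345
psi_some_avg10() {
    awk '$1 == "some" { sub(/^avg10=/, "", $2); print $2 }'
}

# Succeed when the "some avg10" value in the given PSI file is at or above
# the threshold. Fails when the file is unreadable (PSI not available).
psi_exceeds() {
    local file=$1 threshold=$2 value
    [ -r "$file" ] || return 1
    value=$(psi_some_avg10 < "$file")
    # awk handles the floating-point comparison; exit status 0 means "exceeds"
    awk -v v="$value" -v t="$threshold" 'BEGIN { exit !(v + 0 >= t + 0) }'
}

# Possible use in the load-check script:
#   psi_exceeds /proc/pressure/memory 10 && ALERTS+=("Memory pressure above 10%")
```

Unlike load average, this value reflects time actually lost to waiting on a resource, so it is less likely to fire on a machine that is busy but not stalled.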