High-load and OOM notifications

The server looks alive: SSH connects, but everything runs 30 times slower. Often the cause is a forgotten process that has eaten all the CPU or memory. If a notification arrives at the moment of the spike, the investigation takes minutes instead of hours.

Conditions worth alerting on:

  • load average > N × number of CPUs (the script below uses N = 2);
  • RAM usage > 90%;
  • swap usage > 50%;
  • OOM killer events in dmesg/journalctl.
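
The first condition is plain arithmetic: divide the 1-minute load average by the CPU count and compare it with the limit. Below is a minimal sketch of that calculation; the function names load_pct and over_load_limit are illustrative and do not come from the original script.

```shell
#!/usr/bin/env bash
# load_pct LOAD CPUS -> load as an integer percentage of the CPU count (truncated).
load_pct() {
    awk -v l="$1" -v c="$2" 'BEGIN { printf "%d", l / c * 100 }'
}

# over_load_limit LOAD CPUS N -> succeeds when LOAD >= N x CPUS.
over_load_limit() {
    [ "$(load_pct "$1" "$2")" -ge $(( $3 * 100 )) ]
}

load_pct 6.00 4; echo                      # prints 150
over_load_limit 8.00 4 2 && echo "alert"   # 8.00 / 4 = 200%, so this prints "alert"
```

With N = 2, a 4-CPU machine triggers the alert once the load average reaches 8.00.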

/usr/local/bin/notifly-load-check:

#!/usr/bin/env bash
set -eu
set -a; source /etc/notifly.env; set +a
HOST=$(hostname -s)
CPUS=$(nproc)
LOAD1=$(awk '{print $1}' /proc/loadavg)
LOAD_PCT=$(awk -v l="$LOAD1" -v c="$CPUS" 'BEGIN{printf "%d", l/c*100}')
# Memory (without cache)
read -r _ MTOTAL MUSED _ <<<"$(free -m | awk '/^Mem:/{print $1, $2, $3, $4}')"
MEM_PCT=$(( MUSED * 100 / MTOTAL ))
read -r _ STOTAL SUSED _ <<<"$(free -m | awk '/^Swap:/{print $1, $2, $3, $4}')"
SWAP_PCT=0
[ "$STOTAL" -gt 0 ] && SWAP_PCT=$(( SUSED * 100 / STOTAL ))
# Collect every threshold that was exceeded
ALERTS=()
[ "$LOAD_PCT" -ge 200 ] && ALERTS+=("CPU load ${LOAD_PCT}% (load1=${LOAD1}, ${CPUS} CPU)")
[ "$MEM_PCT" -ge 90 ] && ALERTS+=("RAM ${MEM_PCT}% used (${MUSED}/${MTOTAL} MB)")
[ "$SWAP_PCT" -ge 50 ] && ALERTS+=("Swap ${SWAP_PCT}% (${SUSED}/${STOTAL} MB)")
if [ "${#ALERTS[@]}" -gt 0 ]; then
    # Header line plus the five most CPU-hungry processes
    TOP=$(ps -eo pid,user,pcpu,pmem,comm --sort=-pcpu | head -6)
    BODY=$(printf '%s\n' "${ALERTS[@]}" "Top processes:" "$TOP")
    /usr/local/bin/notifly-send "🔥 High load on $HOST" "$BODY" 8
fi
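
Make the script executable, then schedule it every 5 minutes. The cron line includes a user field, so it belongs in /etc/crontab or in a file under /etc/cron.d:

sudo chmod +x /usr/local/bin/notifly-load-check
*/5 * * * * root /usr/local/bin/notifly-load-check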

To avoid getting identical messages every 5 minutes, add a flag file check inside the if block, just before the notifly-send call:

FLAG=/tmp/notifly-load-flag
NOW=$(date +%s)
LAST=$(stat -c %Y "$FLAG" 2>/dev/null || echo 0)
[ $((NOW - LAST)) -lt 1800 ] && exit 0
touch "$FLAG"

Repeat notifications will then arrive no more often than once every 30 minutes.
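
The same debounce logic can be wrapped in a function, which makes the interval easy to change and the behaviour easy to check. This is a sketch only: the name should_notify and the temporary demo path are assumptions, and stat -c requires GNU coreutils, as in the snippet above.

```shell
#!/usr/bin/env bash
# should_notify FLAG_FILE INTERVAL_SECONDS
# Succeeds and refreshes the flag when the previous notification is older
# than the interval (or was never sent); fails otherwise.
should_notify() {
    local flag=$1 interval=$2 now last
    now=$(date +%s)
    last=$(stat -c %Y "$flag" 2>/dev/null || echo 0)
    if [ $(( now - last )) -lt "$interval" ]; then
        return 1
    fi
    touch "$flag"
}

demo_flag=$(mktemp -u)   # hypothetical flag path used only for this demo
should_notify "$demo_flag" 1800 && echo "first alert: send"
should_notify "$demo_flag" 1800 || echo "repeat within 30 minutes: suppressed"
rm -f "$demo_flag"
```

In the load check itself the call would replace the five debounce lines: should_notify /tmp/notifly-load-flag 1800 || exit 0.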

When the kernel kills a process because memory has run out, it writes the event to the kernel log (visible through dmesg or journalctl -k). Let's run a watcher as a system service:

/usr/local/bin/notifly-oom:

#!/usr/bin/env bash
set -eu
set -a; source /etc/notifly.env; set +a
# Follow the kernel log from now on and forward OOM killer lines
journalctl -kf -o cat --since now | \
while IFS= read -r line; do
    if echo "$line" | grep -qiE "killed process|out of memory"; then
        /usr/local/bin/notifly-send \
            "💀 OOM killer on $(hostname -s)" \
            "$line" 10 || true
    fi
done

/etc/systemd/system/notifly-oom.service:

[Unit]
Description=Notify about OOM kills
Wants=network-online.target
After=network-online.target
[Service]
ExecStart=/usr/local/bin/notifly-oom
Restart=always
RestartSec=10s
[Install]
WantedBy=multi-user.target

Make the script executable and enable the service:

sudo chmod +x /usr/local/bin/notifly-oom
sudo systemctl daemon-reload
sudo systemctl enable --now notifly-oom
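
Before relying on the watcher, it helps to check its filter against a sample line. The sketch below reuses the same grep pattern and additionally extracts the victim's PID and name. The function names and the sample line are illustrative; the exact wording of kernel messages can differ between kernel versions.

```shell
#!/usr/bin/env bash
# is_oom_line LINE -> succeeds when the line matches the watcher's pattern.
is_oom_line() {
    grep -qiE 'killed process|out of memory' <<<"$1"
}

# oom_victim LINE -> prints "PID NAME" when the line has the form
# "Killed process <pid> (<name>)"; prints nothing otherwise.
oom_victim() {
    sed -nE 's/.*[Kk]illed process ([0-9]+) \(([^)]*)\).*/\1 \2/p' <<<"$1"
}

# Illustrative sample, not captured from a real machine
sample='Out of memory: Killed process 4242 (stress) total-vm:1048576kB, anon-rss:524288kB'
is_oom_line "$sample" && oom_victim "$sample"   # prints: 4242 stress
```

Putting the PID and process name into the notification title saves opening the full message to see what was killed.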

A Windows equivalent that monitors CPU load, RAM usage, and page file usage. It uses the shared function Send-Notifly.

C:\scripts\Notifly-Load-Check.ps1:

. C:\scripts\Notifly.ps1
# $Host is a read-only automatic variable in PowerShell, so use another name
$HostName = $env:COMPUTERNAME
$FlagFile = "$env:TEMP\notifly-load.flag"
# CPU: average of three one-second samples
$samples = Get-Counter '\Processor(_Total)\% Processor Time' -SampleInterval 1 -MaxSamples 3
$cpu = [int](($samples.CounterSamples.CookedValue | Measure-Object -Average).Average)
$os = Get-CimInstance Win32_OperatingSystem
$memUsedPct = [int]((($os.TotalVisibleMemorySize - $os.FreePhysicalMemory) / $os.TotalVisibleMemorySize) * 100)
# Win32_PageFileUsage returns one instance per page file, so sum over all of them
$pageFiles = @(Get-CimInstance Win32_PageFileUsage -ErrorAction SilentlyContinue)
$pageAllocated = ($pageFiles | Measure-Object -Property AllocatedBaseSize -Sum).Sum
$pageUsed = ($pageFiles | Measure-Object -Property CurrentUsage -Sum).Sum
$pageUsedPct = if ($pageAllocated -gt 0) { [int](($pageUsed / $pageAllocated) * 100) } else { 0 }
$alerts = @()
if ($cpu -ge 90) { $alerts += "CPU: $cpu%" }
if ($memUsedPct -ge 90) { $alerts += "RAM: $memUsedPct%" }
if ($pageUsedPct -ge 50) { $alerts += "Page file: $pageUsedPct%" }
if ($alerts.Count -gt 0) {
    # Debounce: at most one notification per 30 minutes
    if (Test-Path $FlagFile) {
        $age = (Get-Date) - (Get-Item $FlagFile).LastWriteTime
        if ($age.TotalMinutes -lt 30) { exit 0 }
    }
    New-Item -ItemType File -Path $FlagFile -Force | Out-Null
    # The CPU column is cumulative processor time in seconds, not current load
    $top = Get-Process | Sort-Object CPU -Descending | Select-Object -First 5 |
        Format-Table -AutoSize ProcessName, Id, @{N='CPU(s)';E={[math]::Round($_.CPU,1)}}, `
            @{N='RAM(MB)';E={[int]($_.WorkingSet64/1MB)}} | Out-String
    # Parenthesize the whole message; otherwise "+" is parsed as a separate argument
    Send-Notifly -Title "🔥 High load on $HostName" `
        -Message (($alerts -join ", ") + "`n`nTop processes:`n$top") `
        -Priority 8
}

Register in Task Scheduler (every 5 minutes):

$Action = New-ScheduledTaskAction -Execute "powershell.exe" `
-Argument "-NoProfile -ExecutionPolicy Bypass -File C:\scripts\Notifly-Load-Check.ps1"
$Trigger = New-ScheduledTaskTrigger -Once -At (Get-Date) `
-RepetitionInterval (New-TimeSpan -Minutes 5)
$Princ = New-ScheduledTaskPrincipal -UserId "SYSTEM" -LogonType ServiceAccount -RunLevel Highest
Register-ScheduledTask -TaskName "Notifly Load Check" `
-Action $Action -Trigger $Trigger -Principal $Princ

Windows doesn't have an OOM killer, but critical memory shortages are recorded in the System event log by the Resource-Exhaustion-Detector source (Event ID 2004). Let's attach a scheduled task to that event:

C:\scripts\Register-NotiflyMemoryAlert.ps1:

# New-ScheduledTaskTrigger cannot create event triggers, so build one from its CIM class
$TriggerClass = Get-CimClass -ClassName MSFT_TaskEventTrigger `
    -Namespace Root/Microsoft/Windows/TaskScheduler
$Trigger = New-CimInstance -CimClass $TriggerClass -ClientOnly
$Trigger.Enabled = $true
$Trigger.Subscription = @"
<QueryList>
<Query Id="0" Path="System">
<Select Path="System">
*[System[Provider[@Name='Microsoft-Windows-Resource-Exhaustion-Detector'] and EventID=2004]]
</Select>
</Query>
</QueryList>
"@
$Action = New-ScheduledTaskAction -Execute "powershell.exe" `
    -Argument "-NoProfile -ExecutionPolicy Bypass -Command `". C:\scripts\Notifly.ps1; Send-Notifly -Title '💀 Critical memory shortage on $env:COMPUTERNAME' -Message 'Resource-Exhaustion-Detector reported memory exhaustion' -Priority 10`""
Register-ScheduledTask -TaskName "Notifly Memory Exhaustion" -Trigger $Trigger -Action $Action `
    -Principal (New-ScheduledTaskPrincipal -UserId "SYSTEM" -LogonType ServiceAccount -RunLevel Highest)

What this gives you:

  • You see the moment of the spike. The top processes in the message often point straight to the culprit.
  • Catching OOM kills is the most valuable part. These events quickly roll out of the dmesg buffer, and without an alert they are discovered days later.
  • A cheap “liveness sensor”: if a busy production server has sent no alerts for days, that is a cue to check that monitoring itself is still alive.

Ideas for further improvement:

  • Use pidstat or pressure stall information (PSI) to assess real backpressure more accurately.
  • Compare the current load with the weekly average and alert only on deviation.
  • Link these alerts to the service outage notifications: an OOM kill is usually followed by a specific service crashing.
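
As an illustration of the PSI idea, the sketch below reads the 10-second average from /proc/pressure/memory, which is present only on kernels built with PSI support. The function names and the 10% threshold are assumptions for the example, not recommendations from the original text.

```shell
#!/usr/bin/env bash
# psi_avg10 KIND [FILE] -> prints the avg10 value of the "some" or "full" line.
psi_avg10() {
    local kind=$1 file=${2:-/proc/pressure/memory}
    awk -v k="$kind" '$1 == k { sub(/^avg10=/, "", $2); print $2 }' "$file"
}

# psi_over VALUE THRESHOLD -> succeeds when VALUE >= THRESHOLD (numeric compare).
psi_over() {
    awk -v v="$1" -v t="$2" 'BEGIN { exit !(v + 0 >= t + 0) }'
}

# Run the check only where the kernel exposes PSI
if [ -r /proc/pressure/memory ]; then
    some=$(psi_avg10 some)
    if psi_over "$some" 10; then   # 10% is an assumed threshold
        echo "Memory pressure: some avg10=${some}%"
    fi
fi
```

Unlike the load average, the "some" value shows the share of time in which at least one task was stalled waiting for memory, so it reflects real contention rather than queue length.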