High-load and OOM notifications
The server looks alive: SSH connects, but everything runs 30 times slower. The cause is often a forgotten process that has eaten the CPU or memory. If a notification arrives at the moment of the spike, the investigation takes minutes instead of hours.
What we monitor
- load average > N × number of CPUs;
- RAM usage > 90%;
- swap usage > 50%;
- OOM killer events in dmesg/journalctl.
Script
/usr/local/bin/notifly-load-check:

#!/usr/bin/env bash
set -eu
set -a; source /etc/notifly.env; set +a

HOST=$(hostname -s)
CPUS=$(nproc)
LOAD1=$(awk '{print $1}' /proc/loadavg)
# 1-minute load average as a percentage of the CPU count
LOAD_PCT=$(awk -v l="$LOAD1" -v c="$CPUS" 'BEGIN{printf "%d", l/c*100}')

# Memory (without cache)
read -r _ MTOTAL MUSED _ <<<"$(free -m | awk '/^Mem:/{print $1, $2, $3, $4}')"
MEM_PCT=$(( MUSED * 100 / MTOTAL ))

read -r _ STOTAL SUSED _ <<<"$(free -m | awk '/^Swap:/{print $1, $2, $3, $4}')"
SWAP_PCT=0
[ "$STOTAL" -gt 0 ] && SWAP_PCT=$(( SUSED * 100 / STOTAL ))

ALERTS=()

[ "$LOAD_PCT" -ge 200 ] && ALERTS+=("CPU load ${LOAD_PCT}% (load1=${LOAD1}, ${CPUS} CPU)")
[ "$MEM_PCT" -ge 90 ] && ALERTS+=("RAM ${MEM_PCT}% used (${MUSED}/${MTOTAL} MB)")
[ "$SWAP_PCT" -ge 50 ] && ALERTS+=("Swap ${SWAP_PCT}% (${SUSED}/${STOTAL} MB)")

if [ "${#ALERTS[@]}" -gt 0 ]; then
  TOP=$(ps -eo pid,user,pcpu,pmem,comm --sort=-pcpu | head -6)
  /usr/local/bin/notifly-send \
    "🔥 High load on $HOST" \
    "$(printf '%s\n' "${ALERTS[@]}")

Top processes:
$TOP" 8
fi

sudo chmod +x /usr/local/bin/notifly-load-check

Run every 5 minutes
This line includes the user field, so it goes in /etc/crontab or in a file under /etc/cron.d:

*/5 * * * * root /usr/local/bin/notifly-load-check

To avoid getting identical messages every 5 minutes, add a flag-file check inside the if block, right before the notifly-send call:

FLAG=/tmp/notifly-load-flag
NOW=$(date +%s)
LAST=$(stat -c %Y "$FLAG" 2>/dev/null || echo 0)
[ $((NOW - LAST)) -lt 1800 ] && exit 0
touch "$FLAG"

Repeat notifications will then arrive no more often than once every 30 minutes.
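The flag-file debounce can also be wrapped in a small reusable function, for example to keep a separate flag per alert type. This is a minimal sketch under stated assumptions: the name should_notify and its arguments are hypothetical (not part of notifly), and it relies on GNU stat (`-c %Y`).

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of notifly): flag-file debounce.
# Returns 0 and refreshes the flag when a notification may be sent,
# returns 1 when the last notification is newer than the window.
should_notify() {
  local flag=$1            # path to the flag file
  local window=${2:-1800}  # debounce window in seconds (default: 30 minutes)
  local now last
  now=$(date +%s)
  # Modification time of the flag, or 0 if the flag does not exist yet
  last=$(stat -c %Y "$flag" 2>/dev/null || echo 0)
  if (( now - last < window )); then
    return 1
  fi
  touch "$flag"
}
```

In the load script it could be called as `should_notify /tmp/notifly-load-flag 1800 || exit 0` right before notifly-send.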
OOM killer
When the kernel kills a process because of an out-of-memory condition, it logs the kill to dmesg.
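To make the alert easier to read, the PID and process name can be pulled out of the kernel message. The sketch below assumes the common "Killed process <pid> (<name>)" wording; the exact text varies between kernel versions, and parse_oom_line is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: extract "PID name" from a kernel OOM line.
# Assumes the "Killed process <pid> (<name>)" wording; prints nothing
# and returns 1 when the line does not match.
parse_oom_line() {
  local line=$1
  local re='[Kk]illed process ([0-9]+) \(([^)]*)\)'
  if [[ $line =~ $re ]]; then
    printf '%s %s\n' "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}"
  else
    return 1
  fi
}

parse_oom_line 'Out of memory: Killed process 4242 (mysqld) total-vm:8123456kB, anon-rss:4000000kB'
# prints: 4242 mysqld
```

A process name that itself contains a closing parenthesis would be truncated by this pattern.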
Let’s run a system watcher:
/usr/local/bin/notifly-oom:
#!/usr/bin/env bash
set -eu
set -a; source /etc/notifly.env; set +a

# Follow kernel messages from now on and alert on OOM killer lines
journalctl -kf -o cat --since now | \
while read -r line; do
  if echo "$line" | grep -qiE "killed process|out of memory"; then
    /usr/local/bin/notifly-send \
      "💀 OOM killer on $(hostname -s)" \
      "$line" 10 || true
  fi
done

/etc/systemd/system/notifly-oom.service:

[Unit]
Description=Notify about OOM kills
After=network-online.target

[Service]
ExecStart=/usr/local/bin/notifly-oom
Restart=always
RestartSec=10s

[Install]
WantedBy=multi-user.target

sudo chmod +x /usr/local/bin/notifly-oom
sudo systemctl daemon-reload
sudo systemctl enable --now notifly-oom

Windows: PowerShell + Task Scheduler
A Windows equivalent: it monitors CPU and RAM load and page file usage.
It uses the shared function Send-Notifly.
C:\scripts\Notifly-Load-Check.ps1:

. C:\scripts\Notifly.ps1

# $Host is a read-only automatic variable in PowerShell, so use another name
$Computer = $env:COMPUTERNAME
$FlagFile = "$env:TEMP\notifly-load.flag"

# CPU over the last ~3 seconds
$cpu = (Get-Counter '\Processor(_Total)\% Processor Time' -SampleInterval 1 -MaxSamples 3).CounterSamples.CookedValue |
    Measure-Object -Average | Select-Object -ExpandProperty Average
$cpu = [int]$cpu

$os = Get-CimInstance Win32_OperatingSystem
$memUsedPct = [int]((($os.TotalVisibleMemorySize - $os.FreePhysicalMemory) / $os.TotalVisibleMemorySize) * 100)

$pageFile = Get-CimInstance Win32_PageFileUsage -ErrorAction SilentlyContinue
$pageUsedPct = if ($pageFile -and $pageFile.AllocatedBaseSize -gt 0) {
    [int](($pageFile.CurrentUsage / $pageFile.AllocatedBaseSize) * 100)
} else { 0 }

$alerts = @()
if ($cpu -ge 90) { $alerts += "CPU: $cpu%" }
if ($memUsedPct -ge 90) { $alerts += "RAM: $memUsedPct%" }
if ($pageUsedPct -ge 50) { $alerts += "Page file: $pageUsedPct%" }

if ($alerts.Count -gt 0) {
    # Debounce for 30 minutes
    if (Test-Path $FlagFile) {
        $age = (Get-Date) - (Get-Item $FlagFile).LastWriteTime
        if ($age.TotalMinutes -lt 30) { exit 0 }
    }
    New-Item -ItemType File -Path $FlagFile -Force | Out-Null

    $top = Get-Process | Sort-Object CPU -Descending | Select-Object -First 5 |
        Format-Table -AutoSize ProcessName, Id,
            @{N='CPU(s)';E={[math]::Round($_.CPU,1)}},
            @{N='RAM(MB)';E={[int]($_.WorkingSet64/1MB)}} | Out-String

    # Parenthesize the whole expression so it is passed as a single -Message argument
    Send-Notifly -Title "🔥 High load on $Computer" `
        -Message (($alerts -join ", ") + "`n`nTop processes:`n$top") `
        -Priority 8
}

Register in Task Scheduler (every 5 minutes):

$Action = New-ScheduledTaskAction -Execute "powershell.exe" `
    -Argument "-NoProfile -ExecutionPolicy Bypass -File C:\scripts\Notifly-Load-Check.ps1"
$Trigger = New-ScheduledTaskTrigger -Once -At (Get-Date) `
    -RepetitionInterval (New-TimeSpan -Minutes 5)
$Princ = New-ScheduledTaskPrincipal -UserId "SYSTEM" -LogonType ServiceAccount -RunLevel Highest
Register-ScheduledTask -TaskName "Notifly Load Check" `
    -Action $Action -Trigger $Trigger -Principal $Princ

Windows equivalent of OOM killer
Windows doesn’t have an OOM killer, but low-memory conditions are logged in the System event log
with source Resource-Exhaustion-Detector (Event ID 2004). Let’s subscribe to that event:

# New-ScheduledTaskTrigger cannot create event triggers, so build one from its CIM class
$TriggerClass = Get-CimClass -ClassName MSFT_TaskEventTrigger `
    -Namespace Root/Microsoft/Windows/TaskScheduler
$Trigger = New-CimInstance -CimClass $TriggerClass -ClientOnly
$Trigger.Enabled = $true
$Trigger.Subscription = @"
<QueryList>
  <Query Id="0" Path="System">
    <Select Path="System">
      *[System[Provider[@Name='Microsoft-Windows-Resource-Exhaustion-Detector'] and EventID=2004]]
    </Select>
  </Query>
</QueryList>
"@
$Action = New-ScheduledTaskAction -Execute "powershell.exe" `
    -Argument "-NoProfile -ExecutionPolicy Bypass -Command `". C:\scripts\Notifly.ps1; Send-Notifly -Title '💀 Critical memory shortage on $env:COMPUTERNAME' -Message 'Resource-Exhaustion-Detector reported memory exhaustion' -Priority 10`""
Register-ScheduledTask -TaskName "Notifly Memory Exhaustion" -Trigger $Trigger -Action $Action `
    -Principal (New-ScheduledTaskPrincipal -UserId "SYSTEM" -LogonType ServiceAccount -RunLevel Highest)

Benefits
- You can see the moment of the spike. The top processes in the message often point straight to the culprit.
- Catching OOM kills is the most valuable part. These events quickly rotate out of the dmesg buffer, and without an alert they are discovered days later.
- A cheap “liveness sensor”: if a busy production server stays silent for days, that is a cue to check that monitoring itself is still alive.
What to improve next
- Use pidstat or pressure stall information (PSI) to assess real backpressure more accurately.
- Compare the current load with the weekly average and alert only on deviation.
- Link this to service outage notifications: an OOM is usually followed by the crash of a specific service.
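As a starting point for the PSI idea, the sketch below reads the avg10 value (the share of the last 10 seconds in which tasks were stalled) from a PSI file such as /proc/pressure/memory. It assumes a kernel with PSI enabled (4.20 or newer); the function names psi_avg10 and psi_exceeds are hypothetical, and the file path is a parameter so the parsing can be tried on a saved sample.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for reading pressure stall information (PSI).
# A PSI file looks like:
#   some avg10=0.00 avg60=0.00 avg300=0.00 total=0
#   full avg10=0.00 avg60=0.00 avg300=0.00 total=0

# Print the avg10 value for the given kind ("some" or "full").
psi_avg10() {
  local kind=$1 file=${2:-/proc/pressure/memory}
  awk -v kind="$kind" '$1 == kind { sub(/^avg10=/, "", $2); print $2 }' "$file"
}

# Return 0 if avg10 for the given kind is at or above the limit (percent).
psi_exceeds() {
  local kind=$1 limit=$2 file=${3:-/proc/pressure/memory}
  local value
  value=$(psi_avg10 "$kind" "$file")
  [ -n "$value" ] && awk -v v="$value" -v l="$limit" 'BEGIN { exit !(v + 0 >= l + 0) }'
}
```

In the load script this could become one more condition, for example `psi_exceeds some 10 && ALERTS+=("Memory pressure above 10%")`; the 10% limit is only an illustrative value.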