A server you can’t see is a server you can’t trust. Most incidents — a memory leak, a disk filling up, a process pegging the CPU — are visible in the data well before they cause an outage. The question is whether you’re looking. This guide covers the tools you need, how to read what they’re telling you, and how to set up monitoring that doesn’t require you to check manually.
:::note[TL;DR]
- CPU: `top`, `htop`, `mpstat` — look at load average relative to CPU count
- Memory: `free -h`, `vmstat` — watch for swap usage growing over time
- Disk: `df -h`, `du -sh`, `iotop` — full disks kill servers silently
- Persistent monitoring: Netdata or Prometheus + Grafana for dashboards and alerts
- One-liner health check: `uptime && free -h && df -h`
:::
## Prerequisites

- Linux server (Ubuntu 22.04/24.04 or Debian 12)
- SSH access
- Most tools here are pre-installed; some require `apt install`
## How do you check CPU usage?

Quick snapshot with `top`:

```bash
top
```
Key fields to read:

- `load average: 0.5, 1.2, 0.8` — CPU load over 1, 5, and 15 minutes. A healthy server has a load average below its CPU core count. A 4-core server with a load of 6.0 is overloaded.
- `%Cpu(s): 25.0 us, 5.0 sy, 0.0 ni, 68.0 id` — user, system, nice, and idle percentages. `id` (idle) below 20% means the CPU is under heavy load.
- `wa` (I/O wait) — if this is high (>10%), a slow disk is blocking processes, not the CPU itself.

Press `1` in top to see per-core stats. Press `P` to sort by CPU usage. Press `q` to quit.
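The load-vs-cores rule is easy to script. A minimal sketch (a hypothetical helper, assuming a Linux box where `nproc` and `/proc/loadavg` are available):

```bash
# Express the 1-minute load average as a share of total CPU capacity
cores=$(nproc)
load1=$(cut -d ' ' -f1 /proc/loadavg)
# awk handles the floating-point division
awk -v l="$load1" -v c="$cores" 'BEGIN {
  printf "load %.2f on %d cores: %.0f%% of capacity\n", l, c, (l / c) * 100
}'
```

Anything consistently near or above 100% of capacity deserves investigation.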
### Better: htop

```bash
sudo apt install htop -y
htop
```

`htop` is `top` with color, mouse support, and easier navigation. `F6` sorts by column. `F9` sends signals to processes. `F10` quits.
CPU stats per core over time:

```bash
# mpstat is part of the sysstat package: sudo apt install sysstat -y
mpstat -P ALL 2 5
# CPU stats for all cores, every 2 seconds, 5 samples
```
Find what’s eating CPU right now:

```bash
# Sort processes by CPU, show top 10
ps aux --sort=-%cpu | head -10
```
## How do you check memory usage?

Quick overview:

```bash
free -h
```

Output:

```
              total        used        free      shared  buff/cache   available
Mem:           7.7G        2.1G        1.2G        256M        4.4G        5.1G
Swap:          2.0G          0B        2.0G
```
The `available` column is what matters — it includes memory that can be reclaimed from `buff/cache`. `free` memory alone is misleading because Linux uses spare memory for caching disk reads; that cached memory is available to applications on demand.
Concerning signs:

- `available` approaching zero
- Swap `used` growing over time (swap use itself isn’t bad; growing swap use means you’re leaking memory)
- `available` continuing to drop while `free` is already low
Watch memory over time:

```bash
watch -n 2 free -h
# Refreshes every 2 seconds
```
Find which processes are using the most memory:

```bash
ps aux --sort=-%mem | head -10
```
Detailed memory breakdown:

```bash
cat /proc/meminfo
```

Key fields: `MemAvailable`, `SwapTotal`, `SwapFree`, `Cached`, `Buffers`.
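Those fields can be turned into a single alertable number. A sketch that reads `MemAvailable` and `MemTotal` directly (Linux-only; `/proc/meminfo` values are in kB):

```bash
# Compute available memory as a percentage of total
avail_kb=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)
total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
pct=$(awk -v a="$avail_kb" -v t="$total_kb" 'BEGIN { printf "%.0f", (a / t) * 100 }')
echo "available: ${pct}% of total memory"
```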
## How do you check disk usage?

Disk space by filesystem:

```bash
df -h
```

Look for any filesystem near 100%. A full `/` will crash the server. A full `/var` often kills logging and databases. Set up an alert when any filesystem hits 80%.
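That 80% check can be sketched in a few lines of portable shell (`df -P` guarantees one line per filesystem with stable columns):

```bash
#!/bin/sh
# Warn about any filesystem above THRESHOLD percent full
THRESHOLD=80
df -P | awk -v t="$THRESHOLD" 'NR > 1 {
  use = $5
  sub(/%/, "", use)
  if (use + 0 > t) printf "WARNING: %s is %s%% full (mounted at %s)\n", $1, use, $6
}'
```

Anything it prints is a filesystem worth investigating; wire it into cron or a monitoring agent if you want it to run unattended.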
What’s using the space:

```bash
# Top-level directories in /var
sudo du -sh /var/* | sort -h

# Drill down into a specific directory
sudo du -sh /var/log/* | sort -h

# Find largest files anywhere on the system (du prints sizes so sort -h works)
sudo find / -type f -size +100M -exec du -h {} + 2>/dev/null | sort -h
```
Common culprits on production servers:

- `/var/log` — logs that aren’t rotating
- Docker storage at `/var/lib/docker` — old images and stopped containers
- Database data directories
- Application caches and temp files
Clean up Docker storage:

```bash
# Show Docker disk usage
docker system df

# Remove stopped containers, unused networks, and dangling images
docker system prune -f
# Add -a to also remove all unused (not just dangling) images

# Also remove volumes (destructive — check first)
docker system prune -f --volumes
```
### Disk I/O — is the disk the bottleneck?

```bash
sudo apt install iotop -y
sudo iotop -o
# -o shows only processes doing active I/O
```

In `top`, `wa` (I/O wait) above 5-10% is a signal that processes are waiting on disk reads or writes.
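If `iotop` isn’t installed, the iowait share can be derived straight from `/proc/stat`. A sketch that samples the aggregate CPU line twice, one second apart:

```bash
# Fields on the first line of /proc/stat: user nice system idle iowait ...
read -r _ u1 n1 s1 i1 w1 _ < /proc/stat
sleep 1
read -r _ u2 n2 s2 i2 w2 _ < /proc/stat
# Ticks elapsed in the window, and how many of them were iowait
total=$(( (u2 + n2 + s2 + i2 + w2) - (u1 + n1 + s1 + i1 + w1) ))
wait_ticks=$(( w2 - w1 ))
awk -v w="$wait_ticks" -v t="$total" 'BEGIN {
  printf "iowait: %.1f%%\n", (t > 0) ? (w / t) * 100 : 0
}'
```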
## How do you check network usage?

Active connections and listening ports:

```bash
ss -tlnp   # TCP listening sockets with process names
ss -s      # Summary stats
```
Network traffic in real time:

```bash
sudo apt install nload -y
nload eth0
# Substitute your interface name (see: ip link)
```
Per-connection bandwidth:

```bash
sudo apt install nethogs -y
sudo nethogs eth0
```
## How do you check what’s running?

All running services (systemd):

```bash
systemctl list-units --type=service --state=running
```
Recent service failures:

```bash
journalctl -p err -n 50
# Last 50 error-level log entries across all services
```
Specific service logs:

```bash
journalctl -u nginx -n 100 -f
# Follow nginx logs in real time
```
System uptime and load:

```bash
uptime
# 10:23:45 up 42 days, 3:12, 2 users, load average: 0.45, 0.38, 0.32
```
## How do you set up persistent monitoring?
Checking tools manually doesn’t catch problems that happen at 3am. You need something that runs continuously and alerts you.
### Option 1: Netdata — zero config, beautiful dashboards, alerting built in

```bash
bash <(curl -Ss https://my-netdata.io/kickstart.sh)
```

After installation, the dashboard is at `http://your-server-ip:19999`. Netdata comes with pre-configured alerts for CPU, memory, disk, and dozens of other metrics. No manual configuration is required to get useful monitoring.
### Option 2: Prometheus + Grafana — more complex, more powerful

Install `node_exporter` on each server to expose metrics:

```bash
# Download and install node_exporter
# (check https://github.com/prometheus/node_exporter/releases for the current version)
VERSION=1.8.2
wget https://github.com/prometheus/node_exporter/releases/download/v${VERSION}/node_exporter-${VERSION}.linux-amd64.tar.gz
tar xzf node_exporter-${VERSION}.linux-amd64.tar.gz
sudo mv node_exporter-${VERSION}.linux-amd64/node_exporter /usr/local/bin/

# Run as a service
sudo useradd -rs /bin/false node_exporter
sudo tee /etc/systemd/system/node_exporter.service > /dev/null <<EOF
[Unit]
Description=Node Exporter

[Service]
User=node_exporter
ExecStart=/usr/local/bin/node_exporter

[Install]
WantedBy=default.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now node_exporter
```
Metrics are now at `http://your-server-ip:9100/metrics`. Point Prometheus at that endpoint, then build Grafana dashboards from the collected data.
Prometheus + Grafana is the industry standard for self-hosted infrastructure monitoring — Grafana’s node exporter dashboard (ID 1860) gives you CPU, memory, disk, and network in one import.
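For reference, a minimal Prometheus scrape config pointing at that endpoint might look like this (a sketch; the target address is a placeholder):

```yaml
# prometheus.yml (minimal sketch)
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: node
    static_configs:
      - targets: ['your-server-ip:9100']
```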
### Option 3: Simple cron-based disk alert

If a full monitoring stack is overkill, a cron job that emails you when disk usage is over 80% is better than nothing:

```bash
crontab -e
```

Add:

```bash
# Check disk at 8am every day
# Note: % is special in crontab entries and must be escaped as \%
# Requires working outbound mail (e.g. apt install mailutils)
0 8 * * * df -h | awk 'NR>1 {gsub(/\%/,""); if ($5 > 80) print "DISK ALERT: "$0}' | mail -s "Disk Alert $(hostname)" you@example.com
```
## One-liner server health check

A quick command to paste when you need a snapshot:

```bash
echo "=== UPTIME ===" && uptime && \
echo "=== CPU ===" && (mpstat 1 1 2>/dev/null || top -bn1 | grep "Cpu(s)") && \
echo "=== MEMORY ===" && free -h && \
echo "=== DISK ===" && df -h && \
echo "=== TOP PROCESSES ===" && ps aux --sort=-%cpu | head -6
```
## Summary

- Load average vs. CPU count tells you if the CPU is under pressure; `id` (idle) below 20% confirms it
- `available` memory (not `free`) is the real number to watch; growing swap use means a memory leak
- Disk health is critical — `df -h` for an overview, `du -sh` to drill down; a full `/var` kills logs and databases
- `iotop` for disk I/O bottlenecks; `nethogs` for per-process network usage
- Netdata is the fastest path to persistent monitoring with alerts; Prometheus + Grafana is the production standard
## FAQ

### My load average is 4.0 but the server feels fine. Is that bad?

It depends on your CPU count. A load average of 4.0 on a 4-core server means it’s running at capacity — no headroom. On an 8-core server, 4.0 means it’s at 50%. Run `nproc` to see your core count. A good rule of thumb: alert when load average > (CPU cores × 0.8) for more than 5 minutes.
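That rule of thumb translates directly into a few lines of shell. A sketch using the 5-minute load average from `/proc/loadavg`:

```bash
# Fire when the 5-minute load average exceeds 80% of core count
cores=$(nproc)
load5=$(cut -d ' ' -f2 /proc/loadavg)
awk -v l="$load5" -v c="$cores" 'BEGIN {
  if (l > c * 0.8) print "ALERT: load " l " exceeds 80% of " c " cores"
  else             print "OK: load " l " is within headroom on " c " cores"
}'
```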
### The server is slow but CPU and memory look fine. What else should I check?

I/O wait (the `wa` column in `top`) is the first thing to check. High I/O wait means processes are blocked waiting for the disk. Run `sudo iotop -o` to see which process is doing the I/O. After that, check the network with `ss -s` and `nethogs`. Finally, check database slow-query logs — a slow query running in a loop looks like CPU idle but causes all application threads to stack up.
### How do I find what deleted file is keeping my disk full?

When a file is deleted while a process still has it open, the space isn’t freed until the process releases it. `df` shows low free space, but `du` can’t see the deleted file, so the numbers disagree. Find the culprit:

```bash
sudo lsof +L1
# Lists all open files with link count 0 (deleted but held open)
```

The output shows which process is holding the file open. Restarting that process releases the disk space.
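If restarting the process isn’t an option, the space can often be reclaimed by truncating the deleted file through the process’s file descriptor (a sketch; `<PID>` and `<FD>` are placeholders taken from the `lsof` output):

```bash
# Truncate the still-open deleted file via /proc; frees the space immediately.
# Only safe for files like logs whose contents the process doesn't need back.
: > /proc/<PID>/fd/<FD>
```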
## What to read next
- Linux Bash Cheat Sheet — essential commands for server management
- SSH Hardening Guide — securing access before granting access to monitoring dashboards
- How to Set Up a Reverse Proxy with Nginx — proxy your monitoring dashboards securely