Infrastructure Monitoring with Prometheus: Beginner's Guide to Metrics, Alerts, and Dashboards
Infrastructure monitoring is crucial for maintaining the performance and availability of your servers, containers, and services. In this beginner’s guide, we will delve into using Prometheus, an open-source monitoring toolkit, to help system administrators, site reliability engineers (SREs), and small operations teams effectively track metrics, set up alerts, and create dashboards. Expect a comprehensive walkthrough of core concepts, a hands-on Docker setup, essential exporters, PromQL basics, visualization with Grafana, and best practices for scaling and troubleshooting.
Core Concepts You Need to Know
Understanding Prometheus and monitoring concepts starts with a grasp of a few fundamentals.
Time Series Fundamentals
- Each metric is a time series identified by a metric name and zero or more labels (key/value pairs), for example `node_cpu_seconds_total{mode="user", cpu="0"}`.
- Labels let you slice and aggregate metrics by dimensions such as instance, job, or region.
Metric Types (When to Use Each)
| Type | What It Represents | When to Use |
|---|---|---|
| Counter | A value that only increases (or resets) | Use for request counts, bytes sent, or errors. Use rate() to compute per-second values. |
| Gauge | A value that can fluctuate up or down | Ideal for monitoring memory usage, temperature, or queue depth. |
| Histogram | Buckets of observed values with counts and sums | Useful for recording request latency distributions (you can compute approximated quantiles). |
| Summary | Similar to a histogram but with client-side quantiles | Best for tracking request latency per instance with calculated quantiles. |
Prometheus Scraping (Pull Model)
Prometheus periodically scrapes metrics from HTTP endpoints, typically using the /metrics path. The configuration file (prometheus.yml) defines scrape_configs with jobs and targets, allowing Prometheus to manage the scrape intervals and facilitate easier target discovery.
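To see the pull model from the target's side, here is a minimal sketch of a `/metrics` endpoint built with only the Python standard library. It serves the Prometheus text exposition format; the metric name `demo_requests_total` is invented for illustration, and real applications would use an official client library instead:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

REQUESTS = {"count": 0}  # toy counter: number of scrapes served

def render_metrics():
    """Render the counter in the Prometheus text exposition format."""
    REQUESTS["count"] += 1
    return (
        "# HELP demo_requests_total Total scrapes served by this demo.\n"
        "# TYPE demo_requests_total counter\n"
        f'demo_requests_total{{handler="/metrics"}} {REQUESTS["count"]}\n'
    )

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/metrics":
            body = render_metrics().encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

# To serve: HTTPServer(("", 8000), MetricsHandler).serve_forever()
# Prometheus would then scrape http://localhost:8000/metrics on its interval.
```

Adding a `static_configs` target pointing at port 8000 would let the Prometheus server from the Docker setup below pull these values on each scrape interval.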
Exporters
Exporters are tools that expose metrics for systems that do not provide Prometheus metrics natively. Common exporters include:
- `node_exporter`: collects OS and hardware metrics (CPU, memory, disk, network).
- `cAdvisor`: monitors container resource metrics for Docker containers.
- `blackbox_exporter`: probes external endpoints (HTTP/TCP/ICMP) for availability.
- `kube-state-metrics` and kubelet metrics: provide Kubernetes cluster state metrics (Pods, Deployments, resource requests).
Prometheus Server Responsibilities
The Prometheus server takes care of scraping metrics, storing them in the local time-series database (TSDB), evaluating alerting and recording rules, and sending alerts to Alertmanager.
PromQL Basics
PromQL is the query language that you will use to select, aggregate, and compute on time series data. Some core patterns include:
- Selectors: `node_cpu_seconds_total{job="node"}` (instant vector) or `node_cpu_seconds_total[5m]` (range vector).
- Functions like `rate()` and `increase()` convert monotonically increasing counters into per-second rates or total increases over a time window.
- Aggregations like `sum(...) by (instance)` or `avg(...) by (job)`.
Alertmanager
Alertmanager receives alerts from Prometheus, deduplicates them, applies routing rules, and forwards notifications to various receivers such as email, Slack, or PagerDuty.
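As a sketch of that routing step, a minimal `alertmanager.yml` that sends everything to a single Slack receiver might look like the following (the webhook URL and channel name are placeholders you would replace with your own):

```yaml
route:
  receiver: slack-ops
  group_by: ['alertname', 'instance']
  group_wait: 30s
  repeat_interval: 4h
receivers:
  - name: slack-ops
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'
        channel: '#ops-alerts'
```

Grouping by `alertname` and `instance` collapses related firing alerts into one notification instead of a page per series.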
Visualization
Grafana is the de facto standard dashboarding tool for Prometheus metrics.
Quick Hands-On: Get Prometheus Running Locally (Docker)
Prerequisites
- Docker or Docker Compose is required. On Windows, use WSL2 or Docker Desktop (see our WSL installation and configuration guides).
Create a docker-compose.yml
Here’s a minimal stack configuration involving Prometheus, node_exporter, and Grafana:
```yaml
version: '3.7'
services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
    ports:
      - 9090:9090
  node_exporter:
    image: prom/node-exporter:latest
    ports:
      - 9100:9100
    command:
      - '--path.rootfs=/host'
    volumes:
      - /:/host:ro
  grafana:
    image: grafana/grafana:latest
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    ports:
      - 3000:3000
```
A basic prometheus.yml configuration would look like this:
```yaml
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['prometheus:9090']
  - job_name: 'node'
    static_configs:
      - targets: ['node_exporter:9100']
```
Start the Stack
Run:
```shell
docker-compose up -d
```
Verify
- Prometheus UI: visit http://localhost:9090 and check the `/targets` page for active targets.
- Node Exporter metrics: open http://localhost:9100/metrics to see metrics like `node_cpu_seconds_total`.
- Grafana: go to http://localhost:3000 (username: admin, password: admin) and add Prometheus as a data source using `http://prometheus:9090` (or `http://host.docker.internal:9090` if using Docker Desktop).
Quick Tips
- On Windows, WSL2 is recommended to run Linux-based containers and tooling (Install WSL on Windows guide).
- If metrics are missing, check Prometheus's `/targets` page for any scrape errors.
For additional installation options and examples, view the Prometheus documentation.
Essential Exporters and What They Monitor
- node_exporter: Monitors host-level metrics including CPU, memory, disk, and network.
- cAdvisor: Tracks resource usage of Docker containers.
- blackbox_exporter: Probes HTTP, DNS, and TCP endpoints for availability (ideal for uptime checks).
- kube-state-metrics: Exports Kubernetes API object states (Deployments, Pods) as metrics. Combine this with kubelet metrics for complete insight into your Kubernetes cluster (see Kubernetes metrics pipeline documentation).
- Application exporters: such as `jmx_exporter`, `postgres_exporter`, `mysqld_exporter`, and `redis_exporter` for app-specific metrics beyond standard resource usage.
How to Find Exporters
Official Prometheus exporters are available on GitHub and Docker Hub. Evaluate each exporter by examining activity, open issues, and documentation.
If you operate containers, consider our Container Networking guide for insights into making exporter endpoints accessible across networks and container platforms.
PromQL Essentials — Queries You’ll Use Every Day
PromQL is crucial for mining insights from your metrics. Below are some common patterns with concrete examples.
Selectors and Vectors
- Instant vector: the current value for each matching series, e.g. `node_memory_MemAvailable_bytes`.
- Range vector: values over a specific time window, e.g. `node_cpu_seconds_total[5m]`.
rate() and increase()
Counters must be converted to rates before graphing or alerting, because they only ever increase (resetting to zero when a process restarts). Use `rate()` to compute per-second rates over a window; it also compensates for resets.
Examples:
- CPU usage (per-core and total)
  - Per-core usage: `100 * (1 - avg by (cpu, instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])))`
  - Total CPU usage: `100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])))`
- Memory usage
  - Available memory ratio: `(node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100`
- Disk I/O
  - Read throughput: `rate(node_disk_read_bytes_total[5m])`
  - Write throughput: `rate(node_disk_written_bytes_total[5m])`
- Network throughput
  - Receive: `rate(node_network_receive_bytes_total[5m])`
  - Transmit: `rate(node_network_transmit_bytes_total[5m])`
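To build intuition for what `rate()` computes, here is a pure-Python sketch that derives a per-second rate from raw counter samples, including the counter-reset compensation described above. The sample data is invented for illustration; this is a simplified model, not Prometheus's exact extrapolation logic:

```python
def counter_rate(samples):
    """Approximate PromQL rate(): per-second increase of a counter
    over a window, compensating for resets (value drops toward zero).

    samples: list of (timestamp_seconds, value), oldest first.
    """
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    prev = samples[0][1]
    for _, value in samples[1:]:
        if value < prev:        # counter reset: the process restarted
            increase += value   # count only what accrued since the reset
        else:
            increase += value - prev
        prev = value
    window = samples[-1][0] - samples[0][0]
    return increase / window

# A counter growing by ~1/s, with a reset at t=130
samples = [(0, 0), (60, 60), (120, 120), (130, 10), (180, 60)]
print(counter_rate(samples))  # → 1.0 despite the reset
```

Without the reset handling, the drop from 120 to 10 would produce a large negative increase and a nonsensical rate, which is why raw counter values should never be graphed directly.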
Aggregations and Grouping
Functions such as sum(), avg(), and max() can be used with by(...) to summarize data across different dimensions. For example, total network received per job:
`sum(rate(node_network_receive_bytes_total[5m])) by (job)`
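The grouping semantics of `by (...)` can be modeled in a few lines of Python: series sharing the same values for the kept labels are merged, and all other labels are discarded. The series data below is invented for illustration:

```python
from collections import defaultdict

def sum_by(series, *group_labels):
    """Toy model of PromQL sum(...) by (labels): group series by the
    kept labels and add their values.

    series: list of (label_dict, value) pairs.
    Returns a dict keyed by the kept (label, value) pairs.
    """
    totals = defaultdict(float)
    for labels, value in series:
        key = tuple((name, labels.get(name, "")) for name in group_labels)
        totals[key] += value
    return dict(totals)

series = [
    ({"job": "node", "instance": "a:9100"}, 120.0),
    ({"job": "node", "instance": "b:9100"}, 80.0),
    ({"job": "prometheus", "instance": "p:9090"}, 5.0),
]
print(sum_by(series, "job"))
# two node instances collapse into one series per job
```

Note how the `instance` label disappears from the output: anything not listed in `by (...)` is dropped, which is exactly what you see in the Prometheus query UI.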
Recording Rules
Recording rules allow you to persist the results of queries as new time series, improving dashboard performance and minimizing repeated heavy queries. You can define these in a rules file referenced in the Prometheus configuration.
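As a sketch, a recording-rule file (referenced from `prometheus.yml` via `rule_files`) might look like this. The rule name follows the common `level:metric:operations` naming convention; the group name and interval here are our own choices:

```yaml
groups:
  - name: node-recording-rules
    interval: 30s
    rules:
      - record: instance:node_cpu_utilisation:rate5m
        expr: 100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])))
```

Dashboards can then query the precomputed `instance:node_cpu_utilisation:rate5m` series instead of re-evaluating the full expression on every refresh.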
Visualizing Metrics with Grafana
Add Prometheus as a data source in Grafana (see Grafana documentation: Getting Started with Prometheus).
- Data source URL: `http://<prometheus_host>:9090`.
- Create dashboards featuring typical panels such as CPU, memory, disk, and network usage.
- Common panel types include time series graphs, gauges, stat/single-value panels, and tables.
- Consider importing community dashboards like “Node Exporter Full” from Grafana’s dashboard library and customize them as needed.
Dashboard Hygiene Tips
- Use variables (templating) for `instance` or `job` to reuse panels across hosts.
- Limit panel queries to appropriate time ranges to reduce load.
- Export dashboards as JSON for version control and sharing.
Alerting with Prometheus and Alertmanager
Write alerting rules in Prometheus using expressions that evaluate to true when an alert should be triggered. Example alert rules can be placed in a rules.yml file referenced by prometheus.yml.
Example: High CPU sustained (> 80% for 5 minutes)
```yaml
- alert: HighCpuLoad
  expr: 100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 80
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "High CPU usage on {{ $labels.instance }}"
    runbook_url: "https://your-runbooks.example.com/high-cpu"
```
Disk space low (< 10% free)
```yaml
- alert: DiskSpaceLow
  expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 10
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Low disk space on {{ $labels.instance }}"
```
Node down
```yaml
- alert: InstanceDown
  expr: up == 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Instance {{ $labels.instance }} is down"
```
Best Practices for Alerting
- Only trigger alerts on actionable and human-readable conditions.
- Use `for:` to avoid alert flapping from transient spikes.
- Add `runbook_url` and helpful annotations for responders.
Scaling Prometheus & Best Practices for Production
While Prometheus performs well in small to medium environments, it can face challenges with long retention or high cardinality. Consider the following options for scaling:
- Federation: Scraping metrics from downstream Prometheus instances in multi-cluster or hierarchical setups.
- Remote Write/Read: Utilize long-term storage solutions like Thanos, Cortex, or Mimir for scalable storage and querying.
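For example, shipping samples to a remote store is a small addition to `prometheus.yml`; the URL below is a placeholder for whatever Thanos, Cortex, or Mimir ingest endpoint you actually run:

```yaml
remote_write:
  - url: "https://mimir.example.internal/api/v1/push"
    queue_config:
      max_samples_per_send: 5000
```

Local TSDB storage keeps working alongside remote write, so recent queries stay fast while long-retention queries go to the remote backend.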
Security
Implement security measures for Prometheus and metric endpoints, such as TLS, network segmentation, and authentication. Avoid exposing /metrics endpoints publicly.
Observability Hygiene
- Maintain awareness of label cardinality; avoid high cardinality labels (e.g., user IDs) to prevent overwhelming series counts.
- Apply `relabel_configs` to drop or transform labels during scraping.
- Use consistent naming conventions for metrics and labels to streamline downstream analysis.
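As a sketch, dropping a hypothetical high-cardinality label from scraped samples could look like the snippet below. Note that label-dropping on ingested samples uses `metric_relabel_configs`, while `relabel_configs` operates on targets before the scrape; the job name, target, and `request_id` label are all illustrative:

```yaml
scrape_configs:
  - job_name: 'app'
    static_configs:
      - targets: ['app:8080']
    metric_relabel_configs:
      - action: labeldrop
        regex: 'request_id'   # drop a hypothetical per-request label
```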
Common Pitfalls and How to Troubleshoot Them
- High cardinality: watch labels such as `instance_id` and `container_id`; reduce cardinality by trimming or dropping labels.
- Raw counters: always use `rate()` for counters; otherwise they show up as large, ever-increasing numbers.
- Missing metrics: check the Prometheus `/targets` page and the exporter's `/metrics` endpoint. Firewall or container-network misconfigurations often cause this (see our Container Networking guide).
- Slow queries: use recording rules to persist heavy computations and avoid complex range queries on dashboards.
- Storage Issues: Monitor TSDB size and check Prometheus logs for compaction errors; consider remote storage if long retention is required.
10-Minute Runbook: From Zero to a Useful Dashboard
This checklist will guide you in setting up a basic monitoring system swiftly:
- Install Docker and Docker Compose (or run locally with binaries).
- Create a `docker-compose.yml` that includes Prometheus, `node_exporter`, and Grafana.
- Write a `prometheus.yml` with `scrape_configs` for Prometheus and `node_exporter`.
- Start the stack using `docker-compose up -d`.
- Confirm node_exporter is serving metrics: `curl http://localhost:9100/metrics`.
- Open Prometheus at http://localhost:9090 and check `/targets`.
- Access Grafana at http://localhost:3000 (admin/admin) and add Prometheus as a data source.
- Import a Node Exporter dashboard from Grafana's library.
- Create at least one alert rule (e.g., high CPU) and configure Alertmanager to notify you via email or Slack; add a basic `alertmanager` service to the Compose file or run it via Docker.
- Test an alert by adjusting thresholds or temporarily increasing CPU usage.
For useful commands, run:
```shell
docker-compose logs prometheus
curl http://localhost:9090/targets
curl http://localhost:9100/metrics | head
```
In case of issues, refer to /targets for scrape errors, review exporter logs, and verify network connectivity.
Conclusion & Next Steps
You now have a solid understanding of Prometheus fundamentals, including architecture, exporters, PromQL basics, Grafana dashboards, and alerting with Alertmanager. Consider the following next steps:
- Instrument a simple application with a Prometheus client library (available in Go, Python, Java) to export custom metrics.
- Investigate long-term storage solutions like Thanos, Cortex, or Mimir for enhanced retention and global queries.
- If you maintain a Kubernetes environment, use kube-state-metrics and kubelet scraping for comprehensive cluster insights (see the Kubernetes metrics pipeline documentation).
Further Reading & Resources
- Official Prometheus documentation — Introduction and overview
- Grafana documentation — Get started with Prometheus
- Kubernetes monitoring documentation — Metrics pipeline
You might find the following internal guides helpful:
- Container networking basics — Useful when troubleshooting scrape connectivity.
- Install WSL on Windows — Run Prometheus locally on Windows.
- WSL configuration guide — Handy for setting up local labs on Windows.
- Windows containers & Docker integration — If running Prometheus in Windows containers.
- Windows Event Log and Performance Monitor guides — Helpful for correlating traces and traditional metrics.
- Building a home lab hardware guide — Useful for running a monitoring stack on local hardware.
Explore community dashboards and exporter repositories:
- Grafana dashboard library (search for “Node Exporter Full”).
- Official Prometheus exporter repositories on GitHub: `prom/node-exporter`, `prom/cadvisor`, `prometheus/blackbox_exporter`.
This guide serves as a comprehensive starting point for anyone looking to implement effective infrastructure monitoring with Prometheus.