Infrastructure Monitoring with Prometheus: Beginner's Guide to Metrics, Alerts, and Dashboards


Infrastructure monitoring is crucial for maintaining the performance and availability of your servers, containers, and services. This beginner's guide covers Prometheus, an open-source monitoring toolkit, and shows system administrators, site reliability engineers (SREs), and small operations teams how to track metrics, set up alerts, and build dashboards. Expect a walkthrough of core concepts, a hands-on Docker setup, essential exporters, PromQL basics, visualization with Grafana, and best practices for scaling and troubleshooting.

Core Concepts You Need to Know

Understanding Prometheus and monitoring concepts starts with a grasp of a few fundamentals.

Time Series Fundamentals

  • Each metric is a time series identified by a metric name and zero or more labels (key/value pairs). An example would be node_cpu_seconds_total{mode="user", cpu="0"}.
  • Labels enable users to slice and aggregate metrics by various dimensions like instance, job, or region.

Metric Types (When to Use Each)

| Type | What It Represents | When to Use |
| --- | --- | --- |
| Counter | A value that only increases (or resets to zero) | Request counts, bytes sent, or errors; use rate() to compute per-second values. |
| Gauge | A value that can fluctuate up or down | Memory usage, temperature, or queue depth. |
| Histogram | Buckets of observed values with counts and sums | Request latency distributions; quantiles can be approximated at query time. |
| Summary | Similar to a histogram but with client-side quantiles | Per-instance request latency with precomputed quantiles. |
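To build intuition for how histogram buckets make approximate quantiles possible, here is a small Python sketch of the linear interpolation that PromQL's histogram_quantile() performs; the bucket bounds and counts are invented sample data.

```python
# Sketch of the interpolation behind PromQL's histogram_quantile().
# Bucket bounds and cumulative counts below are invented sample data.

def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by bound.
    """
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            # Assume observations are spread evenly inside the bucket.
            frac = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * frac
        prev_bound, prev_count = bound, count
    return prev_bound

# Cumulative counts: 50 requests took <= 0.1s, 90 <= 0.5s, 100 <= 1.0s
buckets = [(0.1, 50), (0.5, 90), (1.0, 100)]
print(histogram_quantile(0.9, buckets))  # ~0.5 (90th percentile latency)
```

Because the answer is interpolated from bucket boundaries, accuracy depends on how well your bucket layout matches the distribution you are observing.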

Prometheus Scraping (Pull Model)

Prometheus periodically scrapes metrics from HTTP endpoints, typically using the /metrics path. The configuration file (prometheus.yml) defines scrape_configs with jobs and targets, allowing Prometheus to manage the scrape intervals and facilitate easier target discovery.
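As an illustration of the pull model, a scrape target is just an HTTP endpoint serving plain text in the Prometheus exposition format. The following Python sketch exposes one counter; the metric name myapp_requests_total and port 8000 are made-up examples.

```python
# Sketch of a scrape target: an HTTP server exposing one counter in the
# Prometheus text exposition format. The metric name myapp_requests_total
# and port 8000 are made-up examples.
from http.server import BaseHTTPRequestHandler, HTTPServer

REQUESTS = 0  # counter: only ever incremented

def render_metrics():
    # HELP/TYPE comment lines are part of the exposition format.
    return (
        "# HELP myapp_requests_total Total HTTP requests served.\n"
        "# TYPE myapp_requests_total counter\n"
        f"myapp_requests_total {REQUESTS}\n"
    )

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        global REQUESTS
        REQUESTS += 1
        if self.path == "/metrics":
            body = render_metrics().encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

# To serve real scrapes, run:
# HTTPServer(("", 8000), MetricsHandler).serve_forever()
```

A scrape job pointed at this endpoint would pull it once per scrape_interval; in practice you would use an official client library rather than hand-rolling the format.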

Exporters

Exporters are tools that expose metrics for systems that do not provide Prometheus metrics natively. Common exporters include:

  • node_exporter: Collects OS and hardware metrics (CPU, memory, disk, network).
  • cAdvisor: Monitors container resource metrics for Docker containers.
  • blackbox_exporter: Probes external endpoints (HTTP/TCP/ICMP) for availability.
  • kube-state-metrics and kubelet metrics: Provide Kubernetes cluster state metrics (Pods, Deployments, resource requests).

Prometheus Server Responsibilities

The Prometheus server takes care of scraping metrics, storing them in the local time-series database (TSDB), evaluating alerting and recording rules, and sending alerts to Alertmanager.

PromQL Basics

PromQL is the query language that you will use to select, aggregate, and compute on time series data. Some core patterns include:

  • Selectors: node_cpu_seconds_total{job="node"} (instant vector) or node_cpu_seconds_total[5m] (range vector).
  • Functions like rate() and increase() convert monotonically increasing counters into per-second rates or increases over specific time windows.
  • Aggregations using syntax like sum(...) by (instance) or avg(...) by (job).

Alertmanager

Alertmanager receives alerts from Prometheus, deduplicates them, applies routing rules, and forwards notifications to various receivers such as email, Slack, or PagerDuty.
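As a sketch, a minimal alertmanager.yml that routes everything to one Slack receiver might look like this; the webhook URL and channel are placeholders you would replace with your own:

```yaml
route:
  receiver: team-slack
  group_by: ['alertname', 'instance']  # batch related alerts together
  group_wait: 30s                      # wait before sending a new group
  repeat_interval: 4h                  # re-notify unresolved alerts

receivers:
  - name: team-slack
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'  # placeholder
        channel: '#alerts'
```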

Visualization

Grafana is the de facto standard dashboarding tool for Prometheus metrics.

Quick Hands-On: Get Prometheus Running Locally (Docker)

Prerequisites

  • Docker or Docker Compose is required. On Windows, use WSL2 or Docker Desktop (see our WSL installation and configuration guides).

Create a docker-compose.yml

Here’s a minimal stack configuration involving Prometheus, node_exporter, and Grafana:

version: '3.7'
services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
    ports:
      - 9090:9090

  node_exporter:
    image: prom/node-exporter:latest
    ports:
      - 9100:9100
    command:
      - '--path.rootfs=/host'
    volumes:
      - /:/host:ro

  grafana:
    image: grafana/grafana:latest
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    ports:
      - 3000:3000

A basic prometheus.yml configuration would look like this:

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['prometheus:9090']

  - job_name: 'node'
    static_configs:
      - targets: ['node_exporter:9100']

Start the Stack

Run:

docker-compose up -d

Verify

  • Prometheus UI: Visit http://localhost:9090 and check the /targets page for active targets.
  • Node Exporter metrics: Access http://localhost:9100/metrics to see metrics like node_cpu_seconds_total.
  • Grafana: Go to http://localhost:3000 (username: admin, password: admin), and add Prometheus as a data source using http://prometheus:9090 (or http://host.docker.internal:9090 if using Docker Desktop).

Quick Tips

  • On Windows, WSL2 is recommended to run Linux-based containers and tooling (Install WSL on Windows guide).
  • If metrics are missing, check Prometheus’s /targets page for any scrape errors.

For additional installation options and examples, view the Prometheus documentation.

Essential Exporters and What They Monitor

  • node_exporter: Monitors host-level metrics including CPU, memory, disk, and network.
  • cAdvisor: Tracks resource usage of Docker containers.
  • blackbox_exporter: Probes HTTP, DNS, and TCP endpoints for availability (ideal for uptime checks).
  • kube-state-metrics: Exports Kubernetes API object states (Deployments, Pods) as metrics. Combine this with kubelet metrics for complete insight into your Kubernetes cluster (see Kubernetes metrics pipeline documentation).
  • Application exporters: Such as jmx_exporter, postgres_exporter, mysqld_exporter, and redis_exporter for app-specific metrics beyond standard resource usage.

How to Find Exporters

Official Prometheus exporters are available on GitHub and Docker Hub. Evaluate each exporter by examining activity, open issues, and documentation.

If you operate containers, consider our Container Networking guide for insights into making exporter endpoints accessible across networks and container platforms.

PromQL Essentials — Queries You’ll Use Every Day

PromQL is crucial for mining insights from your metrics. Below are some common patterns with concrete examples.

Selectors and Vectors

  • Instant vector: Use to get the current value for each matching series. Example: node_memory_MemAvailable_bytes.
  • Range vector: For values over a specific time window: node_cpu_seconds_total[5m].

rate() and increase()

Counters monotonically increase and occasionally reset, so their raw values are rarely useful on their own. Use rate() for per-second rates over a time window and increase() for the total growth across that window.
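To make this concrete, here is a Python sketch of what rate() computes from raw counter samples, including Prometheus-style handling of a counter reset; the sample data is invented.

```python
# Sketch of what rate() computes: a per-second rate over a window of raw
# counter samples, with Prometheus-style handling of counter resets.
# The sample data below is invented.

def per_second_rate(samples):
    """samples: list of (timestamp_seconds, counter_value), oldest first."""
    increase = 0.0
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        if v1 >= v0:
            increase += v1 - v0
        else:
            # Counter reset: assume it restarted from zero, so the whole
            # new value counts as increase.
            increase += v1
    window = samples[-1][0] - samples[0][0]
    return increase / window

# 30-second window; the counter resets between t=20 and t=30
samples = [(0, 100), (10, 160), (20, 220), (30, 40)]
print(per_second_rate(samples))  # (60 + 60 + 40) / 30
```

The real rate() also extrapolates to the edges of the window, so its result will differ slightly from this naive version, but the reset handling is the key idea.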

Examples:

  1. CPU usage (per-core and total)

    • Per-core usage:
    100 * (1 - avg by (cpu, instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])))
    
    • Total CPU usage (averaged across cores):
    100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])))
    
  2. Memory usage

    • Available memory ratio:
    (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100
    
  3. Disk I/O

    • Read throughput:
    rate(node_disk_read_bytes_total[5m])
    
    • Write throughput:
    rate(node_disk_written_bytes_total[5m])
    
  4. Network throughput

    • Receive and transmit:
    rate(node_network_receive_bytes_total[5m])
    rate(node_network_transmit_bytes_total[5m])
    

Aggregations and Grouping

Functions such as sum(), avg(), and max() can be used with by(...) to summarize data across different dimensions. For example, total network received per job:

sum(rate(node_network_receive_bytes_total[5m])) by (job)

Recording Rules

Recording rules allow you to persist the results of queries as new time series, improving dashboard performance and minimizing repeated heavy queries. You can define these in a rules file referenced in the Prometheus configuration.
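A recording-rules file might look like the sketch below; the rule name follows the common level:metric:operation naming convention, and the expression mirrors the per-instance CPU query shown earlier:

```yaml
# rules.yml — referenced from prometheus.yml via:
#   rule_files:
#     - rules.yml
groups:
  - name: node_recording
    interval: 30s
    rules:
      - record: instance:node_cpu_utilisation:rate5m
        expr: 100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])))
```

Dashboards can then query instance:node_cpu_utilisation:rate5m directly instead of re-evaluating the full expression on every refresh.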

Visualizing Metrics with Grafana

Add Prometheus as a data source in Grafana (see Grafana documentation: Getting Started with Prometheus).

  • Data Source URL: http://<prometheus_host>:9090.
  • Create dashboards featuring typical panels such as CPU, memory, disk, and network usage.
  • Common panel types include time series graphs, gauges, stat/single-value panels, and tables.
  • Consider importing community dashboards like “Node Exporter Full” from Grafana’s dashboard library and customizing them as needed.

Dashboard Hygiene Tips

  • Use variables (templating) for instance or job to reuse panels across hosts.
  • Limit panel queries to appropriate time ranges to reduce load.
  • Export dashboards as JSON for version control and sharing.

Alerting with Prometheus and Alertmanager

Write alerting rules in Prometheus using expressions that evaluate to true when an alert should be triggered. Example alert rules can be placed in a rules.yml file referenced by prometheus.yml.

Example: High CPU sustained (> 80% for 5 minutes)

- alert: HighCpuLoad
  expr: 100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 80
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "High CPU usage on {{ $labels.instance }}"
    runbook_url: "https://your-runbooks.example.com/high-cpu"

Disk space low (< 10% free)

- alert: DiskSpaceLow
  expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 10
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Low disk space on {{ $labels.instance }}"

Node down

- alert: InstanceDown
  expr: up == 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Instance {{ $labels.instance }} is down"
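Note that Prometheus only loads alert rules that sit inside a named group in the rules file. Wrapping the InstanceDown rule above, a minimal rules.yml might look like this (the group name is arbitrary):

```yaml
groups:
  - name: availability
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
```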

Best Practices for Alerting

  • Alert only on actionable, human-readable conditions.
  • Use for: to avoid alert flapping from transient spikes.
  • Add runbook_url and helpful annotations for responders.

Scaling Prometheus & Best Practices for Production

While Prometheus performs well in small to medium environments, it can face challenges with long retention or high cardinality. Consider the following options for scaling:

  • Federation: Scraping metrics from downstream Prometheus instances in multi-cluster or hierarchical setups.
  • Remote Write/Read: Utilize long-term storage solutions like Thanos, Cortex, or Mimir for scalable storage and querying.

Security

Implement security measures for Prometheus and metric endpoints, such as TLS, network segmentation, and authentication. Avoid exposing /metrics endpoints publicly.

Observability Hygiene

  • Maintain awareness of label cardinality; avoid high cardinality labels (e.g., user IDs) to prevent overwhelming series counts.
  • Apply relabel_configs to drop or transform labels during scraping.
  • Use consistent naming conventions for metrics and labels to streamline downstream analysis.
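As one sketch of that relabeling, the scrape config below drops a hypothetical high-cardinality session_id label from scraped samples; metric_relabel_configs is applied after the scrape, before samples are stored:

```yaml
scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['node_exporter:9100']
    metric_relabel_configs:
      - action: labeldrop   # drop any label whose name matches the regex
        regex: 'session_id'
```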

Common Pitfalls and How to Troubleshoot Them

  • High Cardinality: Watch for labels like instance_id or container_id that multiply the series count; trim or drop such labels where possible.
  • Raw Counters: Always apply rate() (or increase()) to counters; graphed raw, they are just large, ever-increasing numbers.
  • Missing Metrics: Check the Prometheus /targets page and the exporter /metrics endpoint. Misconfigurations in firewalls or container networks often lead to this issue (see our Container Networking guide).
  • Slow Queries: Utilize recording rules to persist heavy computations and avoid complex range queries on dashboards.
  • Storage Issues: Monitor TSDB size and check Prometheus logs for compaction errors; consider remote storage if long retention is required.

10-Minute Runbook: From Zero to a Useful Dashboard

This checklist will guide you in setting up a basic monitoring system swiftly:

  1. Install Docker and Docker Compose (or run locally with binaries).
  2. Create a docker-compose.yml that includes Prometheus, node_exporter, and Grafana.
  3. Construct a prometheus.yml with scrape_configs for Prometheus and node_exporter.
  4. Start the stack using docker-compose up -d.
  5. Confirm the node exporter status by running curl http://localhost:9100/metrics.
  6. Open Prometheus at http://localhost:9090 and check /targets.
  7. Access Grafana at http://localhost:3000 (admin/admin) and add Prometheus as a data source.
  8. Import a Node Exporter dashboard from Grafana’s library.
  9. Create at least one alert rule (e.g., High CPU) and configure Alertmanager to notify you via email or Slack. Run Alertmanager via Docker and point Prometheus at it through an alerting block in prometheus.yml.
  10. Test an alert by adjusting thresholds or increasing CPU usage temporarily.

For useful commands, run:

  • docker-compose logs prometheus
  • curl http://localhost:9090/targets
  • curl http://localhost:9100/metrics | head

In case of issues, refer to /targets for scrape errors, review exporter logs, and verify network connectivity.

Conclusion & Next Steps

You now have a solid understanding of Prometheus fundamentals, including architecture, exporters, PromQL basics, Grafana dashboards, and alerting with Alertmanager. Consider the following next steps:

  • Instrument a simple application with a Prometheus client library (available in Go, Python, Java) to export custom metrics.
  • Investigate long-term storage solutions like Thanos, Cortex, or Mimir for enhanced retention and global queries.
  • If you maintain a Kubernetes environment, use kube-state-metrics and kubelet scraping for comprehensive cluster insights (see the Kubernetes metrics pipeline documentation).

Further Reading & Resources

Explore community dashboards and exporter repositories:

  • Grafana dashboard library (search for “Node Exporter Full”).
  • Official Prometheus exporter repositories on GitHub: prom/node-exporter, prom/cadvisor, prometheus/blackbox_exporter.

This guide serves as a comprehensive starting point for anyone looking to implement effective infrastructure monitoring with Prometheus.

TBO Editorial

About the Author

TBO Editorial writes about the latest updates about products and services related to Technology, Business, Finance & Lifestyle. Do get in touch if you want to share any useful article with our community.