Kubernetes Cluster Monitoring: A Beginner’s Practical Guide


Monitoring in a Kubernetes environment plays a crucial role in managing and maintaining the health, performance, and behavior of your clusters and applications. By continuously collecting and visualizing data, you can swiftly detect issues and ensure that your deployments meet service-level agreements (SLAs). This guide is tailored for developers, site reliability engineers (SREs), and operators running small to medium Kubernetes clusters who want to learn effective monitoring strategies. In this article, you’ll find essential concepts, a beginner-friendly quickstart using Prometheus and Grafana, alerting best practices, and troubleshooting tips.


Monitoring Fundamentals — Data Types and Concepts

Before installing any monitoring tools, it’s essential to grasp the key data types and concepts used in modern observability:

  • Metrics: Numeric time-series data (e.g., CPU usage, request latency). Ideal for dashboards and triggering alerts.
  • Logs: Unstructured or structured text data like application logs, useful for detailed debugging.
  • Traces: Distributed traces illustrate request flows across services, valuable for analyzing microservices latency.
  • Events: Kubernetes events indicate state changes (e.g., CrashLoopBackOff, NodePressure).

Pull vs Push Models

  • Pull model: Prometheus scrapes HTTP endpoints on a schedule; this is the default approach in Kubernetes whenever Prometheus can reach its targets.
  • Push model: Clients push metrics to an intermediary (Pushgateway or metrics gateway), useful for short-lived jobs.
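
For the pull model, a minimal Prometheus scrape configuration looks something like the sketch below; the job name, Service address, and port are hypothetical placeholders, not values from this guide:

    # Minimal pull-model scrape configuration (illustrative values only)
    scrape_configs:
      - job_name: "my-app"                          # hypothetical job name
        scrape_interval: 30s                        # how often Prometheus pulls /metrics
        static_configs:
          - targets: ["my-app.default.svc:8080"]    # assumed Service DNS name and port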

Labels and Time-Series Model

Prometheus stores metrics as time-series, identified by a metric name paired with key/value labels. While labels are powerful for slicing data (by namespace, pod, or service), excessive unique label combinations can degrade performance and increase storage costs.

Important Operational Knobs

  • Sampling frequency (scrape interval): Determines how often metrics are collected.
  • Retention: How long metrics are kept before deletion or downsampling.
  • Cardinality: Management of label dimensions to prevent series explosion.
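
As a rough illustration of how the first two knobs surface in Prometheus itself (example values only; retention is set on the server, not in the scrape configuration):

    # Global scrape settings in prometheus.yml
    global:
      scrape_interval: 30s     # sampling frequency for all jobs unless overridden per job
      scrape_timeout: 10s      # per-scrape timeout
    # Retention is controlled by a server flag, for example:
    #   --storage.tsdb.retention.time=15d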

For a deeper dive into Prometheus concepts, see the Prometheus overview.


What to Monitor in a Kubernetes Cluster

Start with the most impactful signals before expanding your monitoring strategy:

Cluster-Level Metrics

  • API server health: request latency, error rates.
  • Health of the controller-manager and scheduler.
  • etcd health and disk usage (if managing the control plane).

Node Metrics

  • CPU, memory, disk usage (capacity vs. allocation).
  • Node conditions: Ready, DiskPressure, MemoryPressure.
  • Metrics from kubelet and cAdvisor.

Workload-Level Metrics

  • Pod CPU/memory usage, container restarts, OOMKilled occurrences.
  • Replica counts and deployment rollout status.

Networking

  • Service availability and latency.
  • DNS resolution times and errors (CoreDNS metrics).
  • Container Network Interface (CNI) plugin errors or dropped packets.

Storage

  • PersistentVolumeClaim (PVC) usage.
  • I/O latency and throughput.
  • Disk saturation leading to DiskPressure.

Application-Level Metrics

  • Request latency (p95, p99), error rates, and throughput (requests per second).
  • Business-specific metrics (e.g., orders processed, messages queued).

Begin with high-impact metrics: node CPU/memory, pod restarts (kube_pod_container_status_restarts_total), API server availability, and PVC usage. Monitoring pod restarts and OOMKilled events can help identify misconfigured resource limits.
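
For instance, PVC usage and recent OOM kills can be spot-checked with queries along these lines, assuming kubelet volume stats and kube-state-metrics are being scraped:

    # PVC usage ratio per claim (kubelet volume stats)
    kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes

    # Containers whose last termination reason was an OOM kill (kube-state-metrics)
    kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1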

If your environment frequently experiences network issues, refer to our deeper dive on container networking.


Choosing a Monitoring Stack

Use the following comparison to help select a suitable monitoring stack:

| Stack | Strengths | Tradeoffs / When to Use |
| --- | --- | --- |
| Prometheus + Alertmanager + Grafana | Open-source metrics management; powerful PromQL queries; numerous community dashboards. | Excellent per-cluster metrics, but needs planning for long-term storage and high availability (HA). |
| Thanos / Cortex | Global aggregation, long-term retention, and multi-cluster scaling. | More complex to operate and incurs additional costs, but addresses scaling/HA. |
| ELK (Elastic) / OpenSearch | Full-text search and analytics for logs. | Higher storage and indexing costs; resource-intensive. |
| Loki + Grafana | Label-centric logs; tightly integrates with Prometheus. | Not a full-text indexer; optimized for cost and correlation rather than all search use cases. |
| Managed SaaS (Datadog, New Relic, Google Cloud Ops) | Quick onboarding and comprehensive APM/logs/metrics. | High costs at scale and vendor lock-in concerns. |
| Lightweight (kube-prometheus-stack via Helm) | Quick setup for beginners/labs. | Not ideal for long-term enterprise retention without backends like Thanos. |

  • For global views across clusters, leverage Thanos or Cortex to aggregate and downsample metrics.
  • For log collection, consider Fluent Bit (lightweight), Fluentd (flexible), or Filebeat for robust forwarding to Elastic/OpenSearch, Loki, or cloud logging solutions.

Managed services may reduce operational burden, making them a good choice if you prefer to outsource monitoring solutions.


Quickstart: Prometheus + Grafana at a Glance (Beginner Friendly)

The kube-prometheus-stack is a popular Helm chart that combines Prometheus Operator, Prometheus instances, Alertmanager, Grafana, node-exporter, kube-state-metrics, and sample dashboards.

Key Features

  • Prometheus Operator for managing instances and rules as Kubernetes Custom Resource Definitions (CRDs).
  • Node-exporter for collecting node-level metrics.
  • Kube-state-metrics to export Kubernetes API state (deployments, pods, PVCs).
  • Grafana with pre-built dashboards.
  • Alertmanager for routing and silencing alerts.

Installation Steps (High-Level)

  1. Install and configure Helm, if you haven’t already.

  2. Add the Prometheus community Helm repository and update:

    helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
    helm repo update
    
  3. Install kube-prometheus-stack in a monitoring namespace:

    kubectl create namespace monitoring
    helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
      --namespace monitoring
    

For detailed installation procedures and chart values, check the chart repository: kube-prometheus-stack.
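
If you want to tune the install, a small values override is usually enough to start. The keys below follow the kube-prometheus-stack chart at the time of writing, but verify them against the values file of your chart version:

    # my-values.yaml -- example overrides; check the chart's values.yaml for your version
    prometheus:
      prometheusSpec:
        retention: 15d          # how long to keep metrics
        scrapeInterval: 30s     # default scrape interval
    grafana:
      adminPassword: change-me  # use a proper secret in anything beyond a lab

Apply it by adding -f my-values.yaml to the helm install (or helm upgrade --install) command shown above.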

  • Dashboards to Enable First: Cluster Overview, Node Exporter Full, kube-state-metrics Overview, and an Overview per namespace/application.
  • Initial Alerts to Configure:
    • instance_down (Prometheus job unreachable)
    • node_high_cpu and node_high_memory
    • kube_api_server_down or high error rate
    • pod_crashlooping (CrashLoopBackOff)
    • pvc_near_capacity

Example Queries to Validate Scrapes

  • Node CPU usage (instant):

    100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
    
  • Pod restarts over time:

    increase(kube_pod_container_status_restarts_total{namespace="default"}[1h])
    

Sample Prometheus Alert Example

Set up an alert that triggers when a pod is repeatedly restarting:

    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      name: pod-restart-rules
      namespace: monitoring
      labels:
        release: kube-prometheus-stack   # needed so the chart's default rule selector picks it up; adjust if you changed the selector
    spec:
      groups:
      - name: pod.rules
        rules:
        - alert: PodFrequentRestarts
          expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is restarting frequently"
            description: "Container restart count exceeded the threshold of 3 restarts within 15 minutes"

Validating the Setup

  • Open Grafana (use port-forward or LoadBalancer) and check the dashboards.
  • Go to Prometheus targets page to confirm that node-exporter, kube-state-metrics, and kubelets are being scraped.
  • Execute the example PromQL queries in Grafana Explore or the Prometheus UI.
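
To reach the UIs via port-forwarding, commands like the following usually work; the Service names assume the release name kube-prometheus-stack used above, so confirm them with kubectl get svc -n monitoring:

    # Grafana on http://localhost:3000 (admin credentials come from the chart's Grafana values)
    kubectl port-forward svc/kube-prometheus-stack-grafana 3000:80 -n monitoring

    # Prometheus UI on http://localhost:9090
    kubectl port-forward svc/kube-prometheus-stack-prometheus 9090:9090 -n monitoring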

For details on the roles of kubelet, cAdvisor, and kube-state-metrics in collecting resource metrics, see the official Kubernetes documentation on resource usage monitoring: Resource Usage Monitoring.


Logs and Tracing — What to Collect and How

Why Collect Logs and Traces?

While metrics indicate a problem, logs and traces reveal the underlying causes.

Centralized Log Collection Options

  • Fluent Bit: Lightweight and suitable for Kubernetes and edge scenarios.
  • Fluentd: Feature-rich and customizable with numerous plugins.
  • Filebeat: Works efficiently with Elastic/OpenSearch.

Log Storage Backends

  • Elastic / OpenSearch: Enables full-text search and analytics.
  • Loki: A cost-effective, label-oriented log store that integrates tightly with Prometheus labels and Grafana.
  • Managed logging services: Preferred for reduced operational overhead.
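
As one concrete starting point, Loki and its Promtail agent can be installed from the Grafana Helm charts; chart names and defaults change over time, so treat this as a sketch and check the repository first. Fluent Bit can also ship logs to Loki if you prefer it as the forwarder.

    helm repo add grafana https://grafana.github.io/helm-charts
    helm repo update
    helm install loki grafana/loki-stack --namespace monitoring --set promtail.enabled=true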

Tracing

OpenTelemetry (OTel) serves as the vendor-neutral standard for application instrumentation. Popular backends for tracing include Jaeger and Zipkin, which visualize traces across services to identify high-latency issues.

Tip: Because Loki shares the label model used by Prometheus and Grafana, you can jump from a metric spike straight to the logs that match the same labels.


Alerting: Best Practices for Beginners

Effective alerts should be actionable, with a defined owner and next steps.

  • Classify Severity: Label alerts with critical (pager), warning (ticket), or informational (dashboard event).
  • Avoid Alert Fatigue: Fine-tune thresholds, introduce “for” delays, group related alerts, and deduplicate.
  • Provide Runbooks: Link alerts to playbooks for immediate remediation steps.
  • Silence Alerts: During maintenance windows, alerts should be temporarily disabled.
  • Test Alerts: Regularly simulate failures to confirm alerts reach the on-call team.

Severity labels can drive Alertmanager routing to different notification channels based on urgency (SMS/pager for critical issues, Slack/email for warnings), as sketched below.
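
A skeleton Alertmanager configuration for this kind of routing might look like the following; the receiver names, Slack channel, and PagerDuty key are placeholders for your own channels:

    # Severity-based routing sketch -- receivers and credentials are placeholders
    route:
      receiver: default
      routes:
        - matchers:
            - severity = "critical"
          receiver: pagerduty-oncall     # pages the on-call engineer
        - matchers:
            - severity = "warning"
          receiver: slack-warnings       # lower-urgency channel
    receivers:
      - name: default
      - name: pagerduty-oncall
        pagerduty_configs:
          - service_key: "<pagerduty-integration-key>"
      - name: slack-warnings
        slack_configs:
          - api_url: "<slack-webhook-url>"
            channel: "#alerts"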


Dashboards and Visualizations — What Good Looks Like

Create dashboards categorized by the following:

  • Cluster Overview: API server latency, etcd health, node capacity, overall resource utilization.
  • Node Dashboards: Per-node CPU, memory, and disk metrics, plus kubelet health insights.
  • Workload/Application Dashboards: Percentiles for request latency (p50/p95/p99), error rates, and throughput.
  • Networking: Monitor DNS errors, service latency, and CNI statistics.
  • Storage: PVC usage and disk I/O latency metrics.

Useful PromQL Examples and Dashboard Panels

  • p95 Request Latency:

    histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
    
  • Error Rate Ratio:

    sum(rate(http_requests_total{job="app",status=~"5.."}[5m])) /
    sum(rate(http_requests_total{job="app"}[5m]))
    

Dashboard Reuse

Use Grafana variables (namespace, pod, node) to create templated dashboards so that one dashboard can be reused across multiple contexts; an example follows below. Community dashboards are a good starting point, but customize them to suit your specific cluster and SLAs.
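
For example, a namespace variable backed by the Prometheus data source, plus a panel query that uses it, might look like this (the metric names assume kube-state-metrics and cAdvisor metrics are available):

    # Grafana variable query for a "namespace" variable (Prometheus data source)
    label_values(kube_pod_info, namespace)

    # Panel query reusing the variable to show per-pod CPU in the selected namespace
    sum(rate(container_cpu_usage_seconds_total{namespace="$namespace"}[5m])) by (pod)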


Common Troubleshooting Scenarios and How to Investigate

High Pod Restarts or CrashLoopBackOff

  • Use kubectl describe pod <pod> to check events and reasons.
  • Run kubectl logs <pod> -c <container> to examine container logs for OOM or exit messages.
  • Investigate the Prometheus metric: kube_pod_container_status_restarts_total and node memory metrics.
  • Possible Fixes: Adjust resource requests/limits, fix crashing code, or alter liveness/readiness probes.
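
Put together, a typical first pass looks like this (pod, container, and namespace names are placeholders):

    kubectl describe pod <pod> -n <namespace>                      # events, last state, restart reason
    kubectl logs <pod> -c <container> -n <namespace>               # logs from the current container
    kubectl logs <pod> -c <container> -n <namespace> --previous    # logs from the crashed instance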

Node Memory or Disk Pressure

  • Confirm node conditions using kubectl describe node <node>.
  • Evaluate metrics: node_memory_MemAvailable_bytes, node_filesystem_avail_bytes.
  • Possible Fixes: Drain the node, remove large files, increase node size, or adjust eviction thresholds.
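
In PromQL, node-exporter metrics can flag nodes that are close to trouble; the thresholds below are illustrative, not recommendations:

    # Nodes with less than 10% of memory available
    node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.10

    # Root filesystems with less than 15% of space left (adjust mountpoint to your nodes)
    node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} < 0.15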

API Server Errors and Slow Control Plane

  • Monitor apiserver metrics: apiserver_request_duration_seconds.
  • Validate etcd health and quorum (if running control plane).
  • Review kube-apiserver logs for authentication/authorization errors.

Missing Metrics or Scrape Failures

  • Access the Prometheus targets page to check for scrape errors.
  • Common issues include misconfigured RBAC, network policies blocking traffic, or label cardinality explosions.
  • Possible Fixes: Check ServiceAccount permissions, network policies, and your Prometheus ServiceMonitor/PodMonitor resources (a known-good skeleton follows this list).
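
When a ServiceMonitor is being ignored, comparing it to a minimal known-good skeleton often reveals the problem. In the sketch below, the app name, namespace, and port are placeholders, and the release label assumes the kube-prometheus-stack default selector:

    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      name: my-app
      namespace: monitoring
      labels:
        release: kube-prometheus-stack   # must match Prometheus's serviceMonitorSelector
    spec:
      selector:
        matchLabels:
          app: my-app                    # labels on the Service to scrape
      namespaceSelector:
        matchNames:
          - default                      # namespace where the Service lives
      endpoints:
        - port: http-metrics             # named port on the Service
          interval: 30s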

For a quick debugging reference, consult the Kubernetes resource usage monitoring tasks guide.


Security, Cost, and Operational Considerations

Security

  • Apply the principle of least privilege: create ServiceAccounts with only the necessary permissions for Prometheus or log forwarders.
  • Secure metrics endpoints, especially if they contain sensitive information.
  • Restrict Grafana access through authentication, role-based access, and limit dashboard sharing.

Cost Control

  • Define retention policies for metrics and logs.
  • Consider downsampling older metrics (using Thanos or Cortex) rather than retaining high-resolution data indefinitely.
  • Manage label cardinality to keep series counts at reasonable levels.

Scaling Across Clusters

  • Running a separate Prometheus instance per cluster keeps monitoring simple; add Thanos or Cortex when you need centralized long-term storage and global queries.

Next Steps and Learning Resources

Beginner Checklist (Minimal Monitoring Setup):

  1. Install kube-prometheus-stack in a test cluster.
  2. Enable node-exporter and kube-state-metrics.
  3. Import a Grafana Cluster Overview dashboard and a Node Exporter dashboard.
  4. Configure at least two actionable alerts (instance_down, PodFrequentRestarts).
  5. Add a lightweight log forwarder (Fluent Bit) to send logs to Loki or a managed logging service.

Where to Go Next:

  • Instrument your applications using Prometheus client libraries and OpenTelemetry for tracing.
  • Define service-level objectives (SLOs) and error budgets.
  • Consider a plan for long-term storage and multi-cluster aggregation as your infrastructure grows.

Additional Reading and Official Docs:

If you’re using Kubernetes locally with Windows Subsystem for Linux (WSL), refer to the WSL Configuration Guide to optimize your environment. If you’re setting up a home lab, see our guidance on the hardware requirements.



Conclusion

To enhance your Kubernetes monitoring capabilities, follow the quickstart guide, set up kube-prometheus-stack in a test environment, and implement essential dashboards and alerts. Monitoring effectively helps ensure robust and reliable application performance.

TBO Editorial

About the Author

TBO Editorial writes about the latest updates about products and services related to Technology, Business, Finance & Lifestyle. Do get in touch if you want to share any useful article with our community.