Monitoring and Logging¶
Duration: 45 minutes (20 minutes theory + 25 minutes lab)
Introduction¶
Observability is critical for production Kubernetes clusters. You need to monitor metrics, collect logs, and trace requests across services.
Three Pillars of Observability:
- Metrics - Numeric measurements over time (CPU, memory, request rate)
- Logs - Discrete events (errors, warnings, debug info)
- Traces - Request flow through distributed systems
Monitoring Stack¶
Popular Stack:
- Prometheus - Metrics collection and storage
- Grafana - Visualization and dashboards
- Loki - Log aggregation
- Jaeger/Tempo - Distributed tracing
Prometheus¶
Prometheus is a time-series database and monitoring system that pulls metrics from its targets over HTTP.
Architecture:
flowchart LR
Targets["Targets<br/>(exporters)"]
Prometheus["Prometheus<br/>(metrics)"]
Grafana["Grafana<br/>(dashboards)"]
Prometheus -->|scrape| Targets
Grafana -->|query| Prometheus
Key Concepts:
- Scraping - Prometheus pulls metrics from endpoints
- Targets - Services exposing /metrics endpoint
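- ServiceMonitor - Prometheus Operator custom resource that selects which Services to scrape
- AlertManager - Groups, deduplicates, and routes alerts to receivers such as Slack or PagerDuty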
Installing Prometheus Stack¶
# Add Prometheus Community Helm repo
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
# Install kube-prometheus-stack
helm install monitoring prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--create-namespace
This installs:
- Prometheus Operator
- Prometheus server
- Grafana
- AlertManager
- Node Exporter
- Kube State Metrics
Prometheus Metrics¶
Four metric types:
- Counter - Only increases (requests_total)
- Gauge - Can increase or decrease (memory_usage)
- Histogram - Counts observations in configurable buckets, plus a running sum and count (request_duration)
- Summary - Similar to a histogram, but calculates quantiles in the client instead of at query time
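The difference between the types is easier to see in code. The following is a minimal standard-library sketch, not the client_golang API: the type and method names (`Counter`, `Gauge`, `Histogram`, `Observe`) and the bucket bounds are illustrative assumptions, and the Summary type is left out.

```go
package main

import "fmt"

// Counter only ever goes up (it restarts from zero when the process restarts).
type Counter struct{ value float64 }

// Add ignores negative deltas so the counter stays monotonic.
func (c *Counter) Add(delta float64) {
	if delta > 0 {
		c.value += delta
	}
}

// Gauge holds a value that can move in either direction.
type Gauge struct{ value float64 }

// Set overwrites the current value.
func (g *Gauge) Set(v float64) { g.value = v }

// Histogram counts observations in cumulative buckets:
// counts[i] is the number of observations <= bounds[i].
type Histogram struct {
	bounds []float64 // upper bounds, ascending
	counts []uint64  // cumulative counts, one per bound
	total  uint64    // every observation, i.e. the implicit +Inf bucket
	sum    float64   // sum of all observed values
}

// NewHistogram creates a histogram with the given ascending upper bounds.
func NewHistogram(bounds ...float64) *Histogram {
	return &Histogram{bounds: bounds, counts: make([]uint64, len(bounds))}
}

// Observe records one value in every bucket whose bound is >= the value.
func (h *Histogram) Observe(v float64) {
	for i, b := range h.bounds {
		if v <= b {
			h.counts[i]++
		}
	}
	h.total++
	h.sum += v
}

func main() {
	var requests Counter
	requests.Add(1)
	requests.Add(-5) // ignored: counters never decrease

	var memory Gauge
	memory.Set(2048)
	memory.Set(1024) // gauges may go down

	latency := NewHistogram(0.1, 0.5, 1)
	for _, v := range []float64{0.05, 0.3, 0.7, 2} {
		latency.Observe(v)
	}
	fmt.Println(requests.value, memory.value, latency.counts, latency.total)
}
```

Because the buckets are cumulative, quantiles can be estimated later at query time from the bucket counts alone.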
Exposing Metrics¶
Simple Go metrics endpoint:
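import "net/http"
import "github.com/prometheus/client_golang/prometheus/promhttp"
// Inside main: serve the default registry on /metrics, then start the HTTP server.
http.Handle("/metrics", promhttp.Handler())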
Example metrics output:
# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",status="200"} 1234
http_requests_total{method="POST",status="201"} 567
# HELP memory_usage_bytes Current memory usage
# TYPE memory_usage_bytes gauge
memory_usage_bytes 1048576000
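To make the output format above less opaque, here is a rough standard-library sketch of how a sample line is assembled and served. It is not how client_golang is implemented. The `sample` type, `formatSample`, and `metricsHandler` are hypothetical names, and label escaping is simplified to Go's `%q` quoting.

```go
package main

import (
	"fmt"
	"net/http"
	"sort"
	"strings"
)

// sample is one time series value together with its labels.
type sample struct {
	name   string
	labels map[string]string
	value  float64
}

// formatSample renders one line of the text format: name{labels} value.
// Labels are sorted by name so the output is deterministic.
func formatSample(s sample) string {
	if len(s.labels) == 0 {
		return fmt.Sprintf("%s %g", s.name, s.value)
	}
	keys := make([]string, 0, len(s.labels))
	for k := range s.labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	pairs := make([]string, 0, len(keys))
	for _, k := range keys {
		pairs = append(pairs, fmt.Sprintf("%s=%q", k, s.labels[k]))
	}
	return fmt.Sprintf("%s{%s} %g", s.name, strings.Join(pairs, ","), s.value)
}

// metricsHandler writes every sample as plain text, one per line.
func metricsHandler(samples []sample) http.HandlerFunc {
	return func(w http.ResponseWriter, _ *http.Request) {
		w.Header().Set("Content-Type", "text/plain; version=0.0.4")
		for _, s := range samples {
			fmt.Fprintln(w, formatSample(s))
		}
	}
}

func main() {
	samples := []sample{
		{name: "http_requests_total", labels: map[string]string{"method": "GET", "status": "200"}, value: 1234},
		{name: "http_requests_total", labels: map[string]string{"method": "POST", "status": "201"}, value: 567},
	}
	for _, s := range samples {
		fmt.Println(formatSample(s))
	}
}
```

In a real service the handler would be registered with `http.Handle("/metrics", metricsHandler(samples))` before starting the server; the sketch only prints the lines so that it exits immediately.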
ServiceMonitor¶
Tells Prometheus what to scrape:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: myapp-metrics
  labels:
    release: monitoring
spec:
  selector:
    matchLabels:
      app: myapp
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
Corresponding Service:
apiVersion: v1
kind: Service
metadata:
  name: myapp
  labels:
    app: myapp
spec:
  selector:
    app: myapp
  ports:
    - name: http
      port: 80
    - name: metrics
      port: 9090
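The link between the two manifests is label matching plus a port name. The sketch below models that selection logic under simplifying assumptions: the `service` type and `scrapePort` function are hypothetical, and in a real cluster Prometheus scrapes the pod endpoints behind the Service rather than the Service port itself.

```go
package main

import "fmt"

// service mirrors the parts of a Service that a ServiceMonitor looks at.
type service struct {
	name   string
	labels map[string]string
	ports  map[string]int // port name -> port number
}

// matchLabels reports whether every selector pair is present in labels.
func matchLabels(selector, labels map[string]string) bool {
	for k, v := range selector {
		if got, ok := labels[k]; !ok || got != v {
			return false
		}
	}
	return true
}

// scrapePort returns the port to scrape, or false when the Service is not
// selected or has no port with the requested name.
func scrapePort(selector map[string]string, portName string, svc service) (int, bool) {
	if !matchLabels(selector, svc.labels) {
		return 0, false
	}
	port, ok := svc.ports[portName]
	return port, ok
}

func main() {
	svc := service{
		name:   "myapp",
		labels: map[string]string{"app": "myapp"},
		ports:  map[string]int{"http": 80, "metrics": 9090},
	}
	selector := map[string]string{"app": "myapp"}
	port, ok := scrapePort(selector, "metrics", svc)
	fmt.Println(svc.name, port, ok)
}
```

A mismatch in either the `app` label or the port name is a common reason for a target not showing up in Prometheus.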
PromQL Basics¶
Prometheus Query Language:
# Current value
http_requests_total
# Filter by labels
http_requests_total{method="GET"}
# Rate over time
rate(http_requests_total[5m])
# Aggregation
sum(rate(http_requests_total[5m])) by (method)
# 95th percentile latency, aggregated across series (keep the le label)
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
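To give some intuition for what `rate()` and `histogram_quantile()` compute, here is a simplified Go sketch. It is an approximation and not Prometheus's implementation: the real `rate()` also extrapolates to the window edges, and the real quantile function handles the `+Inf` bucket. The names `simpleRate`, `quantile`, `point`, and `bucket` are assumptions made for this example.

```go
package main

import "fmt"

// point is one scraped sample of a counter.
type point struct {
	ts    float64 // timestamp in seconds
	value float64
}

// simpleRate returns the per-second increase across the samples,
// treating any drop in value as a counter reset.
func simpleRate(points []point) float64 {
	if len(points) < 2 {
		return 0
	}
	increase := 0.0
	for i := 1; i < len(points); i++ {
		delta := points[i].value - points[i-1].value
		if delta < 0 {
			// Counter reset: the process restarted and counted up from zero.
			delta = points[i].value
		}
		increase += delta
	}
	elapsed := points[len(points)-1].ts - points[0].ts
	if elapsed <= 0 {
		return 0
	}
	return increase / elapsed
}

// bucket is one cumulative histogram bucket.
type bucket struct {
	upperBound float64 // the "le" label
	count      float64 // observations <= upperBound
}

// quantile estimates the q-quantile from finite buckets sorted by upper
// bound, interpolating linearly inside the bucket that holds the target rank.
func quantile(q float64, buckets []bucket) float64 {
	if len(buckets) == 0 {
		return 0
	}
	rank := q * buckets[len(buckets)-1].count
	lower, prevCount := 0.0, 0.0
	for _, b := range buckets {
		if b.count >= rank && b.count > prevCount {
			return lower + (b.upperBound-lower)*(rank-prevCount)/(b.count-prevCount)
		}
		lower, prevCount = b.upperBound, b.count
	}
	return buckets[len(buckets)-1].upperBound
}

func main() {
	counter := []point{{0, 100}, {50, 150}, {100, 25}} // reset before the last sample
	fmt.Println(simpleRate(counter))

	buckets := []bucket{{0.1, 50}, {0.5, 90}, {1, 100}}
	fmt.Println(quantile(0.95, buckets))
}
```

The interpolation step is why quantiles from histograms are estimates whose accuracy depends on how well the bucket bounds fit the data.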
Alerting¶
PrometheusRule¶
Define alert conditions:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: myapp-alerts
  namespace: monitoring
spec:
  groups:
    - name: myapp
      interval: 30s
      rules:
        - alert: HighErrorRate
          expr: |
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m]))
            > 0.05
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High error rate detected"
            description: "Error rate is {{ $value | humanizePercentage }}"
        - alert: PodCrashLooping
          expr: |
            rate(kube_pod_container_status_restarts_total[15m]) > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Pod {{ $labels.pod }} is crash looping"
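The `for` field means an alert does not fire the first time its expression is true. The sketch below models that lifecycle (inactive, pending, firing) under stated assumptions: `evaluate` is a hypothetical helper, time is counted in whole seconds, and each entry in the input is the result of one rule evaluation.

```go
package main

import "fmt"

type alertState string

const (
	inactive alertState = "inactive"
	pending  alertState = "pending"
	firing   alertState = "firing"
)

// evaluate walks through successive rule evaluations. conditionTrue[i] is the
// result of the alert expression at evaluation i, interval is the time between
// evaluations and forDuration is the rule's "for" value, both in seconds.
func evaluate(conditionTrue []bool, interval, forDuration int) []alertState {
	states := make([]alertState, 0, len(conditionTrue))
	activeFor := -1 // seconds the condition has held; -1 means not active
	for _, holds := range conditionTrue {
		if !holds {
			// Any false evaluation resets the alert completely.
			activeFor = -1
			states = append(states, inactive)
			continue
		}
		if activeFor < 0 {
			activeFor = 0
		} else {
			activeFor += interval
		}
		if activeFor >= forDuration {
			states = append(states, firing)
		} else {
			states = append(states, pending)
		}
	}
	return states
}

func main() {
	// Evaluated every 60s with "for: 2m" (shortened from the 5m used above).
	results := []bool{true, true, true, false, true}
	fmt.Println(evaluate(results, 60, 120))
}
```

A single false evaluation sends the alert back to inactive, which is what filters out short spikes.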
AlertManager Configuration¶
apiVersion: v1
kind: Secret
metadata:
  name: alertmanager-config
type: Opaque
stringData:
  alertmanager.yaml: |
    global:
      slack_api_url: 'https://hooks.slack.com/services/XXX'
    route:
      receiver: 'slack-notifications'
      group_by: ['alertname', 'cluster']
      group_wait: 10s
      group_interval: 5m
      repeat_interval: 12h
      routes:
        - match:
            severity: critical
          receiver: 'pagerduty'
    receivers:
      - name: 'slack-notifications'
        slack_configs:
          - channel: '#alerts'
            title: '{{ .GroupLabels.alertname }}'
            text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
      - name: 'pagerduty'
        pagerduty_configs:
          - service_key: 'YOUR_KEY'
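The routing tree in this configuration can be read as "first matching child route wins, otherwise use the default receiver". A minimal sketch of that decision follows. It assumes exact-match label routing only and ignores grouping, timing, and nested routes; `route` and `pickReceiver` are hypothetical names.

```go
package main

import "fmt"

// route is one child route: a set of required labels and a receiver name.
type route struct {
	match    map[string]string
	receiver string
}

// pickReceiver returns the receiver of the first child route whose match
// labels are all present on the alert, falling back to the default receiver.
func pickReceiver(defaultReceiver string, routes []route, alertLabels map[string]string) string {
	for _, r := range routes {
		matched := true
		for k, v := range r.match {
			if got, ok := alertLabels[k]; !ok || got != v {
				matched = false
				break
			}
		}
		if matched {
			return r.receiver
		}
	}
	return defaultReceiver
}

func main() {
	routes := []route{
		{match: map[string]string{"severity": "critical"}, receiver: "pagerduty"},
	}
	crash := map[string]string{"alertname": "PodCrashLooping", "severity": "critical"}
	errors := map[string]string{"alertname": "HighErrorRate", "severity": "warning"}
	fmt.Println(pickReceiver("slack-notifications", routes, crash))
	fmt.Println(pickReceiver("slack-notifications", routes, errors))
}
```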
Grafana Dashboards¶
Access Grafana:
# Get Grafana password
kubectl get secret -n monitoring monitoring-grafana \
-o jsonpath="{.data.admin-password}" | base64 --decode
# Port-forward to Grafana
kubectl port-forward -n monitoring svc/monitoring-grafana 3000:80
# Open http://localhost:3000
# Username: admin
# Password: (from above command)
Creating Dashboard¶
{
"dashboard": {
"title": "My Application",
"panels": [
{
"title": "Request Rate",
"targets": [
{
"expr": "sum(rate(http_requests_total[5m])) by (method)"
}
],
"type": "graph"
},
{
"title": "Error Rate",
"targets": [
{
"expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m]))"
}
],
"type": "graph"
}
]
}
}
Loki for Logs¶
Loki is like Prometheus for logs - it indexes labels, not content.
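As a rough mental model of "indexes labels, not content", the sketch below separates the two steps: choosing streams by label, then scanning the raw lines of only those streams. It is a toy illustration and not Loki's storage design; `stream`, `selectStreams`, and `filterLines` are hypothetical names, and the sample log lines are made up.

```go
package main

import (
	"fmt"
	"strings"
)

// stream is one unique label set plus the raw lines pushed to it.
type stream struct {
	labels map[string]string
	lines  []string
}

// selectStreams plays the role of the label index: it inspects labels only
// and never looks at log content.
func selectStreams(streams []stream, selector map[string]string) []stream {
	var out []stream
	for _, s := range streams {
		match := true
		for k, v := range selector {
			if got, ok := s.labels[k]; !ok || got != v {
				match = false
				break
			}
		}
		if match {
			out = append(out, s)
		}
	}
	return out
}

// filterLines scans the content of the already selected streams for a substring.
func filterLines(streams []stream, needle string) []string {
	var out []string
	for _, s := range streams {
		for _, line := range s.lines {
			if strings.Contains(line, needle) {
				out = append(out, line)
			}
		}
	}
	return out
}

func main() {
	streams := []stream{
		{labels: map[string]string{"namespace": "production", "pod": "myapp-1"}, lines: []string{"started", "error: timeout"}},
		{labels: map[string]string{"namespace": "staging", "pod": "myapp-2"}, lines: []string{"error: ignored here"}},
	}
	selected := selectStreams(streams, map[string]string{"namespace": "production"})
	fmt.Println(filterLines(selected, "error"))
}
```

The narrower the label selector, the fewer lines have to be scanned, which is why good labels matter more in this model than in a full-text index.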
Installing Loki¶
helm repo add grafana https://grafana.github.io/helm-charts
helm install loki grafana/loki-stack \
--namespace monitoring \
--set promtail.enabled=true \
--set grafana.enabled=false
LogQL Queries¶
# All logs from namespace
{namespace="production"}
# Filter by pod
{namespace="production", pod="myapp-xxx"}
# Search content
{namespace="production"} |= "error"
# Regex filter
{namespace="production"} |~ "error|exception"
# Rate of logs
rate({namespace="production"}[5m])
# Count error lines by pod
sum(count_over_time({namespace="production"} |= "error" [5m])) by (pod)
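The last query combines a line filter, a time window, and grouping. The following sketch reproduces that combination on an in-memory slice so the pieces are visible. It assumes Unix-second timestamps and a window that excludes its start and includes its end; `entry` and `countByPod` are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// entry is one log line with the metadata needed for grouping.
type entry struct {
	ts   int64 // Unix seconds
	pod  string
	line string
}

// countByPod counts, per pod, the lines inside (now-window, now] that
// contain needle.
func countByPod(entries []entry, now, window int64, needle string) map[string]int {
	counts := map[string]int{}
	for _, e := range entries {
		if e.ts <= now-window || e.ts > now {
			continue // outside the time window
		}
		if strings.Contains(e.line, needle) {
			counts[e.pod]++
		}
	}
	return counts
}

func main() {
	entries := []entry{
		{ts: 1000, pod: "myapp-1", line: "error: too old to count"},
		{ts: 1250, pod: "myapp-1", line: "error: timeout"},
		{ts: 1280, pod: "myapp-1", line: "request served"},
		{ts: 1290, pod: "myapp-2", line: "error: refused"},
	}
	// 300 seconds corresponds to the [5m] range in the query above.
	fmt.Println(countByPod(entries, 1300, 300, "error"))
}
```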
Promtail Configuration¶
Promtail ships logs to Loki:
apiVersion: v1
kind: ConfigMap
metadata:
  name: promtail-config
data:
  promtail.yaml: |
    server:
      http_listen_port: 9080
    clients:
      - url: http://loki:3100/loki/api/v1/push
    scrape_configs:
      - job_name: kubernetes-pods
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_namespace]
            target_label: namespace
          - source_labels: [__meta_kubernetes_pod_name]
            target_label: pod
          - source_labels: [__meta_kubernetes_pod_container_name]
            target_label: container
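Each relabel rule in this configuration copies a discovered `__meta_*` label to a label that is kept on the log stream. A minimal sketch of that copy step is shown below, assuming the default `replace` action with a single source label; `relabelRule` and `applyRelabel` are hypothetical names and the discovered values are invented.

```go
package main

import "fmt"

// relabelRule copies the value of one discovered label to a target label.
type relabelRule struct {
	sourceLabel string
	targetLabel string
}

// applyRelabel builds the final label set. Discovered labels that no rule
// copies are not carried over, mirroring how labels starting with "__" are
// dropped once relabeling is finished.
func applyRelabel(discovered map[string]string, rules []relabelRule) map[string]string {
	out := map[string]string{}
	for _, r := range rules {
		if v, ok := discovered[r.sourceLabel]; ok && v != "" {
			out[r.targetLabel] = v
		}
	}
	return out
}

func main() {
	discovered := map[string]string{
		"__meta_kubernetes_namespace":          "production",
		"__meta_kubernetes_pod_name":           "myapp-7d9f",
		"__meta_kubernetes_pod_container_name": "app",
		"__meta_kubernetes_pod_node_name":      "node-1", // no rule copies this one
	}
	rules := []relabelRule{
		{"__meta_kubernetes_namespace", "namespace"},
		{"__meta_kubernetes_pod_name", "pod"},
		{"__meta_kubernetes_pod_container_name", "container"},
	}
	fmt.Println(applyRelabel(discovered, rules))
}
```

The resulting `namespace`, `pod`, and `container` labels are the ones used as stream selectors in LogQL queries.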
Distributed Tracing¶
Track requests across microservices:
Jaeger Architecture:
Installing Jaeger¶
kubectl create namespace observability
kubectl apply -f https://github.com/jaegertracing/jaeger-operator/releases/download/v1.42.0/jaeger-operator.yaml -n observability
# Create Jaeger instance
kubectl apply -f - <<EOF
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: jaeger
  namespace: observability
spec:
  strategy: allInOne
  ingress:
    enabled: false
  allInOne:
    image: jaegertracing/all-in-one:latest
    options:
      log-level: debug
  storage:
    type: memory
    options:
      memory:
        max-traces: 100000
EOF
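Tracking a request across microservices works because every service passes a trace identifier on to the next one. The sketch below illustrates that idea using the W3C `traceparent` header layout (`version-traceid-spanid-flags`). This is an assumption for illustration: the section does not say which propagation format the services use, the flag handling is simplified to the two values `00` and `01`, and the type and function names are hypothetical. A real service would normally rely on a tracing SDK instead.

```go
package main

import (
	"fmt"
	"strings"
)

// spanContext is the part of a span that travels between services.
type spanContext struct {
	traceID string // 32 hex characters, shared by every span in the trace
	spanID  string // 16 hex characters, unique per span
	sampled bool
}

// traceparent renders the context as a header value.
func (sc spanContext) traceparent() string {
	flags := "00"
	if sc.sampled {
		flags = "01"
	}
	return "00-" + sc.traceID + "-" + sc.spanID + "-" + flags
}

// parseTraceparent reads a header value produced by traceparent.
func parseTraceparent(h string) (spanContext, bool) {
	parts := strings.Split(h, "-")
	if len(parts) != 4 || parts[0] != "00" ||
		len(parts[1]) != 32 || len(parts[2]) != 16 || len(parts[3]) != 2 {
		return spanContext{}, false
	}
	return spanContext{traceID: parts[1], spanID: parts[2], sampled: parts[3] == "01"}, true
}

// childSpan starts a new span in the same trace. The incoming span ID is
// returned as the parent ID so the tracing backend can rebuild the call tree.
func childSpan(parent spanContext, newSpanID string) (spanContext, string) {
	child := spanContext{traceID: parent.traceID, spanID: newSpanID, sampled: parent.sampled}
	return child, parent.spanID
}

func main() {
	// Header received by a downstream service (example identifiers).
	incoming := "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
	parent, ok := parseTraceparent(incoming)
	if !ok {
		fmt.Println("invalid traceparent header")
		return
	}
	child, parentID := childSpan(parent, "b7ad6b7169203331")
	fmt.Println("outgoing header:", child.traceparent())
	fmt.Println("parent span:", parentID)
}
```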
Kubernetes Metrics¶
Built-in metrics:
# Node metrics
kubectl top nodes
# Pod metrics
kubectl top pods
# Get metrics from metrics-server
kubectl get --raw /apis/metrics.k8s.io/v1beta1/nodes
kubectl get --raw /apis/metrics.k8s.io/v1beta1/pods
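The raw endpoints return JSON, so the output can be processed by a small program. The sketch below decodes a node metrics list with the standard library. The sample payload is hypothetical and heavily trimmed, and the struct covers only the fields the sketch reads (`items[].metadata.name` and `items[].usage`); real responses contain more fields.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// nodeMetricsList holds the subset of the response used here.
type nodeMetricsList struct {
	Items []struct {
		Metadata struct {
			Name string `json:"name"`
		} `json:"metadata"`
		Usage struct {
			CPU    string `json:"cpu"`    // Kubernetes quantity, e.g. "250m"
			Memory string `json:"memory"` // Kubernetes quantity, e.g. "1024Mi"
		} `json:"usage"`
	} `json:"items"`
}

// parseNodeMetrics decodes the JSON returned by the nodes endpoint.
func parseNodeMetrics(data []byte) (nodeMetricsList, error) {
	var list nodeMetricsList
	err := json.Unmarshal(data, &list)
	return list, err
}

func main() {
	// Hypothetical, trimmed response body.
	sample := []byte(`{"kind":"NodeMetricsList","items":[
		{"metadata":{"name":"node-1"},"usage":{"cpu":"250m","memory":"1024Mi"}}
	]}`)
	list, err := parseNodeMetrics(sample)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, item := range list.Items {
		fmt.Printf("%s cpu=%s memory=%s\n", item.Metadata.Name, item.Usage.CPU, item.Usage.Memory)
	}
}
```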
Best Practices¶
- Label consistently - Use standard labels (app, version, env)
- Alert on symptoms - Not causes (high error rate, not high CPU)
- Use recording rules - Pre-compute expensive queries
- Set retention - Balance storage vs history (15-30 days typical)
- Monitor the monitors - Alert if Prometheus is down
- Dashboard organization - Separate operational vs business metrics
- Log levels - Use appropriate levels (ERROR, WARN, INFO, DEBUG)
- Sampling - Sample traces in high-traffic scenarios
- Cardinality - Avoid high-cardinality labels (user IDs, timestamps)
- SLOs/SLIs - Define service level objectives
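The cardinality advice is easier to appreciate with numbers. In the worst case, one metric produces a time series for every combination of label values, so the count is the product of the distinct values per label. The sketch below computes that upper bound; `seriesCount` and the value counts are illustrative assumptions.

```go
package main

import "fmt"

// seriesCount returns the worst-case number of time series for one metric:
// the product of the number of distinct values of each label.
func seriesCount(distinctValues map[string]int) int {
	total := 1
	for _, n := range distinctValues {
		total *= n
	}
	return total
}

func main() {
	bounded := map[string]int{"method": 5, "status": 10}
	fmt.Println("method x status:", seriesCount(bounded))

	// Adding a per-user label multiplies the series count by the number of users.
	withUserID := map[string]int{"method": 5, "status": 10, "user_id": 100000}
	fmt.Println("with user_id:", seriesCount(withUserID))
}
```

Identifiers such as user IDs usually belong in logs or traces, where high cardinality does not multiply the number of stored series.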
Common Metrics to Monitor¶
Infrastructure:
- Node CPU/Memory usage
- Pod restarts
- Disk usage
- Network I/O
Application:
- Request rate, errors, duration (RED method)
- Queue depth
- Database connection pool
- Cache hit rate
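As a small worked example of the RED method, the sketch below derives the three numbers from a batch of request records. It is a simplification: `request` and `red` are hypothetical names, and duration is reported as a mean, whereas dashboards typically show percentiles.

```go
package main

import "fmt"

// request is one handled HTTP request.
type request struct {
	status          int
	durationSeconds float64
}

// red returns the request rate per second, the share of requests that
// failed with a 5xx status, and the mean duration in seconds.
func red(reqs []request, windowSeconds float64) (rate, errorRatio, meanDuration float64) {
	if len(reqs) == 0 || windowSeconds <= 0 {
		return 0, 0, 0
	}
	failed := 0
	totalDuration := 0.0
	for _, r := range reqs {
		if r.status >= 500 {
			failed++
		}
		totalDuration += r.durationSeconds
	}
	n := float64(len(reqs))
	return n / windowSeconds, float64(failed) / n, totalDuration / n
}

func main() {
	reqs := []request{
		{200, 0.25}, {200, 0.25}, {500, 0.25}, {200, 0.25},
	}
	rate, errorRatio, mean := red(reqs, 2)
	fmt.Printf("rate=%.2f/s errors=%.0f%% mean=%.2fs\n", rate, errorRatio*100, mean)
}
```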
Business:
- User signups
- Transaction value
- Feature usage
Key takeaways¶
- Observability has three pillars — metrics, logs, and traces — each answering different questions about system health
- Prometheus scrapes metrics from targets on a pull-based model; Grafana visualises them
- ServiceMonitor is the Kubernetes-native way to tell Prometheus which services to scrape
- PromQL is a powerful query language for aggregating and alerting on time-series data
- Loki provides log aggregation without indexing full log content, keeping storage costs low
Check your understanding¶
- What are the three pillars of observability?
- How does Prometheus collect metrics from applications?
- What Kubernetes resource tells Prometheus which Services to scrape?
- What is the difference between a Counter and a Gauge metric type?
- What tool would you use to visualise Prometheus metrics in dashboards?
Solution
- Metrics, logs, and traces
- Prometheus uses a pull-based model: it scrapes an HTTP /metrics endpoint exposed by each target
- ServiceMonitor (a Custom Resource Definition provided by the Prometheus Operator)
- A Counter only ever increases (e.g. total requests); a Gauge can increase or decrease (e.g. current memory usage)
- Grafana
Hands-on¶
Apply the concepts from this section in the lab exercises.
Next section¶
Once you've reviewed the content and completed the lab, proceed to the next section.