Production Readiness Checklist¶
A checklist for reviewing Kubernetes workloads before promoting to production.
Workload Configuration¶
Pods & Containers¶
- Image tags — Use specific, immutable tags (e.g.,
v1.2.3or digestsha256:...). Never uselatestin production. - Resource requests and limits — Set CPU and memory requests/limits on every container.
- Liveness probe — Configured to restart unhealthy containers automatically.
- Readiness probe — Configured to prevent traffic from reaching unready pods.
- Startup probe — Added for slow-starting applications to avoid premature liveness failures.
- Security context — Run containers as non-root (
runAsNonRoot: true,runAsUser: <non-zero>). - Read-only root filesystem — Set
readOnlyRootFilesystem: truewhere possible. - Privilege escalation disabled — Set
allowPrivilegeEscalation: false. - Drop capabilities — Drop all capabilities and add only what is required.
- No privileged containers — Avoid
privileged: true.
# Example security context
securityContext:
runAsNonRoot: true
runAsUser: 1000
readOnlyRootFilesystem: true
allowPrivilegeEscalation: false
capabilities:
drop: ["ALL"]
Deployments & StatefulSets¶
- Replica count — Use at least 2 replicas for high availability.
- Pod Disruption Budget (PDB) — Define a PDB to limit disruption during node maintenance.
- Update strategy — Use
RollingUpdatewithmaxUnavailable: 0for zero-downtime deploys. - Rollback plan — Verify
revisionHistoryLimitis set (default 10) for rollback capability. - Pod anti-affinity — Spread replicas across nodes and/or availability zones.
# Example PDB
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: my-app-pdb
spec:
minAvailable: 1 # or maxUnavailable: 1
selector:
matchLabels:
app: my-app
Resource Management¶
- Namespace resource quotas — Define
ResourceQuotato cap namespace resource consumption. - LimitRange — Set default requests/limits for pods that don't specify them.
- Cluster capacity — Confirm nodes have sufficient headroom for rolling deploys and autoscaling.
- QoS class — Aim for
GuaranteedorBurstableQoS (avoidBestEffortfor critical workloads).
# Example ResourceQuota
apiVersion: v1
kind: ResourceQuota
metadata:
name: team-quota
namespace: my-app
spec:
hard:
requests.cpu: "4"
requests.memory: 8Gi
limits.cpu: "8"
limits.memory: 16Gi
pods: "20"
services: "10"
Networking¶
- Services — Expose only what needs to be exposed; prefer
ClusterIPinternally. - Ingress — Use Ingress with TLS termination for HTTP(S) workloads.
- TLS certificates — Use cert-manager with valid certificates; do not use self-signed certs in production.
- Network Policies — Restrict pod-to-pod communication; apply least-privilege network rules.
- DNS — Validate service discovery via cluster DNS.
- External load balancer health checks — Confirm health check paths are correctly configured.
# Example NetworkPolicy — deny all ingress, allow only from same namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: default-deny-ingress
namespace: my-app
spec:
podSelector: {}
policyTypes:
- Ingress
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: allow-same-namespace
namespace: my-app
spec:
podSelector: {}
policyTypes:
- Ingress
ingress:
- from:
- podSelector: {}
Configuration & Secrets¶
- No secrets in image or source code — All secrets come from Kubernetes Secrets or an external secrets manager.
- Secrets encrypted at rest — Enable encryption at rest for the etcd datastore.
- Environment variables vs volume mounts — Prefer mounting secrets as volumes over environment variables for sensitive values.
- ConfigMaps for non-sensitive config — Separate configuration from application code.
- External Secrets Operator — Consider using ESO for syncing secrets from Vault, AWS Secrets Manager, etc.
- Secret rotation plan — Define how secrets are rotated and how pods pick up the new values.
Storage¶
- Persistent Volume Claims — Ensure PVCs are bound and using the appropriate StorageClass.
- StorageClass — Confirm the StorageClass reclaim policy (
RetainvsDelete) is correct. - Volume backups — Enable backups for stateful data (e.g., Velero).
- StatefulSet pod identity — Verify that StatefulSet pods use stable network identifiers and persistent volumes.
- ReadWriteMany volumes — Only use
RWXaccess modes if your storage backend supports it.
Observability¶
- Structured logging — Application logs are in JSON or another structured format.
- Log aggregation — Logs are collected by a tool such as Loki, Fluentd, or Fluent Bit.
- Metrics exposed — Application exposes a
/metricsendpoint compatible with Prometheus. - ServiceMonitor or PodMonitor — Prometheus scrape targets are configured.
- Dashboards — Grafana dashboards exist for key application metrics.
- Alerting rules — Alerts are defined for SLO breaches, error rates, and resource saturation.
- Tracing — Distributed tracing is configured (e.g., OpenTelemetry, Jaeger) for critical paths.
- Runbooks — Each alert links to a runbook explaining how to investigate and resolve.
Reliability & Autoscaling¶
- Horizontal Pod Autoscaler (HPA) — Configured for stateless workloads with appropriate min/max replicas.
- Vertical Pod Autoscaler (VPA) — Considered for right-sizing resource requests.
- Cluster Autoscaler — Enabled if running on a cloud provider with node pools.
- Graceful shutdown — Application handles
SIGTERMand finishes in-flight requests withinterminationGracePeriodSeconds. - PreStop hook — Add a sleep PreStop hook if the application needs time to drain connections.
- Chaos testing — Run chaos experiments (e.g., kill pods, drain nodes) to validate resilience.
# Example graceful shutdown
containers:
- name: app
lifecycle:
preStop:
exec:
command: ["/bin/sh", "-c", "sleep 5"]
terminationMessagePolicy: FallbackToLogsOnError
Security¶
- RBAC — Least-privilege roles and service accounts for all components.
- ServiceAccount token auto-mount — Set
automountServiceAccountToken: falseunless the pod needs it. - Pod Security Standards — Apply the appropriate Pod Security Standard (
baselineorrestricted) to namespaces. - Image scanning — Container images are scanned for vulnerabilities before deployment.
- Image signing — Images are signed and verified (e.g., Cosign, Notary).
- Admission control — Policies enforced via OPA/Gatekeeper or Kyverno.
- Audit logging — Kubernetes API server audit logs are enabled and collected.
- etcd encryption — Sensitive Secret data is encrypted at rest in etcd.
- Node hardening — Worker nodes have CIS Kubernetes Benchmark controls applied.
CI/CD & Deployment¶
- GitOps / IaC — All manifests are stored in version control; no manual
kubectl apply. - Environment promotion — Changes flow through dev → staging → production with approvals.
- Automated testing — Integration and smoke tests run against staging before production promotion.
- Rollback procedure — Rollback steps are documented and tested.
- Deployment notifications — Team is notified on successful and failed deployments.
- Change freeze windows — Deployment schedule respects maintenance windows.
Cluster Operations¶
- Kubernetes version — Running a supported Kubernetes version; upgrade path is planned.
- Control plane HA — Multiple control plane nodes in production clusters.
- etcd backup — Automated, regularly tested etcd backups are in place.
- Node OS updates — Node OS patches are applied regularly via rolling node replacement.
- Namespace hygiene — Unused namespaces and resources are cleaned up.
- Cost visibility — Namespace or team cost allocation is visible (e.g., Kubecost).
- Disaster recovery plan — DR procedure is documented and tested at least annually.
Final Review¶
| Category | Status | Owner | Notes |
|---|---|---|---|
| Workload configuration | |||
| Resource management | |||
| Networking | |||
| Configuration & secrets | |||
| Storage | |||
| Observability | |||
| Reliability & autoscaling | |||
| Security | |||
| CI/CD & deployment | |||
| Cluster operations |