Production Readiness Checklist

A checklist for reviewing Kubernetes workloads before promoting to production.

Workload Configuration

Pods & Containers

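# Example security context (container level)
securityContext:
  runAsNonRoot: true
  runAsUser: 1000
  readOnlyRootFilesystem: true
  allowPrivilegeEscalation: false
  seccompProfile:
    type: RuntimeDefault   # the restricted Pod Security Standard requires RuntimeDefault or Localhost
  capabilities:
    drop: ["ALL"]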

Deployments & StatefulSets

Replica count — Use at least 2 replicas for high availability.
Pod Disruption Budget (PDB) — Define a PDB to limit disruption during node maintenance.
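Update strategy — For Deployments, use RollingUpdate with maxUnavailable: 0 and a non-zero maxSurge for zero-downtime deploys; the API rejects a Deployment where both are 0.
Rollback plan — Verify revisionHistoryLimit is not set to 0 (the Deployment default is 10) so that previous revisions remain available for rollback.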
Pod anti-affinity — Spread replicas across nodes and/or availability zones.

# Example PDB
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 1      # or maxUnavailable: 1
  selector:
    matchLabels:
      app: my-app
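
The replica, update strategy, rollback, and anti-affinity items above can be combined in one Deployment. The following is a minimal sketch, not a definitive manifest: the image reference, the weight, and the maxSurge value are assumptions chosen for illustration, and the app: my-app label is reused from the PDB example.

```yaml
# Sketch only: image, weight, and maxSurge are illustrative assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 2                    # at least 2 for availability
  revisionHistoryLimit: 10       # keep old ReplicaSets so rollback is possible
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0          # never go below the desired replica count
      maxSurge: 1                # must be non-zero when maxUnavailable is 0
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      affinity:
        podAntiAffinity:
          # "preferred" keeps pods schedulable on small clusters;
          # use "required" only if spreading must be strict
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                topologyKey: kubernetes.io/hostname
                labelSelector:
                  matchLabels:
                    app: my-app
      containers:
        - name: app
          image: registry.example.com/my-app:1.0.0   # hypothetical image
```

To spread across availability zones instead of nodes, the topologyKey can be changed to topology.kubernetes.io/zone. Because maxUnavailable is 0, the cluster needs spare capacity for one extra pod during each rollout.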

Resource Management

Namespace resource quotas — Define ResourceQuota to cap namespace resource consumption.
LimitRange — Set default requests/limits for pods that don't specify them.
Cluster capacity — Confirm nodes have sufficient headroom for rolling deploys and autoscaling.
QoS class — Aim for Guaranteed (requests equal to limits for every container) or Burstable QoS; avoid BestEffort for critical workloads.

# Example ResourceQuota
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: my-app
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
    pods: "20"
    services: "10"
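
A LimitRange pairs naturally with the quota above: once a ResourceQuota constrains requests and limits, pods that omit them are rejected unless a LimitRange supplies defaults. A minimal sketch follows; the object name and all CPU and memory values are assumptions to be tuned per workload.

```yaml
# Sketch only: name and default values are illustrative assumptions.
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: my-app
spec:
  limits:
    - type: Container
      defaultRequest:            # applied when a container sets no requests
        cpu: 100m
        memory: 128Mi
      default:                   # applied when a container sets no limits
        cpu: 500m
        memory: 256Mi
```

Containers that receive only these defaults land in the Burstable QoS class, since their requests are lower than their limits.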

Networking

Services — Expose only what needs to be exposed; prefer ClusterIP internally.
Ingress — Use Ingress with TLS termination for HTTP(S) workloads.
TLS certificates — Automate issuance and renewal (e.g., with cert-manager) using certificates from a trusted CA; do not use self-signed certs in production.
Network Policies — Restrict pod-to-pod communication; apply least-privilege network rules.
DNS — Validate service discovery via cluster DNS.
External load balancer health checks — Confirm health check paths are correctly configured.

# Example NetworkPolicy — deny all ingress, allow only from same namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: my-app
spec:
  podSelector: {}
  policyTypes:
    - Ingress
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-same-namespace
  namespace: my-app
spec:
  podSelector: {}
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector: {}
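
For the Ingress and TLS items, a sketch of an Ingress whose certificate is requested through cert-manager might look like the following. The hostname, the ingress class, the ClusterIssuer name, and the backing Service name and port are all assumptions; it also assumes cert-manager and an ingress controller are already installed.

```yaml
# Sketch only: host, ingressClassName, issuer name, and Service are assumptions.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app
  namespace: my-app
  annotations:
    # Asks cert-manager to issue the certificate for the tls section below
    cert-manager.io/cluster-issuer: letsencrypt-prod   # hypothetical ClusterIssuer
spec:
  ingressClassName: nginx        # assumption: substitute your controller's class
  tls:
    - hosts:
        - my-app.example.com
      secretName: my-app-tls     # cert-manager writes the key pair here
  rules:
    - host: my-app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-app     # ClusterIP Service in the same namespace
                port:
                  number: 80
```

Note that with the default-deny policy above in place, the ingress controller's namespace also needs an explicit allow rule, since it usually runs outside the application namespace.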

Configuration & Secrets

No secrets in image or source code — All secrets come from Kubernetes Secrets or an external secrets manager.
Secrets encrypted at rest — Enable encryption at rest for the etcd datastore.
Environment variables vs volume mounts — Prefer mounting secrets as volumes over environment variables for sensitive values.
ConfigMaps for non-sensitive config — Separate configuration from application code.
External Secrets Operator — Consider using ESO for syncing secrets from Vault, AWS Secrets Manager, etc.
Secret rotation plan — Define how secrets are rotated and how pods pick up the new values.
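
As an illustration of the volume-mount preference, the sketch below mounts a Secret as read-only files. The Secret name, mount path, and image are hypothetical. Mounted Secrets are refreshed by the kubelet after the Secret changes (unless mounted with subPath), whereas values injected as environment variables only change when the pod restarts, which matters for the rotation plan.

```yaml
# Sketch only: Secret name, mount path, and image are illustrative assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: my-app
  namespace: my-app
spec:
  containers:
    - name: app
      image: registry.example.com/my-app:1.0.0   # hypothetical image
      volumeMounts:
        - name: db-credentials
          mountPath: /etc/secrets/db             # each Secret key becomes a file here
          readOnly: true
  volumes:
    - name: db-credentials
      secret:
        secretName: db-credentials               # hypothetical Secret
        defaultMode: 0400                        # owner read-only
```

The application still has to re-read the files to pick up a rotated value; otherwise a rolling restart is needed after rotation.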

Storage

Persistent Volume Claims — Ensure PVCs are bound and using the appropriate StorageClass.
StorageClass — Confirm the StorageClass reclaim policy (Retain vs Delete) is correct.
Volume backups — Enable backups for stateful data (e.g., Velero).
StatefulSet pod identity — Verify that StatefulSet pods use stable network identifiers and persistent volumes.
ReadWriteMany volumes — Only use RWX access modes if your storage backend supports them.
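
The StorageClass and PVC items can be sketched as follows. The provisioner shown is only an example of a CSI driver and must be replaced with the one your cluster uses; the names and the requested size are likewise assumptions.

```yaml
# Sketch only: provisioner, names, and size are illustrative assumptions.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: retained-ssd
provisioner: ebs.csi.aws.com           # assumption: substitute your CSI driver
reclaimPolicy: Retain                  # keep the volume when the PVC is deleted
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-app-data
  namespace: my-app
spec:
  storageClassName: retained-ssd
  accessModes:
    - ReadWriteOnce                    # use ReadWriteMany only if the backend supports it
  resources:
    requests:
      storage: 10Gi
```

With WaitForFirstConsumer the claim stays Pending until a pod that uses it is scheduled, so a Pending PVC is not necessarily a fault before the workload is deployed.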

Observability

Structured logging — Application logs are in JSON or another structured format.
Log aggregation — Logs are collected by a tool such as Loki, Fluentd, or Fluent Bit.
Metrics exposed — Application exposes a /metrics endpoint compatible with Prometheus.
ServiceMonitor or PodMonitor — Prometheus scrape targets are configured.
Dashboards — Grafana dashboards exist for key application metrics.
Alerting rules — Alerts are defined for SLO breaches, error rates, and resource saturation.
Tracing — Distributed tracing is configured (e.g., OpenTelemetry, Jaeger) for critical paths.
Runbooks — Each alert links to a runbook explaining how to investigate and resolve.
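
The scrape target, alerting, and runbook items might be wired together as in the sketch below. It assumes the Prometheus Operator CRDs are installed. The port name, the http_requests_total metric and its labels, the 5% threshold, and the runbook URL are all hypothetical and depend on how the application is instrumented.

```yaml
# Sketch only: requires Prometheus Operator; metric, threshold, and URL are assumptions.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  namespace: my-app
spec:
  selector:
    matchLabels:
      app: my-app                # matches labels on the Service, not the pods
  endpoints:
    - port: http-metrics         # hypothetical named port on the Service
      path: /metrics
      interval: 30s
---
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-app-alerts
  namespace: my-app
spec:
  groups:
    - name: my-app.rules
      rules:
        - alert: MyAppHighErrorRate
          # Ratio of 5xx responses to all responses over 5 minutes
          expr: |
            sum(rate(http_requests_total{job="my-app",status=~"5.."}[5m]))
              / sum(rate(http_requests_total{job="my-app"}[5m])) > 0.05
          for: 10m
          labels:
            severity: page
          annotations:
            summary: "my-app 5xx ratio above 5% for 10 minutes"
            runbook_url: https://runbooks.example.com/my-app/high-error-rate
```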

Reliability & Autoscaling

Horizontal Pod Autoscaler (HPA) — Configured for stateless workloads with appropriate min/max replicas.
Vertical Pod Autoscaler (VPA) — Considered for right-sizing resource requests.
Cluster Autoscaler — Enabled if running on a cloud provider with node pools.
Graceful shutdown — Application handles SIGTERM and finishes in-flight requests within terminationGracePeriodSeconds.
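PreStop hook — Add a short sleep PreStop hook so that endpoint removal can propagate to load balancers and proxies before the container receives SIGTERM; the hook's duration counts against terminationGracePeriodSeconds.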
Chaos testing — Run chaos experiments (e.g., kill pods, drain nodes) to validate resilience.

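# Example graceful shutdown (pod spec fragment)
terminationGracePeriodSeconds: 30   # must cover the preStop sleep plus request draining
containers:
  - name: app
    lifecycle:
      preStop:
        exec:
          # Requires a shell in the image; delays SIGTERM while endpoints are removed
          command: ["/bin/sh", "-c", "sleep 5"]
    terminationMessagePolicy: FallbackToLogsOnError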
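
For the HPA item, a minimal sketch using the autoscaling/v2 API is shown below. The target Deployment name, the replica bounds, and the 70% CPU target are assumptions to be adjusted per workload.

```yaml
# Sketch only: target name, replica bounds, and utilization target are assumptions.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
  namespace: my-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2                 # keep the high-availability floor
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70 # percent of the containers' CPU requests
```

Utilization is measured relative to CPU requests, so the target pods must declare them, and the cluster needs a metrics source such as metrics-server. When an HPA manages the workload, leave spec.replicas out of the Deployment manifest so that re-applying it does not reset the replica count.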

Security

RBAC — Least-privilege roles and service accounts for all components.
ServiceAccount token auto-mount — Set automountServiceAccountToken: false unless the pod needs it.
Pod Security Standards — Apply the appropriate Pod Security Standard (baseline or restricted) to namespaces.
Image scanning — Container images are scanned for vulnerabilities before deployment.
Image signing — Images are signed and verified (e.g., Cosign, Notary).
Admission control — Policies enforced via OPA/Gatekeeper or Kyverno.
Audit logging — Kubernetes API server audit logs are enabled and collected.
etcd encryption — Sensitive Secret data is encrypted at rest in etcd.
Node hardening — Worker nodes have CIS Kubernetes Benchmark controls applied.
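
Two of the items above, Pod Security Standards and token auto-mount, are plain configuration and can be sketched as follows. This assumes the built-in Pod Security Admission controller is enabled. The namespace and ServiceAccount names are illustrative, and restricted may need to be relaxed to baseline for workloads that cannot meet it.

```yaml
# Sketch only: names and the chosen level are illustrative assumptions.
apiVersion: v1
kind: Namespace
metadata:
  name: my-app
  labels:
    pod-security.kubernetes.io/enforce: restricted   # reject non-compliant pods
    pod-security.kubernetes.io/warn: restricted      # warn on kubectl apply
    pod-security.kubernetes.io/audit: restricted     # annotate audit log entries
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: my-app
  namespace: my-app
automountServiceAccountToken: false   # pods get no API token unless they opt in
```

A pod that does need API access can override the ServiceAccount setting with automountServiceAccountToken: true in its own spec, paired with a narrowly scoped Role.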

CI/CD & Deployment

GitOps / IaC — All manifests are stored in version control; no manual kubectl apply.
Environment promotion — Changes flow through dev → staging → production with approvals.
Automated testing — Integration and smoke tests run against staging before production promotion.
Rollback procedure — Rollback steps are documented and tested.
Deployment notifications — Team is notified on successful and failed deployments.
Change freeze windows — Deployment schedule respects maintenance windows.

Cluster Operations

Kubernetes version — Running a supported Kubernetes version; upgrade path is planned.
Control plane HA — Multiple control plane nodes in production clusters.
etcd backup — Automated, regularly tested etcd backups are in place.
Node OS updates — Node OS patches are applied regularly via rolling node replacement.
Namespace hygiene — Unused namespaces and resources are cleaned up.
Cost visibility — Namespace or team cost allocation is visible (e.g., Kubecost).
Disaster recovery plan — DR procedure is documented and tested at least annually.

Final Review

Category	Status	Owner	Notes
Workload configuration
Resource management
Networking
Configuration & secrets
Storage
Observability
Reliability & autoscaling
Security
CI/CD & deployment
Cluster operations