Building Production-Ready Kubernetes Clusters: A Complete Guide
Learn how to set up, secure, and operate Kubernetes clusters in production with battle-tested best practices from the trenches.
Running Kubernetes in production is vastly different from running it in development. After years of managing production clusters, here's what I've found actually matters.
The Hard Truths
Let me start with something that took me years to learn: Kubernetes is not your infrastructure. It's a platform for running your infrastructure. This mindset shift is critical.

    # Don't do this in production
    apiVersion: v1
    kind: Pod
    metadata:
      name: my-app
    spec:
      containers:
      - name: app
        image: myapp:latest  # ❌ Never use 'latest'
        resources: {}        # ❌ No resource limits

Production Checklist
1. Resource Management
Always set resource requests and limits. Always.

    resources:
      requests:
        memory: "256Mi"
        cpu: "250m"
      limits:
        memory: "512Mi"
        cpu: "500m"

Without resource limits, one misbehaving pod can take down your entire node. I've seen this happen at 2 AM more times than I'd like to admit.
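One way to back this up at the namespace level is a LimitRange, which fills in defaults for any container that omits its own values. A minimal sketch, assuming a namespace named production; the object name is hypothetical and the numbers simply reuse the values above:

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-container-limits   # hypothetical name
  namespace: production            # assumed namespace
spec:
  limits:
  - type: Container
    defaultRequest:    # applied when a container sets no requests
      memory: "256Mi"
      cpu: "250m"
    default:           # applied when a container sets no limits
      memory: "512Mi"
      cpu: "500m"
```

Defaults are a safety net, not a substitute for sizing each workload deliberately.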
2. High Availability
Run multiple replicas across availability zones:
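
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: web-app
    spec:
      replicas: 3
      # selector and template labels are required and must match
      selector:
        matchLabels:
          app: web-app
      template:
        metadata:
          labels:
            app: web-app
        spec:
          affinity:
            podAntiAffinity:
              # Hard rule: at most one replica per zone, so 3 replicas need 3 zones
              requiredDuringSchedulingIgnoredDuringExecution:
              - labelSelector:
                  matchExpressions:
                  - key: app
                    operator: In
                    values:
                    - web-app
                topologyKey: topology.kubernetes.io/zone
          containers:
          - name: web-app
            image: web-app:1.0.0  # placeholder image; pin a real tag or digest
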
3. Health Checks
Implement proper liveness and readiness probes:

    livenessProbe:
      httpGet:
        path: /health/live
        port: 8080
      initialDelaySeconds: 30
      periodSeconds: 10

    readinessProbe:
      httpGet:
        path: /health/ready
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 5

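For slow-starting apps, a startupProbe can stand in for a long initialDelaySeconds: liveness and readiness checks are held off until it succeeds. A sketch, assuming the same /health/live endpoint and port as above; the thresholds are illustrative:

```yaml
startupProbe:
  httpGet:
    path: /health/live
    port: 8080
  periodSeconds: 5
  failureThreshold: 30   # allows up to 30 * 5s = 150s for startup
```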
Security Hardening
Network Policies
Deny all traffic by default, then explicitly allow:

    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: deny-all
    spec:
      podSelector: {}
      policyTypes:
      - Ingress
      - Egress

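With the default deny in place, each allowed flow needs its own policy. A sketch of two such policies, assuming pods labeled app: web-app that serve on port 8080 and clients labeled app: frontend (the frontend label and both policy names are hypothetical). The second policy restores DNS lookups, which the egress deny would otherwise block:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-web-app
spec:
  podSelector:
    matchLabels:
      app: web-app
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend
    ports:
    - protocol: TCP
      port: 8080
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
spec:
  podSelector: {}
  policyTypes:
  - Egress
  egress:
  - to:
    - namespaceSelector: {}   # any namespace; narrow this to where your DNS runs
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53
```

Because the deny-all policy blocks egress too, the frontend pods would also need a matching egress rule toward web-app.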
Pod Security Standards
Enable Pod Security Admission:

    apiVersion: v1
    kind: Namespace
    metadata:
      name: production
      labels:
        pod-security.kubernetes.io/enforce: restricted
        pod-security.kubernetes.io/audit: restricted
        pod-security.kubernetes.io/warn: restricted

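Under the restricted profile, pods that don't declare a hardened security context are rejected at admission. A sketch of the settings the profile checks for; the pod name and image are placeholders, and the image itself must be built to run as a non-root user:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app
  namespace: production
spec:
  securityContext:
    runAsNonRoot: true          # image must not run as UID 0
    seccompProfile:
      type: RuntimeDefault
  containers:
  - name: app
    image: myapp:1.0.0          # placeholder tag; pin a real version
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop: ["ALL"]
```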
Security is not optional. A compromised container can become a foothold for attackers. Defense in depth is your friend.
Observability
You can't fix what you can't see. Set up proper monitoring from day one.
Key Metrics to Track
- Node metrics: CPU, memory, disk I/O
- Pod metrics: Restart count, OOM kills
- Application metrics: Request latency, error rates
- Cluster metrics: API server latency, etcd performance

    # Prometheus ServiceMonitor
    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      name: app-metrics
    spec:
      selector:
        matchLabels:
          app: web-app
      endpoints:
      - port: metrics
        interval: 30s

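Metrics only help if something is watching them. A sketch of an alert on the restart count from the list above, assuming the Prometheus Operator and kube-state-metrics are installed (the latter exposes kube_pod_container_status_restarts_total); the rule name, threshold, and severity label are illustrative:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: pod-restart-alerts   # hypothetical name
spec:
  groups:
  - name: pod-restarts
    rules:
    - alert: PodRestartingFrequently
      # More than 3 restarts of one container within 15 minutes
      expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Container {{ $labels.container }} in pod {{ $labels.namespace }}/{{ $labels.pod }} is restarting frequently"
```

Depending on how your Prometheus resource is configured, the rule may need extra labels to match its ruleSelector.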
Disaster Recovery
Backup Strategy
- etcd backups: At least every 6 hours
- PV snapshots: Frequency depends on your workload
- GitOps: Your cluster config should be in Git

    # Automated etcd backup
    ETCDCTL_API=3 etcdctl snapshot save \
      --endpoints=https://127.0.0.1:2379 \
      --cacert=/etc/kubernetes/pki/etcd/ca.crt \
      --cert=/etc/kubernetes/pki/etcd/server.crt \
      --key=/etc/kubernetes/pki/etcd/server.key \
      /backup/etcd-$(date +%Y%m%d-%H%M%S).db

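The same command can be scheduled from inside the cluster. A rough sketch of a CronJob that runs it every 6 hours on a control-plane node, assuming a kubeadm-style layout (certificates under /etc/kubernetes/pki/etcd, etcd listening on 127.0.0.1:2379). The image name and the /var/backups/etcd host path are hypothetical, and the image must contain both /bin/sh and etcdctl:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-backup
  namespace: kube-system
spec:
  schedule: "0 */6 * * *"          # every 6 hours
  concurrencyPolicy: Forbid        # never run two backups at once
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          hostNetwork: true        # reach etcd on 127.0.0.1:2379
          nodeSelector:
            node-role.kubernetes.io/control-plane: ""
          tolerations:
          - key: node-role.kubernetes.io/control-plane
            operator: Exists
            effect: NoSchedule
          containers:
          - name: etcd-backup
            image: registry.example.com/tools/etcdctl:3.5   # hypothetical image
            env:
            - name: ETCDCTL_API
              value: "3"
            command:
            - /bin/sh
            - -c
            - >-
              etcdctl snapshot save
              --endpoints=https://127.0.0.1:2379
              --cacert=/etc/kubernetes/pki/etcd/ca.crt
              --cert=/etc/kubernetes/pki/etcd/server.crt
              --key=/etc/kubernetes/pki/etcd/server.key
              /backup/etcd-$(date +%Y%m%d-%H%M%S).db
            volumeMounts:
            - name: etcd-certs
              mountPath: /etc/kubernetes/pki/etcd
              readOnly: true
            - name: backup
              mountPath: /backup
          volumes:
          - name: etcd-certs
            hostPath:
              path: /etc/kubernetes/pki/etcd
          - name: backup
            hostPath:
              path: /var/backups/etcd   # hypothetical host directory
              type: DirectoryOrCreate
```

A snapshot that lives only on the control-plane node won't survive losing that node, so ship the files to off-node storage as well.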
Test Your Backups
A backup you've never restored is just a file taking up space. Schedule quarterly DR drills.
Cost Optimization
Right-Sizing
Most teams over-provision resources by 2-3x. Use Vertical Pod Autoscaler to get recommendations:

    apiVersion: autoscaling.k8s.io/v1
    kind: VerticalPodAutoscaler
    metadata:
      name: my-app-vpa
    spec:
      targetRef:
        apiVersion: "apps/v1"
        kind: Deployment
        name: my-app
      updatePolicy:
        updateMode: "Off"  # Start with recommendations only; VPA will not evict pods

Cluster Autoscaling
Enable cluster autoscaler for node pools:

    # GKE example
    gcloud container clusters update CLUSTER_NAME \
      --enable-autoscaling \
      --min-nodes 3 \
      --max-nodes 10 \
      --node-pool default-pool

Lessons from Production Incidents
Incident 1: The OOMKill Cascade
What happened: One pod hit its memory limit, got OOMKilled, restarted, and immediately hit the limit again. The loop continued until it took down the service.
Fix: Right-sized resource requests and limits, plus readiness probes so a restarting pod gets no traffic until it is ready
Incident 2: Certificate Expiry
What happened: TLS certificates expired on a Saturday, and the cluster's API server became unreachable.
Fix: Automated certificate rotation with cert-manager + monitoring

    apiVersion: cert-manager.io/v1
    kind: Certificate
    metadata:
      name: api-server-cert
    spec:
      secretName: api-server-tls
      duration: 2160h    # 90 days
      renewBefore: 360h  # 15 days before expiry
      dnsNames:
      - api.example.com  # placeholder hostname; cert-manager needs at least one name
      issuerRef:
        name: letsencrypt-prod
        kind: ClusterIssuer

Incident 3: Single AZ Failure
What happened: AWS AZ went down, taking 70% of our pods with it.
Fix: Pod topology spread constraints across zones
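A sketch of what that could look like in the pod template of the web-app Deployment shown earlier; maxSkew: 1 keeps the per-zone replica counts within one of each other:

```yaml
spec:
  template:
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule   # use ScheduleAnyway for a soft preference
        labelSelector:
          matchLabels:
            app: web-app
```

Unlike a required anti-affinity rule, this still lets several replicas share a zone as long as the spread stays even.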
The Reality Check
Here's what nobody tells you about running Kubernetes:
- It's complex - Embrace it or use a managed service
- It requires dedicated expertise - Budget for it
- It's not necessarily cheaper - What you gain is flexibility, not savings
- Logging is painful - Plan for it early
- Upgrades are stressful - Test thoroughly in staging
Key Takeaways
✅ Always set resource limits
✅ Implement health checks properly
✅ Secure your cluster from day one
✅ Monitor everything
✅ Test your backups regularly
✅ Plan for node failures
✅ Automate certificate management
✅ Use GitOps for cluster management
Final Thoughts
Kubernetes is incredibly powerful, but it's not magic. It requires careful planning, ongoing maintenance, and a solid understanding of distributed systems.
Start simple. Add complexity only when needed. And remember: the best Kubernetes cluster is the one you don't have to think about.