Building Production-Ready Kubernetes Clusters

Running Kubernetes in production is vastly different from running it in development. After years of managing production clusters, here's what I've found actually matters.

The Hard Truths

Let me start with something that took me years to learn: Kubernetes is not your infrastructure. It's a platform for running your infrastructure. This mindset shift is critical.

# Don't do this in production
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  containers:
    - name: app
      image: myapp:latest # ❌ Never use 'latest'
      resources: {} # ❌ No resource limits
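
For contrast, here is a minimal sketch of the same Pod with both problems addressed. The tag 1.4.2 is a made-up placeholder, and the request and limit values are illustrative rather than tuned for any real workload.

```yaml
# Same Pod, with a pinned image and explicit resources
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  containers:
    - name: app
      image: myapp:1.4.2 # hypothetical tag; pin a version or digest you have tested
      resources:
        requests:
          memory: "256Mi"
          cpu: "250m"
        limits:
          memory: "512Mi"
          cpu: "500m"
```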

Production Checklist

1. Resource Management

Always set resource requests and limits. Always.

resources:
  requests:
    memory: "256Mi"
    cpu: "250m"
  limits:
    memory: "512Mi"
    cpu: "500m"

⚠️ Without resource limits, one misbehaving pod can take down your entire node. I've seen this happen at 2 AM more times than I'd like to admit.
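
One way to catch workloads that forget to set resources is a namespace-level LimitRange, which fills in defaults for any container that omits them. This is a sketch, assuming a namespace called production; the object name and the values are placeholders, not recommendations.

```yaml
# Namespace-wide defaults for containers that omit resources
apiVersion: v1
kind: LimitRange
metadata:
  name: default-resources # hypothetical name
  namespace: production
spec:
  limits:
    - type: Container
      defaultRequest: # applied when a container sets no requests
        memory: "256Mi"
        cpu: "250m"
      default: # applied when a container sets no limits
        memory: "512Mi"
        cpu: "500m"
```

Defaults are a safety net, not a substitute for sizing each workload explicitly.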

2. High Availability

Run multiple replicas across availability zones:

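apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 3
  selector: # required; must match the template labels
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      affinity:
        podAntiAffinity:
          # Hard rule: at most one replica per zone, so 3 replicas need 3 zones
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - web-app
              topologyKey: topology.kubernetes.io/zone
      containers:
        - name: web-app
          image: myapp:1.4.2 # placeholder tag; pin your own version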
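
Spreading replicas only helps if they are not all evicted at once during a node drain or upgrade. A PodDisruptionBudget is one way to guard against that. The sketch below assumes the pods carry the label app: web-app, and the object name is made up.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-app-pdb # hypothetical name
spec:
  minAvailable: 2 # with 3 replicas, at most one pod is evicted at a time
  selector:
    matchLabels:
      app: web-app
```

A budget only covers voluntary disruptions such as drains; it does nothing when a node or zone fails outright.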

3. Health Checks

Implement proper liveness and readiness probes:

livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10

readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
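
For applications with a slow or unpredictable startup, a startup probe can replace a long fixed initialDelaySeconds. The liveness and readiness probes do not run until it succeeds. This sketch reuses the liveness endpoint above; the thresholds are illustrative assumptions.

```yaml
# Alternative to a long initialDelaySeconds on the liveness probe
startupProbe:
  httpGet:
    path: /health/live
    port: 8080
  periodSeconds: 10
  failureThreshold: 30 # up to 30 x 10s = 300s to start before the container is restarted
```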

Security Hardening

Network Policies

Deny all traffic by default, then explicitly allow:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-all
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
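
A deny-all policy that covers egress also blocks DNS lookups, so the first explicit allow rule is usually for DNS. The sketch below assumes the cluster DNS runs in kube-system and listens on port 53; setups such as a node-local DNS cache may need a different rule. The policy name is made up.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress # hypothetical name
spec:
  podSelector: {} # applies to every pod in the namespace
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```

NetworkPolicies are additive, so this rule widens what deny-all permits without replacing it.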

Pod Security Standards

Enable Pod Security Admission:

apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted
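
Once enforce is set to restricted, pods that lack the matching security settings are rejected at admission. As a rough sketch, a pod that passes the restricted profile looks something like this. The image tag is a placeholder, and the image is assumed to run as a non-root user, otherwise the container will fail to start.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app
  namespace: production
spec:
  securityContext:
    runAsNonRoot: true
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: app
      image: myapp:1.4.2 # hypothetical tag
      securityContext:
        allowPrivilegeEscalation: false
        capabilities:
          drop:
            - ALL
```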

🚨 Security is not optional. A compromised container can become a foothold for attackers. Defense in depth is your friend.

Observability

You can't fix what you can't see. Set up proper monitoring from day one.

Key Metrics to Track

Node metrics: CPU, memory, disk I/O
Pod metrics: Restart count, OOM kills
Application metrics: Request latency, error rates
Cluster metrics: API server latency, etcd performance

# Prometheus ServiceMonitor
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: app-metrics
spec:
  selector:
    matchLabels:
      app: web-app
  endpoints:
    - port: metrics
      interval: 30s
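
Collecting metrics is only half the job; the pod metrics listed above are most useful when they drive alerts. The sketch below assumes the Prometheus Operator and kube-state-metrics are installed. The rule name, thresholds, and severities are arbitrary starting points, and your Prometheus may require extra labels on the rule object before it picks it up.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: pod-health-alerts # hypothetical name
spec:
  groups:
    - name: pod-health
      rules:
        - alert: PodRestartingFrequently
          expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Container {{ $labels.container }} in pod {{ $labels.pod }} is restarting repeatedly"
        - alert: ContainerOOMKilled
          expr: kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1
          labels:
            severity: warning
          annotations:
            summary: "Container {{ $labels.container }} in pod {{ $labels.pod }} was last terminated by the OOM killer"
```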

Disaster Recovery

Backup Strategy

etcd backups: Every 6 hours minimum
PV snapshots: Depends on your workload
GitOps: Your cluster config should be in Git

# Automated etcd backup
ETCDCTL_API=3 etcdctl snapshot save \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  /backup/etcd-$(date +%Y%m%d-%H%M%S).db
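
For the GitOps item in the list above, here is one possible shape, assuming Argo CD is the tool in use (Flux would be an equally reasonable choice). The repository URL, path, and application name are placeholders.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: production-cluster-config # hypothetical name
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/cluster-config.git # placeholder repository
    targetRevision: main
    path: clusters/production # placeholder path
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true # delete resources that were removed from Git
      selfHeal: true # revert manual changes made in the cluster
```

With the desired state in Git, rebuilding a cluster becomes a matter of pointing a fresh install at the same repository.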

Test Your Backups

ℹ️ A backup you've never restored is just a file taking up space. Schedule quarterly DR drills.

Cost Optimization

Right-Sizing

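In my experience, most teams over-provision resources by 2-3x. The Vertical Pod Autoscaler (an add-on you install separately, not part of core Kubernetes) can recommend requests based on observed usage: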

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Off" # Start with recommendations only; no pods are evicted

Cluster Autoscaling

Enable cluster autoscaler for node pools:

# GKE example
gcloud container clusters update CLUSTER_NAME \
  --enable-autoscaling \
  --min-nodes 3 \
  --max-nodes 10 \
  --node-pool default-pool

Lessons from Production Incidents

Incident 1: The OOMKill Cascade

What happened: One pod hit its memory limit, got OOMKilled, restarted, and immediately hit the limit again. The loop continued until it took down the service.

Fix: Proper resource requests + limits + readiness probes

Incident 2: Certificate Expiry

What happened: TLS certificates expired on a Saturday. Cluster API server became unreachable.

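Fix: Automated certificate rotation with cert-manager + monitoring. Note that cert-manager covers certificates issued inside the cluster; on kubeadm-built clusters the control-plane certificates are renewed separately with kubeadm certs renew.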

apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: api-server-cert
spec:
  secretName: api-server-tls
  duration: 2160h # 90 days
  renewBefore: 360h # 15 days before expiry
  dnsNames:
    - api.example.com # placeholder; a Certificate needs at least one name
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
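
The monitoring half of that fix can be sketched as an alert on cert-manager's own expiry metric. This assumes the Prometheus Operator is installed and cert-manager's metrics endpoint is being scraped; the rule name is made up, and the label names in the summary may differ depending on your scrape configuration.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: certificate-expiry-alerts # hypothetical name
spec:
  groups:
    - name: certificates
      rules:
        - alert: CertificateExpiringSoon
          # Fires when a cert-manager Certificate has less than 14 days left
          expr: certmanager_certificate_expiration_timestamp_seconds - time() < 14 * 24 * 3600
          for: 1h
          labels:
            severity: critical
          annotations:
            summary: "Certificate {{ $labels.name }} in {{ $labels.namespace }} expires in under 14 days"
```

The 14-day threshold sits just inside the 15-day renewBefore window, so the alert should only fire when automatic renewal has already failed.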

Incident 3: Single AZ Failure

What happened: AWS AZ went down, taking 70% of our pods with it.

Fix: Pod topology spread constraints across zones
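
A minimal sketch of such a constraint, assuming the pods are labeled app: web-app. It belongs in the pod template of the Deployment, under spec.template.spec.

```yaml
# Fragment of a Deployment's pod template (spec.template.spec)
topologySpreadConstraints:
  - maxSkew: 1 # zones may differ by at most one matching pod
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: web-app
```

Setting whenUnsatisfiable to ScheduleAnyway turns this into a preference, which trades strict balance for the ability to keep scheduling while a zone is down.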

The Reality Check

Here's what nobody tells you about running Kubernetes:

It's complex - Embrace it or use a managed service
It requires dedicated expertise - Budget for it
It's not cheaper - It's more flexible, not necessarily cheaper
Logging is painful - Plan for it early
Upgrades are stressful - Test thoroughly in staging

Key Takeaways

✅ Always set resource limits
✅ Implement health checks properly
✅ Secure your cluster from day one
✅ Monitor everything
✅ Test your backups regularly
✅ Plan for node failures
✅ Automate certificate management
✅ Use GitOps for cluster management

Final Thoughts

Kubernetes is incredibly powerful, but it's not magic. It requires careful planning, ongoing maintenance, and a solid understanding of distributed systems.

Start simple. Add complexity only when needed. And remember: the best Kubernetes cluster is the one you don't have to think about.

Questions? Reach out on Twitter or email.
