Infrastructure

Running Jenkins in Kubernetes: Why We Left EC2 Behind

Scaling Jenkins agents dynamically in Kubernetes beats static EC2 instances. Here's what worked, what broke, and how we solved Docker-in-Docker nightmares with BuildKit.

Krishna C

March 24, 2021

4 min read

Updated July 30, 2023

We moved Jenkins from EC2 instances to Kubernetes. The promise was simple: spawn agents on demand, scale to zero when idle, stop paying for idle build capacity.

The reality was messier. But worth it.

Why Kubernetes Over EC2

Dynamic Agent Scaling: EC2 requires pre-provisioned instances. You pay for capacity whether builds are running or not. Kubernetes spawns agent pods on demand and terminates them when done. Actual usage determines cost.

Scale to Zero: No builds running? Agent pods go to zero. With EC2, you keep minimum instances running "just in case." That idle cost adds up fast.

Resource Efficiency: Kubernetes schedules pods across nodes intelligently. Multiple small builds pack onto the same node. Large builds get dedicated resources. EC2 forces you to guess instance sizes upfront.

Faster Agent Provisioning: Spinning up a pod takes seconds. Launching an EC2 instance takes minutes. When builds queue, speed matters.

No Instance Management: No SSH keys. No AMI updates. No security patches. Just container images. Jenkins handles the rest.
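With the Jenkins Kubernetes plugin, all of the above reduces to declaring a pod template per pipeline. A minimal sketch, assuming the plugin is installed — the image, label, and resource values are illustrative, not our actual config:

```groovy
// Declarative pipeline: each build gets its own ephemeral agent pod.
// Image and resource values below are illustrative.
pipeline {
  agent {
    kubernetes {
      yaml '''
        apiVersion: v1
        kind: Pod
        spec:
          containers:
          - name: build
            image: maven:3.8-openjdk-11
            command: ["sleep"]
            args: ["infinity"]
            resources:
              requests:
                cpu: "1"
                memory: 2Gi
      '''
      defaultContainer 'build'
    }
  }
  stages {
    stage('Build') {
      steps {
        sh 'mvn -B package'
      }
    }
  }
}
```

The pod exists only for the duration of the build. When the pipeline finishes, Kubernetes reclaims the resources.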

The Helm Chart Challenge

The official Jenkins Helm chart is comprehensive—and complicated. Configuration sprawls across values files. Getting basic Jenkins running is easy. Getting it production-ready with proper persistence, security, and networking requires understanding Kubernetes internals.

We spent more time than expected just configuring the chart correctly. The defaults aren't terrible, but they're not production-ready either.
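For a sense of where the effort goes, a minimal production-leaning values file looks something like this. Key names follow the official chart's schema; the storage class name, sizes, and plugin pins are assumptions for illustration:

```yaml
# values.yaml for the official Jenkins chart (jenkins/jenkins).
# Storage class, sizes, and plugin versions here are illustrative.
controller:
  resources:
    requests:
      cpu: "1"
      memory: 2Gi
  installPlugins:
    - kubernetes:latest
    - workflow-aggregator:latest
persistence:
  enabled: true
  storageClass: efs-sc    # assumes an EFS-backed storage class
  size: 20Gi
```

Applied with `helm repo add jenkins https://charts.jenkins.io` followed by `helm upgrade --install jenkins jenkins/jenkins -f values.yaml`. The hard part isn't the syntax — it's knowing which of the hundreds of keys actually matter for your setup.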

The Docker-in-Docker Problem

Building Docker images inside Kubernetes pods breaks the simple case. Jenkins agents need to build images, but pods can't run Docker daemons without privileged access—a security nightmare.

Attempt 1: Docker-in-Docker (DinD)

We started with DinD. Ran a Docker daemon as a sidecar container. Agents connected to it to build images.

Why it failed: Requires privileged pods. Opens security holes. Slow—daemon startup adds overhead to every build. Caching is complicated. Doesn't play well with ephemeral pods.
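For context, the DinD sidecar pattern looks roughly like this — and the `privileged: true` flag is exactly the problem:

```yaml
# Agent pod with a Docker-in-Docker sidecar (the pattern we abandoned).
# The daemon cannot run without privileged mode.
apiVersion: v1
kind: Pod
spec:
  containers:
  - name: jnlp                    # Jenkins inbound agent
    image: jenkins/inbound-agent:latest
    env:
    - name: DOCKER_HOST
      value: tcp://localhost:2375
  - name: dind
    image: docker:dind
    securityContext:
      privileged: true            # full host access; a cluster-wide risk
    env:
    - name: DOCKER_TLS_CERTDIR
      value: ""                   # plain TCP on localhost (insecure)
```

A compromised build in that pod is a compromised node.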

Attempt 2: Kaniko

Kaniko builds images without a Docker daemon. Runs unprivileged. Reads Dockerfiles, builds images, pushes to registries—all from within a standard pod.

Why we moved on: Worked well for simple Dockerfiles. Struggled with complex multi-stage builds. Cache management was awkward. Performance wasn't great for large builds. Felt like a workaround, not a solution.
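Kaniko runs as a one-shot container inside the agent pod. A typical invocation, with placeholder registry and repo names:

```shell
# Kaniko builds from inside an unprivileged pod -- no Docker daemon.
# Registry and repo names are placeholders.
/kaniko/executor \
  --dockerfile=Dockerfile \
  --context=dir:///workspace \
  --destination=registry.example.com/team/app:${GIT_COMMIT} \
  --cache=true \
  --cache-repo=registry.example.com/team/app-cache
```

The layer cache lives in a registry repo (`--cache-repo`), which works, but every cache hit is a network round trip — part of why large builds dragged.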

Final Solution: BuildKit

BuildKit is Docker's next-gen build engine. Rootless mode, better caching, faster builds, proper multi-stage support.

Why it works: Runs rootless (no privileged pods). Cache management is excellent—persistent volumes or inline caching work smoothly. Handles complex builds without issues. Performance matches or beats Docker daemon builds.

The tradeoff: Slightly more complex setup than Kaniko. Requires configuring BuildKit daemon mode or using buildctl directly. Worth it for production use.
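A typical invocation via `buildctl`, assuming a `buildkitd` deployment reachable in-cluster — the service address, registry, and cache ref are placeholders:

```shell
# Build and push with BuildKit from a Jenkins agent pod.
# The buildkitd address, registry, and cache ref are placeholders.
buildctl --addr tcp://buildkitd.jenkins.svc:1234 build \
  --frontend dockerfile.v0 \
  --local context=. \
  --local dockerfile=. \
  --output type=image,name=registry.example.com/team/app:${GIT_COMMIT},push=true \
  --export-cache type=registry,ref=registry.example.com/team/app-cache \
  --import-cache type=registry,ref=registry.example.com/team/app-cache
```

Registry-backed cache export/import is what made the difference for us: ephemeral agent pods come and go, but the cache survives between builds.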

Making Jenkins Ephemeral with EFS

Jenkins is stateful. Job configs, build history, plugins—all live on disk. Running stateful services in Kubernetes means handling persistence properly.

AWS EFS (Elastic File System) solved this. Network file system accessible from any node. Jenkins pod dies? Reschedules on another node with the same data. Mount EFS to /var/jenkins_home and the controller becomes truly ephemeral. Upgrade? Delete the pod. New one starts with identical state.
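With the AWS EFS CSI driver, the wiring is a storage class plus a `ReadWriteMany` claim. A sketch — the filesystem ID and sizes are placeholders:

```yaml
# StorageClass backed by the AWS EFS CSI driver, plus the claim
# Jenkins mounts at /var/jenkins_home. The fileSystemId is a placeholder.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: efs-sc
provisioner: efs.csi.aws.com
parameters:
  provisioningMode: efs-ap        # dynamic provisioning via access points
  fileSystemId: fs-0123456789abcdef0
  directoryPerms: "700"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: jenkins-home
spec:
  accessModes:
  - ReadWriteMany                 # any node can mount it
  storageClassName: efs-sc
  resources:
    requests:
      storage: 20Gi               # EFS is elastic; the field is required anyway
```

`ReadWriteMany` is the key property: the rescheduled controller pod mounts the same filesystem from whatever node it lands on.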

What We Got

Cost Savings: Agents scale to zero overnight and weekends. Cut build infrastructure costs by ~60%.

Better Resource Utilization: Multiple builds share nodes efficiently. No more wasted large instances for small builds.

Faster Feedback: Builds start in seconds. Kubernetes pod scheduling beats EC2 instance launches.

Simpler Operations: No instance fleet management. Container images replace SSH configuration.

The Real Costs

Initial Setup Complexity: Getting Jenkins + Kubernetes + BuildKit + EFS working took time. The Helm chart isn't plug-and-play.

Learning Curve: Debugging moved from SSH to pod logs and Kubernetes events.

Build Scripts Changed: Pipelines needed updates for BuildKit's rootless mode.

Monitoring: Prometheus and Grafana handle metrics and dashboards. A shift from CloudWatch on EC2, but more flexible.

Worth It?

Yes. The initial complexity pays off in operational simplicity and cost savings.

If you're running Jenkins on static EC2 with variable build loads, Kubernetes makes sense. Scaling to zero alone justifies the migration.

Budget time for BuildKit configuration and pipeline testing before going to production.

---

Running Jenkins in Kubernetes? What's your build image strategy? Still fighting with DinD? Found a better approach than BuildKit?

This setup works for us, but CI/CD infrastructure is never truly finished.

#jenkins
