This post isn’t a tutorial on how to write a Kubernetes scheduler; it’s a reminder to myself that software isn’t magic and sometimes projects that sound really challenging are achievable with some persistence. I’m proud of this project, even though I eventually chose a different approach to the problem I was trying to address, and I hope that it’ll serve as a useful experience report for anyone thinking of building a scheduler plugin (or for the sig-scheduling maintainers). This post assumes some familiarity with Kubernetes pods, nodes, and the concept of scheduling.
I previously worked on the Tekton Pipelines project, a CI/CD platform built on Kubernetes. Tekton users build CI/CD Pipelines and run them via PipelineRuns. Under the hood, a PipelineRun is usually run in multiple Kubernetes pods, and uses Kubernetes PersistentVolumes for storage. The way Kubernetes schedules PipelineRun pods and provisions PipelineRun storage often created frustrating UX issues for Tekton users, and I hoped to address this by ensuring all the pods would be scheduled to the same node where the storage was provisioned.
While the Kubernetes API provides several ways of controlling scheduling, there’s currently no supported way to forcibly run a group of pods on the same node. There’s an existing “coscheduling” scheduler plugin that aims to meet this need, so I decided to experiment with a similar plugin to see if it could be adapted to my use case.
In a nutshell, the Kubernetes scheduler works as follows: it watches for pods that haven’t yet been assigned to a node, and for each one it runs a scheduling cycle that filters out nodes that can’t run the pod and scores the remaining candidates, followed by a binding cycle that assigns the pod to the chosen node.
Kubernetes has a scheduling framework that allows plugins to register at multiple points during the scheduling process. My plugin registered at two extension points:
I first used a “PreFilter” extension point, which allows a plugin to pre-process pod info at the beginning of a scheduling cycle and return a set of “candidate” nodes for filtering. When a new pod was ready for scheduling, my PreFilter plugin determined whether any node was already running pods associated with the same Tekton PipelineRun, and if so, returned that node as the only valid candidate.
Next, my scheduler used a “Filter” extension point during the scheduling cycle to determine which of the candidate nodes were suitable for running the pod. My Filter plugin optionally filtered out any nodes already running pods for other Tekton PipelineRuns, depending on its configuration.
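To make those two extension points concrete, here’s roughly what they look like in code. This is a simplified sketch rather than my actual plugin: the struct name, the pipelineRunLabel key, and the omitted config handling are stand-ins, and the interfaces come from k8s.io/kubernetes/pkg/scheduler/framework, whose exact signatures (the type of PreFilterResult.NodeNames, the shape of the New factory) have shifted between Kubernetes versions.

package onenodeperpipelinerun

import (
	"context"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/apimachinery/pkg/util/sets"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

const (
	// Name must match the plugin name enabled in the scheduler configuration.
	Name = "OneNodePerPipelineRun"
	// Hypothetical label identifying which PipelineRun a pod belongs to.
	pipelineRunLabel = "tekton.dev/pipelineRun"
)

// OneNodePerPipelineRun implements the PreFilter and Filter extension points.
type OneNodePerPipelineRun struct {
	handle framework.Handle
}

var _ framework.PreFilterPlugin = &OneNodePerPipelineRun{}
var _ framework.FilterPlugin = &OneNodePerPipelineRun{}

// New is the factory the framework calls to construct the plugin.
// (Older framework versions use a factory without the context argument.)
func New(ctx context.Context, obj runtime.Object, h framework.Handle) (framework.Plugin, error) {
	return &OneNodePerPipelineRun{handle: h}, nil
}

func (p *OneNodePerPipelineRun) Name() string { return Name }

// PreFilter: if some node is already running pods from this pod's PipelineRun,
// return that node as the only candidate; otherwise leave all nodes in play.
func (p *OneNodePerPipelineRun) PreFilter(ctx context.Context, state *framework.CycleState, pod *v1.Pod) (*framework.PreFilterResult, *framework.Status) {
	pr := pod.Labels[pipelineRunLabel]
	if pr == "" {
		return nil, nil // not a PipelineRun pod; don't restrict candidates
	}
	nodeInfos, err := p.handle.SnapshotSharedLister().NodeInfos().List()
	if err != nil {
		return nil, framework.AsStatus(err)
	}
	for _, ni := range nodeInfos {
		for _, pi := range ni.Pods {
			if pi.Pod.Labels[pipelineRunLabel] == pr {
				return &framework.PreFilterResult{NodeNames: sets.New(ni.Node().Name)}, nil
			}
		}
	}
	return nil, nil
}

func (p *OneNodePerPipelineRun) PreFilterExtensions() framework.PreFilterExtensions {
	return nil
}

// Filter: reject nodes already running pods from a *different* PipelineRun.
// (In the real plugin this behavior was gated by configuration.)
func (p *OneNodePerPipelineRun) Filter(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
	pr := pod.Labels[pipelineRunLabel]
	if pr == "" {
		return nil
	}
	for _, pi := range nodeInfo.Pods {
		if other := pi.Pod.Labels[pipelineRunLabel]; other != "" && other != pr {
			return framework.NewStatus(framework.Unschedulable, "node is reserved for another PipelineRun")
		}
	}
	return nil
}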
This worked and was surprisingly simple, coming in at fewer than 200 lines of code. The Kubernetes “scheduler-plugins” repo has a number of example plugins that I found very useful.
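Those examples also show how a custom plugin gets compiled into a scheduler binary. A sketch of that wiring, under the assumption that the plugin package is the one sketched above (the import path is a placeholder):

package main

import (
	"os"

	"k8s.io/kubernetes/cmd/kube-scheduler/app"

	// Placeholder import path for the plugin package sketched above.
	onenode "example.com/scheduler/pkg/onenodeperpipelinerun"
)

func main() {
	// Build a kube-scheduler command with the custom plugin registered
	// alongside the in-tree plugins, then run it.
	cmd := app.NewSchedulerCommand(
		app.WithPlugin(onenode.Name, onenode.New),
	)
	if err := cmd.Execute(); err != nil {
		os.Exit(1)
	}
}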
On local clusters, scheduler plugins can replace the default scheduler, or run in parallel with it as a second scheduler. However, I wanted to deploy my plugin on GKE, which doesn’t support replacing the default scheduler, so I had to use multiple schedulers. In addition to the official documentation on running multiple schedulers, the following blog posts provided some starting examples of scheduler plugin configuration:
I created a deployment to run my second scheduler, and mounted its configuration via a configmap (a sketch of the working setup appears below). Scheduler plugin configuration is poorly documented, so while the blog posts I found served as helpful starting examples, I wasn’t able to use them exactly as written. I found the following debugging strategies useful in creating a working deployment:
Use kubectl get events --field-selector involvedObject.name=$POD_NAME or kubectl describe pod $POD_NAME to get events associated with a specific pod. For example, the following events are associated with a pod handled by the GKE default scheduler:

Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  52m   default-scheduler  Successfully assigned default/catalog-publish-trigger-tekton-upstream-28250970-tghlt to gke-dogfooding-default-pool-f62aa79c-94oa
Check the scheduler pod’s logs to confirm the component config it’s actually running with. For example:

I0302 19:24:21.146841       1 configfile.go:105] "Using component config" config=<
  apiVersion: kubescheduler.config.k8s.io/v1
  ... # Truncated
  profiles:
  - pluginConfig:
    - args:
        apiVersion: kubescheduler.config.k8s.io/v1
        kind: DefaultPreemptionArgs
        minCandidateNodesAbsolute: 100
        minCandidateNodesPercentage: 10
      name: DefaultPreemption
    ... # Truncated
    plugins:
      multiPoint:
        enabled:
        - name: PrioritySort
          weight: 0
        ... # Truncated
        - name: DefaultBinder
          weight: 0
        - name: OneNodePerPipelineRun
          weight: 0
    ... # Truncated
    schedulerName: one-node-per-pipelineRun
The scheduler’s logs tended to lead me down the wrong path at least as often as they led me down the right one, so they were of limited use as a debugging tool.
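For reference, the working setup looked roughly like the following. This is a sketch rather than the exact manifests I used: the resource names, namespace, image, and binary path are placeholders, while the scheduler name and plugin name match the config dump above.

apiVersion: v1
kind: ConfigMap
metadata:
  name: one-node-per-pipelinerun-config
  namespace: kube-system
data:
  scheduler-config.yaml: |
    apiVersion: kubescheduler.config.k8s.io/v1
    kind: KubeSchedulerConfiguration
    leaderElection:
      leaderElect: false
    profiles:
      - schedulerName: one-node-per-pipelineRun
        plugins:
          multiPoint:
            enabled:
              - name: OneNodePerPipelineRun
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: one-node-per-pipelinerun-scheduler
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: one-node-per-pipelinerun-scheduler
  template:
    metadata:
      labels:
        app: one-node-per-pipelinerun-scheduler
    spec:
      # The service account needs RBAC comparable to the default scheduler's.
      serviceAccountName: one-node-per-pipelinerun-scheduler
      containers:
        - name: scheduler
          image: example.com/one-node-per-pipelinerun-scheduler:latest # placeholder
          command:
            - /scheduler # placeholder path to the custom scheduler binary
            - --config=/etc/kubernetes/scheduler-config.yaml
          volumeMounts:
            - name: config
              mountPath: /etc/kubernetes
      volumes:
        - name: config
          configMap:
            name: one-node-per-pipelinerun-config

Pods opt in to the second scheduler by setting spec.schedulerName: one-node-per-pipelineRun; pods that don’t set it are still handled by the GKE default scheduler.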
The difficulty of debugging, and the poor tooling and documentation available for doing so, were among the major factors that led me to decide not to use this plugin.
This scheduler worked well as a prototype, but I eventually decided that the scheduler framework didn’t feel production-ready enough to use in, well… production, for several reasons:
In addition, a scheduler plugin would likely have been hard for Kubernetes cluster operators to install, and might need to be customized for different Kubernetes implementations.
Instead of using a scheduler plugin, we chose to build coscheduling logic directly into Tekton. When a PipelineRun is created, Tekton creates a balloon pod first: a placeholder, no-op pod intended to “anchor” the PipelineRun to the node. Tekton then adds inter-pod affinity to the other PipelineRun pods to ensure they are scheduled on the same node. The balloon pod is created via a statefulset, which is also responsible for managing the storage used by the PipelineRun. If you’re interested in the full design proposal, it’s available here.
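To make the idea concrete, the inter-pod affinity added to a PipelineRun’s pods looks roughly like this. It’s a sketch, not Tekton’s exact implementation; the label key and value are hypothetical stand-ins for whatever labels the balloon pod carries.

spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              # Hypothetical label set on this PipelineRun's balloon pod.
              example.dev/balloon-pod-for: my-pipelinerun
          topologyKey: kubernetes.io/hostname # i.e. "the same node"

Because the affinity is required, any pod carrying it can only be scheduled onto the node where the balloon pod is already running.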