thinking about eevdf

Posted on: August 14, 2024 at 12:00 PM

This article is intended to be the first in a series exploring Linux schedulers. I think it’s a really exciting space right now, and I feel that not enough people know about its cool little details.

Motivation

Really cool things are happening in the scheduling space right now. sched_ext, a framework for writing scheduling policies as eBPF programs, is being merged into the mainline Linux kernel in 6.12. EEVDF was merged recently, in 6.6. Scheduling is no longer as simple as it was 10 years ago, with Intel and AMD both adopting heterogeneous architectures that mix performance and efficiency cores, and NUMA architectures becoming more common.

I was thinking about writing about this quite a while back, but what really inspired me was this Tech Over Tea podcast with David Vernet, a sched_ext maintainer, where he mentioned in passing that even he didn’t understand EEVDF fully.

I found this interesting because I was struggling to find good resources to learn a bit more about the implementation of EEVDF in the Linux kernel; the only good resource I could find was this one-pager on LWN.

A bit about the scheduling lore

Linux schedulers were a hot topic of discussion in the past, and there’s a lot of lore about how the maintainers think about scheduling. I think this video summarizes it really well and also gives some context about the past schedulers in the kernel.

How to think about schedulers and fairness

Schedulers are essentially just programs that allocate CPU resources to tasks. These tasks, or processes, could be anything running on your computer, from a Python script to your web browser. In this case, the resource we want to fairly allocate is the amount of time each task gets on the CPU.

In a general-purpose workload, we may have a few thousand processes and anywhere from 4 to 16 CPU cores to work with. This raises the question: how do we fairly distribute CPU time? And what other concerns and heuristics should we care about when looking at schedulers?

Fairness and interactivity

The first important intuition to have: if all processes are given some sort of “equal” amount of time or resources, the scheduler is “fair”. This is really important because it ensures processes won’t be starved, i.e., left without any CPU time over a long period.

The second intuition is this: we no longer live in a world where batch computing is the norm. We interact with computers through web browsers and interactive applications, and whether the system is responsive is a huge concern.

How EEVDF achieves fairness and prioritizes interactivity

Virtual runtime

EEVDF, as well as its predecessor CFS, relies on the concept of vruntime. This is the “current” virtualized time of a task: the amount of CPU time the task has received, weighted by its niceness (think of niceness as a priority level). The kernel maps niceness values to weights with the following table:

const int sched_prio_to_weight[40] = {
 /* -20 */     88761,     71755,     56483,     46273,     36291,
 /* -15 */     29154,     23254,     18705,     14949,     11916,
 /* -10 */      9548,      7620,      6100,      4904,      3906,
 /*  -5 */      3121,      2501,      1991,      1586,      1277,
 /*   0 */      1024,       820,       655,       526,       423,
 /*   5 */       335,       272,       215,       172,       137,
 /*  10 */       110,        87,        70,        56,        45,
 /*  15 */        36,        29,        23,        18,        15,
};

So, if a task with a niceness of 0 runs for 1000 time units, 1000 * (1024/1024) = 1000 will be added to the task’s vruntime. If a task with a niceness of -1 runs for 1000 time units, 1000 * (1024/1277) ≈ 802 will be added to the task’s vruntime.

Because a task’s vruntime advances more slowly the higher its weight, a task with a niceness of -1 has to receive more real CPU time than a niceness 0 task to accrue the same vruntime; the weight table is tuned so that each nice level corresponds to roughly a 10% change in CPU share. Given that a CPU has a limited amount of time to spread between multiple tasks, this weighted vruntime system allows the kernel to have some notion of fairness, while also allowing higher-priority tasks to be allocated more time.
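To make the arithmetic concrete, here is a minimal C sketch of that accrual, using the weight table above. The name vruntime_delta and the plain integer math are mine for illustration; the kernel does the equivalent scaling (with fixed-point arithmetic and overflow handling) in calc_delta_fair().

#include <stdio.h>

#define NICE_0_WEIGHT 1024

static const int sched_prio_to_weight[40] = {
 /* -20 */     88761,     71755,     56483,     46273,     36291,
 /* -15 */     29154,     23254,     18705,     14949,     11916,
 /* -10 */      9548,      7620,      6100,      4904,      3906,
 /*  -5 */      3121,      2501,      1991,      1586,      1277,
 /*   0 */      1024,       820,       655,       526,       423,
 /*   5 */       335,       272,       215,       172,       137,
 /*  10 */       110,        87,        70,        56,        45,
 /*  15 */        36,        29,        23,        18,        15,
};

/* vruntime gained = real runtime, scaled by NICE_0_WEIGHT / task weight. */
static long long vruntime_delta(long long runtime, int nice)
{
    int weight = sched_prio_to_weight[nice + 20];
    return runtime * NICE_0_WEIGHT / weight;
}

int main(void)
{
    printf("nice  0: %lld\n", vruntime_delta(1000, 0));   /* 1000 */
    printf("nice -1: %lld\n", vruntime_delta(1000, -1));  /* ~801 with integer division */
    return 0;
}

A higher weight means a smaller vruntime delta for the same real runtime, which is exactly why higher-priority tasks end up with more CPU time before the fairness bookkeeping catches up with them.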

Lag

Well, how does the scheduler ensure that tasks receive “equal” amounts of runtime? That’s where lag comes into play. The actual calculation of lag is a lot more complex, but the kernel simplifies it to roughly:

Process lag = weighted average of every runnable task’s vruntime - the process’s current vruntime

Given two tasks, A and B, with vruntimes of 10 and 30 (and equal weights), the average vruntime is 20, so they have lags of 10 and -10 respectively. We say that A has positive lag and B has negative lag. In EEVDF, only processes with non-negative lag are eligible to be scheduled. In other words, task B won’t be scheduled until task A catches up. This is how EEVDF ensures fairness.
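Here’s a tiny sketch of that bookkeeping, assuming equal weights so the “weighted average” collapses to a plain average; the struct and function names are illustrative, not the kernel’s.

#include <stdio.h>

struct task {
    const char *name;
    long long vruntime;
};

/* lag = average vruntime of all runnable tasks - this task's vruntime */
static long long task_lag(const struct task *t, long long avg_vruntime)
{
    return avg_vruntime - t->vruntime;
}

/* Under EEVDF, a task is eligible to run only if its lag is non-negative. */
static int task_eligible(const struct task *t, long long avg_vruntime)
{
    return task_lag(t, avg_vruntime) >= 0;
}

int main(void)
{
    struct task a = { "A", 10 }, b = { "B", 30 };
    long long avg = (a.vruntime + b.vruntime) / 2;  /* 20 */

    printf("%s: lag=%lld eligible=%d\n", a.name, task_lag(&a, avg), task_eligible(&a, avg));
    printf("%s: lag=%lld eligible=%d\n", b.name, task_lag(&b, avg), task_eligible(&b, avg));
    return 0;
}

Running this prints a lag of 10 for A and -10 for B, with only A eligible, matching the example above.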

Deadlines

Well, what about responsiveness and interactivity? That’s where the second part of EEVDF comes in: deadlines.

Once again, the original EEVDF paper derives the deadline in a much more detailed and complex way, but the Linux kernel, constrained to what is practical in kernel space, simplifies it. The deadline is calculated as:

Virtual deadline = eligible time + (time slice/weight)

The eligible time in the kernel is just the task’s current vruntime. The key to scheduling with EEVDF is that we don’t simply take the task with the largest lag; among the eligible tasks, we pick the one with the earliest virtual deadline, hence the name: earliest eligible virtual deadline first.
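Putting eligibility and deadlines together, here is a rough sketch of the selection rule, assuming a plain array of tasks instead of the kernel’s augmented rbtree; update_deadline and pick_next are made-up names for illustration, not kernel functions.

#include <stdio.h>

#define NICE_0_WEIGHT 1024

struct task {
    long long vruntime;   /* the task's eligible time */
    long long deadline;   /* virtual deadline */
    long long slice;      /* requested time slice */
    int weight;
};

/* Virtual deadline = eligible time + slice scaled down by the task's weight. */
static void update_deadline(struct task *t)
{
    t->deadline = t->vruntime + t->slice * NICE_0_WEIGHT / t->weight;
}

/* Pick the eligible task (vruntime <= average, i.e. lag >= 0) with the
 * earliest virtual deadline. */
static struct task *pick_next(struct task *tasks, int n, long long avg_vruntime)
{
    struct task *best = NULL;
    for (int i = 0; i < n; i++) {
        if (tasks[i].vruntime > avg_vruntime)   /* negative lag: not eligible */
            continue;
        if (!best || tasks[i].deadline < best->deadline)
            best = &tasks[i];
    }
    return best;
}

int main(void)
{
    struct task tasks[2] = {
        { .vruntime = 10, .slice = 3,  .weight = 1024 },  /* interactive-ish: short slice */
        { .vruntime = 10, .slice = 30, .weight = 1024 },  /* batch-ish: long slice */
    };
    for (int i = 0; i < 2; i++)
        update_deadline(&tasks[i]);

    long long avg = 10;  /* both tasks are eligible here */
    struct task *next = pick_next(tasks, 2, avg);
    printf("picked task with deadline %lld\n", next->deadline);  /* 13: the short slice wins */
    return 0;
}

With two tasks at the same vruntime, the one that asked for the shorter slice gets the earlier virtual deadline (13 vs 40 here) and runs first, which is exactly the behavior described in the next paragraph and in the Igalia quote below.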

The idea behind deadlines comes from the fact that interactive tasks are usually not very compute-intensive and tend to yield early, so their vruntime, and therefore their virtual deadline, stays low, allowing them to run before the compute-intensive tasks churning away in the background. Requesting a shorter time slice pulls the virtual deadline in even further.

Here’s a quote from Igalia’s blog post that explains this very eloquently:

EEVDF scheduler naturally handles latency-critical tasks unlike CFS. A latency-critical task has a small latency nice value, so it has a shorter time slice. Therefore, its virtual deadline is earlier than non-latency critical tasks with longer slices. Hence, suppose two tasks have the same nice value with different latency nice values. The latency-critical task will be scheduled more frequently with a shorter slice.

How does this affect sched_ext?

Essentially, what sched_ext does is replace the EEVDF scheduler for the most part. For a sched_ext scheduler to be “good”, it still has to reason about fairness in some way, often one that differs from EEVDF. Both rusty and lavd from the scx repo have really interesting ideas about how latency-critical tasks should be considered when scheduling, and how the calculation of the virtual deadline can be augmented to reflect that.

You can read more here: https://github.com/sched-ext/scx/blob/main/scheds/rust/scx_lavd/src/bpf/main.bpf.c

What’s next?

This hopefully serves as a foundation for understanding how the modern Linux scheduler works. I want to eventually dive into the world of sched_ext schedulers, as that seems to be a very exciting field, but to do that, a good understanding of the current state of schedulers is required.