Linux Kernel Pillow Talk

By Moshe Bar

October 29, 2001

And you thought the netherworld of dry kernel engineering was free of politics, egos, and prima donnas? Guess again. The events of the last four to six weeks and the e-mails flying across the Linux kernel mailing list show how Byzantine and complex the dynamics of decision making, feature design, and implementation can be. Go to http://www.tux.org/lkml/ to subscribe to the kernel mailing list, but be careful: This is a very high-traffic list. Subscribe only if you really want to follow every single detail of the Linux kernel; otherwise, read the weekly Linux Kernel Cousin digest at http://kt.zork.net/kernel-traffic.

Sure, lively debates have always existed. In the past there have been disputes about the Linux firewalling code, the networking code, the scheduler, the installer, the driver model, and much more. One recurrent theme has always been the Virtual Memory (VM) manager. Nothing determines the peculiar behavior, the feel, even the ultimate success or failure of an operating system like its virtual memory design. Sometime during the development cycle leading up to the Linux 2.4.0 kernel, in other words in 2.3.xx times, Rik van Riel (http://www.surriel.com/), a Dutch kernel hacker working for Brazil-based Conectiva (one of the smaller Linux distributions), introduced radically new VM code. It was based on what seemed to be new and advanced algorithms for efficiently finding, allocating, and disposing of the virtual memory pages requested by programs. Rik later introduced an interesting new kernel feature called the "OOM killer." OOM stands for Out Of Memory. The OOM killer attempts to locate a killable process when the system runs out of memory. Without such a feature the whole machine can go nuts or enter a vicious cycle: swapping out a few pages, realizing immediately afterwards that those pages are needed again, searching for new swappable candidates, and so on, keeping the kernel busy doing only this instead of letting user processes run.
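
To make the idea concrete, here is a toy, user-space sketch in C of the kind of heuristic an OOM killer applies. The structure, the field names, and the weights are my own invented assumptions for illustration; this is not the kernel's actual code.

    /* Illustrative only: pick the process whose death frees the most
     * memory while sparing essential processes such as init (PID 1)
     * and, to a degree, root-owned daemons.                          */
    #include <stdio.h>

    struct proc {
        int  pid;
        long rss_pages;   /* resident memory the process occupies    */
        int  is_root;     /* privileged processes are penalized less */
    };

    /* Higher "badness" means a better candidate for killing. */
    static long badness(const struct proc *p)
    {
        long score;

        if (p->pid == 1)          /* never pick init */
            return -1;

        score = p->rss_pages;     /* start with the memory footprint */
        if (p->is_root)
            score /= 4;           /* be reluctant to kill root's daemons */
        return score;
    }

    static const struct proc *pick_victim(const struct proc *tbl, int n)
    {
        const struct proc *victim = NULL;
        long best = 0;

        for (int i = 0; i < n; i++) {
            long s = badness(&tbl[i]);
            if (s > best) {
                best = s;
                victim = &tbl[i];
            }
        }
        return victim;            /* NULL means nothing sensible to kill */
    }

    int main(void)
    {
        struct proc table[] = {
            {   1,  120, 1 },     /* init: off limits    */
            { 942, 9000, 0 },     /* a bloated user job  */
            { 617, 4000, 1 },     /* a root-owned daemon */
        };
        const struct proc *v = pick_victim(table, 3);

        if (v)
            printf("would kill pid %d\n", v->pid);
        return 0;
    }

The real code weighs more factors than this, but the principle is the same: score every process and kill the one whose death frees the most memory at the smallest cost to the system.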

Rik is a gifted hacker, and among other things he has been trying to improve the efficiency and speed of maintaining the kernel lists responsible for managing all the virtual memory pages in the system. One of the main questions every operating system's VM code must answer is: "How do you choose which page to steal next when there is a RAM shortage?"

In the 2.4.0 release, the Linux kernel scans the process page tables and decides which pages to remove. The problem with this approach is that sometimes a lot of page-table entries have to be scanned to free just one page, or very few pages. Nor does this approach guarantee that the pages stolen are only those that will not be needed again very soon. Some UNIXes introduced the notion of the working set, that is, the minimum set of pages a process needs to run efficiently. This solution is, however, limited to per-process pages only and does not consider other kinds of pages, such as filesystem caches. Stealing such pages might in some cases even prove counter-productive. Very often in VM theory, a solution to one problem worsens another; that's why kernel programming is difficult.
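
For illustration, here is a tiny user-space sketch in C of this kind of virtual scan. The data structure, the table size, and the one-bit "accessed" flag are deliberate simplifications of mine, not what the hardware or the kernel really uses; the point is only that many entries may have to be examined before a single cold page turns up.

    /* Toy illustration of scanning page-table entries for a reclaim
     * candidate: recently used entries get their accessed bit cleared
     * and a second chance, and the first cold entry found is stolen.  */
    #include <stdio.h>

    #define ENTRIES 1024

    struct pte { int present; int accessed; };   /* simplified entry */

    static int find_victim(struct pte *pt, int *scanned)
    {
        for (int i = 0; i < ENTRIES; i++) {
            (*scanned)++;
            if (!pt[i].present)
                continue;
            if (pt[i].accessed) {      /* recently used: second chance */
                pt[i].accessed = 0;
                continue;
            }
            return i;                  /* cold page: steal this one */
        }
        return -1;                     /* nothing evictable this pass */
    }

    int main(void)
    {
        struct pte pt[ENTRIES];
        int scanned = 0;

        for (int i = 0; i < ENTRIES; i++) {
            pt[i].present  = 1;
            pt[i].accessed = (i != 800);   /* pretend only entry 800 is cold */
        }

        int victim = find_victim(pt, &scanned);
        printf("victim %d found after scanning %d entries\n", victim, scanned);
        return 0;
    }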

Rik van Riel and I have at various times discussed another approach, called "reverse mapping," which implements a reverse lookup from a physical page back to the page-table entries that map it. Once pages are reverse mapped, the VM can simply scan the pages themselves for the ones to be freed. Naturally, some extra fields need to be added to the appropriate control structures to allow this reverse mapping. My own implementation has an overhead of 14 bytes and is therefore certainly the lesser solution; Rik's overhead is just 8 bytes.
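
Here is a much-simplified sketch in C of what reverse mapping adds: a per-page chain of back pointers to the page-table entries that map the page. The structure layouts and names are illustrative assumptions of mine, not Rik's code or my own patch; they only show why eviction becomes a walk of one short chain instead of a scan of whole page tables.

    /* Toy reverse mapping: each page frame keeps a chain of the
     * page-table entries that map it.                              */
    #include <stdio.h>
    #include <stdlib.h>

    struct pte;                      /* simplified page-table entry */

    struct pte_chain {               /* one link per mapping of the page */
        struct pte       *ptep;
        struct pte_chain *next;
    };

    struct page {                    /* simplified page frame descriptor */
        unsigned long     flags;
        struct pte_chain *rmap;      /* the extra per-page field rmap needs */
    };

    struct pte { struct page *page; };

    /* Record that *ptep now maps *page. */
    static void rmap_add(struct page *page, struct pte *ptep)
    {
        struct pte_chain *link = malloc(sizeof(*link));
        link->ptep = ptep;
        link->next = page->rmap;
        page->rmap = link;
    }

    /* When the page is chosen for eviction, walk its chain and unmap it
     * from every page table that references it -- no full table scan.  */
    static void rmap_unmap_all(struct page *page)
    {
        for (struct pte_chain *l = page->rmap; l; ) {
            struct pte_chain *next = l->next;
            l->ptep->page = NULL;    /* "unmap": drop the reference */
            free(l);
            l = next;
        }
        page->rmap = NULL;
    }

    int main(void)
    {
        struct page frame = { 0, NULL };
        struct pte a = { &frame }, b = { &frame };   /* two processes map it */

        rmap_add(&frame, &a);
        rmap_add(&frame, &b);
        rmap_unmap_all(&frame);

        printf("mappings left: %s\n", (a.page || b.page) ? "some" : "none");
        return 0;
    }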

Other extremely talented kernel hackers such as Marcelo Tosatti and Ben LaHaise have made other important contributions to the Linux VM.

However, even though all these intelligent people tried hard to make the Linux VM fast, efficient, and powerful, user reports since the 2.4.0 release indicated poor Linux kernel performance and erratic, unstable behavior. Up to kernel 2.4.7, for instance, on machines with small memory footprints (less than 40 MB of RAM), sudden swap storms could erupt which would virtually freeze the system while it inexplicably started swapping pages in and out like crazy. In some cases, the aforementioned OOM killer would choose the wrong process to kill; I have seen the all-important init process killed erroneously. Many fringe kernel projects, like my own Mosix project or others such as Win4Lin, suffered because users blamed them for the instability, assuming that a released kernel like 2.4.0 must be free of such nasty bugs. Even though the kernel gradually evolved from 2.4.0 to 2.4.9, it was evident that the VM design was more of a liability than an advantage.

Linus himself said in a recent posting to the kernel mailing list that he wasn't yet happy with the VM. These problems were enough for many Linux shops to resist migrating to the 2.4 kernels and to continue using 2.2.19-class kernels instead. Obviously, compared to 2.4, the 2.2 series has many shortcomings: no zero-copy networking, separate page and buffer caches for filesystem operations, big spinlocks (serializations of kernel execution paths on computers with more than one CPU) around many parts of the kernel, and so on.

A simple C program like the one below shows how kernels up to 2.4.9 had problems dealing with stress workloads on the VM system. If, after running this program, you turned the swap partition off with swapoff, your server or workstation would become totally unresponsive for up to 15 minutes.

/* based on code originally proposed by Andrew Tanenbaum,
   later by Derek Glidden and many others since */
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    /* In the next line we allocate 200 MB, but since a virtual memory
       page is not actually backed by the kernel until we use it, we
       also have to touch every page. The amount of allocated memory
       should reflect the total RAM of your computer; this test runs
       well on machines with, say, 256 MB. */
    char *p = calloc(50000000, sizeof(int));

    if (p == NULL)
        return 1;

    /* Touch the allocation so every page really gets faulted in. */
    memset(p, 1, 50000000 * sizeof(int));

    /* Let the system calm down a bit after allocating the pages. */
    sleep(12);

    /* And now we release it all again. */
    free(p);
    return 0;
}
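
To try it, compile the program with gcc and run it; once it has exited, turn swap off as root with swapoff -a and watch how long the machine takes to become responsive again. On kernels carrying the new VM described below, the stall should be far shorter, if it appears at all.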

Back in February 2001, I ran an informal and unscientific benchmark comparing FreeBSD 4.1.1 to kernel 2.4.0 (visit http://www.byte.com/documents/s=558/byt20010130s0010/) on exactly the same hardware and with exactly the same subsystems versions (MySQL, Sendmail, Apache, and others). The results clearly showed that, indeed, there were major problems with the efficiency and speed of the early 2.4 kernels.

A New VM

Then, on September 24, with the kernel standing at version 2.4.9, everything suddenly changed. Andrea Arcangeli, an Italian kernel hacker (read my interview with him from two years ago at http://www.byte.com/documents/s=287/byt20000229s0008/) and a very prolific contributor, decided that enough was enough. He sat down and, in one of those marathon hacking bouts, completely rebuilt the VM from scratch. In short succession he sent Linus Torvalds over 150 patches against the 2.4.9 kernel, implementing a new VM engine. This is an extremely remarkable feat. A VM is a major piece of software and by nature very complex. It has to satisfy many opposing objectives: efficient handling of both server-type and interactive loads; ease of implementation and, at the same time, optimized use of every last feature of the CPU. The VM must also run well on Intel CPUs spanning four or five generations, as well as on AMD chips, Alphas, MIPSes, SPARCs, ARMs, and what have you. Andrea, by the way, does all his development on a Compaq AlphaServer with two 500-MHz CPUs and 3 GB of RAM.

Out of the blue, Linus accepted the new VM and incorporated it into the official Linux kernel tree.

Recently, I spent two days giving talks with Andrea. During those two days, over many bottles of beer, we had plenty of time to discuss his new VM. I was mainly interested in how the new VM affects Mosix. Because Mosix must migrate virtual memory pages belonging to a program's address space between cluster nodes, it is important to understand the VM correctly and to interface with it efficiently.

Specifically, Andrea took exception to a number of design problems in the old 2.4 VM, the biggest of which I describe below.

The new VM is much simpler and faster. Let me explain how it works.

The old 2.4 VM had a major design problem that manifested itself mainly when freeing dirty physical pages (remember, dirty pages are the 4-KB page frames in RAM whose contents have been modified but not yet written back to disk). The last owner of a page (usually the VM, except in swapoff) has to clear the dirty flag before freeing the page. In the swapoff case it may be a little more complicated: we may need to grab the pagecache_lock to ensure nobody starts using the page while we clear the flag.
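
The following is only a rough user-space sketch in C, under my own simplifying assumptions, of the rule just described; it is not the actual 2.4 code path. It shows the two cases: the ordinary path where the VM is the last owner and can simply clear the dirty bit, and a swapoff-like path that must hold a lock (standing in for pagecache_lock) while it checks and clears the page. Link with -pthread.

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define PG_dirty 0x01

    struct page {
        unsigned long flags;   /* PG_dirty set = not written back yet  */
        int in_use;            /* someone else still uses this page    */
    };

    static pthread_mutex_t pagecache_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Ordinary reclaim path: the VM is the last owner of the page, so
     * it may clear the dirty bit and free the page without further ado. */
    static void vm_free_page(struct page *pg)
    {
        pg->flags &= ~PG_dirty;
        free(pg);
    }

    /* swapoff-like path: take the page-cache lock so that no other path
     * can look the page up and start using it while we clear the bit.  */
    static void swapoff_free_page(struct page *pg)
    {
        pthread_mutex_lock(&pagecache_lock);
        if (pg->in_use) {                  /* somebody got there first */
            pthread_mutex_unlock(&pagecache_lock);
            return;
        }
        pg->flags &= ~PG_dirty;
        pthread_mutex_unlock(&pagecache_lock);
        free(pg);
    }

    int main(void)
    {
        struct page *a = calloc(1, sizeof(*a));
        struct page *b = calloc(1, sizeof(*b));

        a->flags = PG_dirty;
        b->flags = PG_dirty;

        vm_free_page(a);                   /* the common case        */
        swapoff_free_page(b);              /* the more delicate case */
        puts("no page was freed with its dirty bit still set");
        return 0;
    }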

So, Andrea went and did the following: All physical pages are now divided into active and inactive pages, and each of those two classes is further divided into dirty and clean. When the active dirty pages grow to about 66 percent of the total number of pages, the VM starts to scan them for the oldest ones, which are moved to the inactive dirty list and then, later still, from there to swap when memory becomes tight. This part is central to the new VM, and its simplicity is...well, simply stunning.
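
Here is a small, self-contained sketch in C of that classification, written under my own simplifying assumptions; the list handling, the aging field, and the demotion step are illustrative, not Andrea's actual code. It only demonstrates the balancing rule: once the active dirty list passes roughly two thirds of all pages, its oldest pages are demoted to the inactive dirty list, from which they can later be written out and reclaimed.

    #include <stdio.h>
    #include <stdlib.h>

    enum list_id { ACTIVE_CLEAN, ACTIVE_DIRTY, INACTIVE_CLEAN, INACTIVE_DIRTY, NR_LISTS };

    struct page {
        int          id;
        unsigned     age;        /* lower age = older, colder page */
        struct page *next;
    };

    static struct page  *lists[NR_LISTS];
    static unsigned long count[NR_LISTS];
    static unsigned long total_pages;

    static void push(enum list_id l, struct page *pg)
    {
        pg->next = lists[l];
        lists[l] = pg;
        count[l]++;
    }

    /* Demote the oldest active dirty pages once they exceed ~66% of RAM. */
    static void balance_lists(void)
    {
        while (count[ACTIVE_DIRTY] * 3 > total_pages * 2) {
            /* find and unlink the oldest page on the active dirty list */
            struct page **pp = &lists[ACTIVE_DIRTY], **oldest = pp;
            for (; *pp; pp = &(*pp)->next)
                if ((*pp)->age < (*oldest)->age)
                    oldest = pp;
            struct page *victim = *oldest;
            *oldest = victim->next;
            count[ACTIVE_DIRTY]--;

            push(INACTIVE_DIRTY, victim);   /* later: write out, then reclaim */
        }
    }

    int main(void)
    {
        total_pages = 10;
        for (int i = 0; i < 10; i++) {
            struct page *pg = malloc(sizeof(*pg));
            pg->id  = i;
            pg->age = i;
            push(i < 8 ? ACTIVE_DIRTY : ACTIVE_CLEAN, pg);   /* 80% active dirty */
        }

        balance_lists();
        printf("active dirty: %lu, inactive dirty: %lu\n",
               count[ACTIVE_DIRTY], count[INACTIVE_DIRTY]);
        return 0;
    }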

This elegant mechanism totally changes the behavior of the 2.4.10 kernel under heavy load and also makes the system much more predictable. Another very important change is that swap space is now additional to RAM, just as it was in 2.2 times. All earlier 2.4 kernels (since 2.3.12) needed at least as much swap as you had RAM, and only swap beyond that gave you additional virtual memory. This meant that on an 8-GB server, you needed to set aside almost a full 9-GB disk just to be able to swap, similar to some versions of Solaris and other UNIXes.

Finally, the page scanner no longer scans when there are, in theory, no freeable pages, whereas before it did. Oh, and the OOM killer never really worked, so Andrea disabled it, as I had done for all my kernels. In 2.4.12 it is enabled again; this time, however, it works much better. Try it with the above program to see it in action.

Arcangeli's VM is stable, acts predictably (something the old VM never really achieved), and it makes the swap space behave like it did in 2.2 days. Additionally, the design is much simpler and easier to understand; people will catch on to it quickly.

However, many kernel hackers disagree. Upon the release of kernel 2.4.10, a virulent and sometimes aggressive debate flared up, with many people trying to show why one of the two VMs was good and the other was not. Some comments got a bit out of control, and only in the last two weeks or so has some calm been restored.

However, one nasty side effect remains. Alan Cox, the number two man after Linus Torvalds, does not yet like the new VM, and in his own kernel tree (the "ac tree") he continues to use and patch the old VM. As a consequence, users and system administrators now find themselves choosing between two very different kernel trees: the official Linux tree and the Alan Cox tree. Quite often, the latest driver patches and new features appear only in Alan Cox's tree. Those who want to stay with the official Linux source code may find themselves unable to apply those patches because the VM code differs all over the place. It is acceptable for the two trees to diverge for a few days on a subsystem as important as the VM, but it is not acceptable for them to differ for months and across many kernel versions.

Nobody has yet dared to speak of a Linux source fork, but this is dangerously close to one.

It became obvious that the old VM, used up to 2.4.9, was a design liability. You can try to fix something that was designed badly, but it will never become a beauty. I think Linus' decision to scrap the old VM and go with the Arcangeli VM was courageous and right. A functioning and stable Linux box should not have to wait for 2.5 when we can have it already with 2.4.

Kernel Preemption

But apart from the VM issues, there are other lively debates in the kernel community. There was an interesting interview at http://kerneltrap.com/article.php?sid=328&mode=thread&order=0 with Robert Love, who is leading one of two projects trying to make the Linux kernel fully preemptible. Making the kernel preemptible means making it possible to interrupt whatever the kernel is doing (say, executing a system call) to process some other outstanding task and then return to the original one. Linux, as a multitasking OS, has obviously always done that for user-land processes. However, many people, Robert Love among them, feel that the kernel's refusal, until now, to let itself be interrupted has contributed to poor latency. Latency describes how quickly you can expect a response from your kernel when you actually need something from it. Note that Linux is not designed as a real-time OS (though there is at least one real-time Linux implementation out there) and therefore does not explicitly guarantee latency. User-land programs must be aware of this because, especially with kernel preemption, latencies can be very unpredictable.
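
To give a flavor of how such a patch works, here is a minimal user-space sketch in C, written under my own assumptions rather than taken from Robert Love's actual patch. The idea it illustrates is the preemption counter: preemption of kernel code is allowed only when the counter is zero and another task is waiting, and it is held off while the kernel is inside a critical section.

    #include <stdio.h>

    static int preempt_count;   /* >0 means preemption is currently unsafe   */
    static int need_resched;    /* set when a higher-priority task is ready  */

    static void preempt_disable(void) { preempt_count++; }

    static void preempt_enable(void)
    {
        if (--preempt_count == 0 && need_resched)
            printf("preemption point: switching to the waiting task\n");
    }

    /* A pretend system call: the middle part manipulates shared data and
     * must not be preempted, the rest may be.                            */
    static void sys_do_something(void)
    {
        printf("syscall: preemptible work\n");

        preempt_disable();
        printf("syscall: critical section, preemption held off\n");
        need_resched = 1;        /* e.g. an interrupt woke another task */
        preempt_enable();        /* the pending switch happens here     */

        printf("syscall: back from the (simulated) preemption\n");
    }

    int main(void)
    {
        sys_do_something();
        return 0;
    }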

Theoretically, an OS will answer faster if it can be interrupted. What does suffer from kernel preemption is global throughput. If you have a task that needs n seconds inside the kernel to complete (let's say executing a given system call takes 0.005 seconds), then every interruption adds some overhead to switch from one kernel task to another. So, finishing the execution of that system call (in our example) will in the end require n + o*p seconds, where p is the number of times the call is preempted and o is the fixed overhead of one switch. Notice that kernel context switching does not invalidate the CPU cache and is therefore not as expensive as process switching. However, kernel preemption will surely lead to a higher rate of switching from kernel space to user space, because upon preemption the scheduler might decide to give priority to a user process.
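
To put illustrative numbers on it (these are made up for the sake of the arithmetic, not measurements): if a switch costs o = 10 microseconds and the 0.005-second system call is preempted p = 20 times, it now takes 0.005 + 20 * 0.00001 = 0.0052 seconds, roughly 4 percent more kernel time for exactly the same work. The percentages change with the workload, but the overhead term o*p only ever grows with the preemption rate.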

In other words, kernel preemption does decrease latency, but it slows down overall throughput. That's just the math; there is nothing to be done about it.

Furthermore, in his interview, Robert Love heavily criticized Linus Torvalds for adopting Andrea Arcangeli's new VM in 2.4.10 and dropping the old van Riel VM.

Well, I did try the patch with kernel 2.4.12 and with pre13. While the instrumentation Robert Love provides with the preemption patches does indeed report an improvement in latency, for the life of me I have not been able to notice it in everyday use.

I really do appreciate Love's work, but I do not fully agree with some of his comments in the interview. First, as Linus himself said, if latency sucks in the kernel then we should check why it sucks, with or without preemptive scheduling. If the latency is bad in the stock kernel, then it should be fixed anyway.

The preemptive kernel 2.4.12 worked fine on my laptops and on the SGI 550 workstation where I do my interactive work. The MP3 player very rarely skipped a beat while I was doing heavy background work such as compiling a kernel or opening large files in the editor. But for my servers and clusters, the decrease in performance and the unpredictability of latency are a problem. Also, some important patches will not apply to a Love-patched kernel. Mosix, the clustering kernel extension, does not patch correctly, and neither do some versions of the LIDS intrusion detection system.

It is up to each individual user to decide whether or not to use the patch, but it is important to understand the implications of using it.

Linux and FreeBSD Revisited

Upon returning home the other week after meeting with Andrea, I went to my lab and dug out the disk images of the server comparison I ran back in January of this year (FreeBSD 4.1.1 versus Linux 2.4.0). I took the Compaq ML500 server I have been reviewing (two 1-GHz CPUs, 2 GB of RAM) and upgraded the FreeBSD disk image to 4.4-STABLE and the Linux one to 2.4.12. Then I reduced the memory to 192 MB of RAM so as to stress the VM system more. I also upgraded to the latest stable versions of Sendmail (8.12.1) and MySQL (3.23.42). Finally, I compiled everything with the latest version of gcc, 3.0.2, and tuned the two installations to the best of my knowledge (softupdates and an increased maxusers for FreeBSD, untouched default values for Linux).

The results were very interesting indeed. Since the benchmark is too extensive to cover in this article, Byte.com will post it here soon for you to read.

The moral of this story is that the 2.4 kernel has finally grown up with the 2.4.10 release. Not many users outside the relatively small kernel community realize that. Now you know it, too. Spread the good news and install 2.4.12 on your busy server right away. The server will thank you for it.

Moshe Bar is a systems administrator and OS researcher who started learning UNIX on a PDP-11 with AT&T UNIX Release 6, back in 1981. Moshe has an M.Sc. and a Ph.D. in computer science and writes UNIX-related books.

Copyright 2001