Out of Memory Killer

Linux-MM docs: the OOM killer

$Id: oom-killer.shtml,v 1.2 2001/02/01 23:04:25 riel Exp $

Virtual memory (VM) on any system without strict per-user and/or per-group quotas can be completely exhausted, often leaving the system catatonic. It was therefore clear that Linux needed some kind of emergency code to recover virtual memory when that happens.

When VM is exhausted, the OS can do things like adding swap space or suspending processes and writing their images to files, among many others. But even with all these (currently not implemented) tricks, there will be a point where the OS just cannot go any further.

In that situation, the only solution is to kill a process, recovering the memory that that process was using. Killing a process is arguably a bad thing to do, but most people seem to agree that it is better than doing nothing and letting the system hang until either a miracle happens or the system administrator walks by to push the reset button.

Killing a process, however, means that the system loses all the work that has been done by the process and possibly the availability of a system service, which could leave the system unusable too. The OOM (out of memory) killer does its best to minimise the damage by carefully choosing which process to kill, instead of killing something at random.

The goals of the OOM killer are diverse:

don't kill important system services, otherwise the system would still be as good as dead
minimise the amount of work lost
free up as much memory as possible
be predictable, don't cause nasty surprises
be simple and small

Luckily most of these goals are easy to fulfill, as long as we don't try to be perfect but satisfy ourselves with merely "good" behaviour. After all, the OOM killer is mostly about avoiding bad things, so the amount of freedom in choosing a good process to kill is pretty big.

The OOM killer uses the following factors to choose which process to kill in an out-of-memory situation:

memory use, the more memory a process is using, the more memory we will free up and the higher the likelihood that this program is too big for the system and couldn't have run to completion anyway
- more memory use increases the likelihood of being killed
CPU use, the more processor time a process has used, the more work will be lost if we kill this process
- more CPU time decreases the chance of being killed
time since start time, the longer a process has been running, the more likely it is that the process is stable and not "guilty" of exhausting system resources
- a longer run time decreases the chance of being killed
system administrator rights, usually only trusted programs and important system programs run as root or with capabilities enabled
- running as root decreases the chance of being killed
direct hardware access, killing a process which has direct hardware access may lead to hardware getting confused and the machine hanging; also, programs with direct hardware access are usually important for whatever task the system is doing
- direct hardware access decreases the chance of being killed

The system uses these factors in a scoring system, and the process which gets the most points will be killed by the OOM killer. This algorithm takes enough factors into account to choose a good process to kill, yet is simple enough that the results are predictable and the system administrator knows what to expect.
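The scoring idea can be illustrated with a small sketch. This is not the kernel's code: the Proc record, the function names badness and pick_victim, the square-root damping and the divisor of 4 are all assumptions chosen only to show the direction in which each factor pushes the score. Memory use raises it, while CPU time, run time, root rights and direct hardware access lower it.

```python
import math
from dataclasses import dataclass


@dataclass
class Proc:
    """Hypothetical per-process record; field names are illustrative."""
    name: str
    mem_pages: int            # memory in use, in pages
    cpu_seconds: float        # processor time consumed so far
    run_seconds: float        # wall-clock time since the process started
    is_root: bool = False     # runs with system administrator rights
    has_hw_access: bool = False  # has direct hardware access


def badness(p: Proc) -> float:
    """Return a score; a higher score means a more likely victim.

    The weights below are assumptions made for illustration only.
    """
    points = float(p.mem_pages)  # more memory use -> more points

    # More CPU time -> fewer points (more work would be lost).
    points /= max(1.0, math.sqrt(p.cpu_seconds))

    # Longer run time -> fewer points, damped more strongly than CPU time.
    points /= max(1.0, math.sqrt(math.sqrt(p.run_seconds)))

    # Root processes and processes with direct hardware access are
    # assumed to be important, so their score is reduced.
    if p.is_root:
        points /= 4
    if p.has_hw_access:
        points /= 4

    return points


def pick_victim(procs: list[Proc]) -> Proc:
    """Select the process with the highest score."""
    return max(procs, key=badness)


if __name__ == "__main__":
    procs = [
        Proc("daemon", 50_000, 3600, 86_400, is_root=True),
        Proc("hog", 200_000, 4, 16),
        Proc("display", 100_000, 400, 10_000, has_hw_access=True),
    ]
    for p in procs:
        print(p.name, round(badness(p), 2))
    print("victim:", pick_victim(procs).name)
```

With these made-up numbers the large, recently started process ends up with by far the highest score, while the long-running root daemon scores lowest, which matches the intent of the factors listed above.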

Whenever a process gets killed due to OOM (out of memory), the system will print a message, so the system administrator can see in the logfiles what has happened.

Rik van Riel
riel@conectiva.com.br
01/02/2001