Too little, too slow

An introduction to memory management
and
Linux memory management today and tomorrow

Rik van Riel
Conectiva S.A. http://www.conectiva.com/
riel@conectiva.com.br

Generated from the Magicpoint slides 27/nov/2000

(page 1)

Too little, too slow

Memory hierarchy
- Caches
  - mapping, associativity, coherency
- Main memory
  - virtual memory, virtual -> physical mapping
  - page faults, page replacement
- Disk
  - performance characteristics
- HSM

Something on Linux VM in 2.2, 2.4 and 2.5
- how it (doesn't) work(s)
- how to fix things
  - emergency fixes for 2.4
  - stuff we want/need in 2.5

(page 2)

Memory hierarchy

(page 3)

Memory hierarchy: basics

The hierarchy exists because
- fast memory is expensive and has to be small
- slow memory is cheap and can be big

Latency
- how long do I have to wait for the data?
- (cannot do anything while waiting)

Throughput
- how many MB / second?
- not important when you're waiting

Total time = latency + (amount of data / throughput)

(page 4)

Memory hierarchy: who does what?

CPU core
- programmed by the programmer
- speed optimisations mostly invisible
L1, L2 (and L3) cache
- managed by the hardware
- invisible, except for speed differences
RAM
- managed by the OS
- RAM management invisible for user applications
Disk
- managed by the OS
- programs open files, read / write data, etc.

(page 5)

Memory hierarchy: numbers

Below is a table with some "randomly" chosen CPU types and their L1 and L2 cache characteristics. The exact numbers are not very important, but it is good to realise that even the L2 cache is not fast enough to let the CPU core run at 4% of its maximum speed...

Cache size & speed vs. CPU model	Celeron A	PIII	Athlon	K6-III
L1 size	32kB	32kB	128kB	64kB
L2 size	128kB	512kB	512kB	256kB
L1 lat	3	3	3	2
L2 lat	8	24	21	11
L2 total	11	27	24	13
L1 ass.	4	4	2	2
L2 ass.	8	4	2	4

As you can see, RAM and hard disk speeds are so rediculously low that you must avoid RAM and disk access at all cost if you want performance.

RAM latency: 50 - 400 cycles
Disk latency: > 3.000.000 cycles

(page 6)

Memory hierarchy: working set

Why do we still have performance?
- locality of reference
  - most accesses in a small part of memory
  - most data rarely accessed
  - just accessed data often used again soon
- predictable access patterns
  - pipelining
  - readahead

Of course this breaks in applications that don't exhibit good locality \
of reference. Luckily it works well for most programs.

(page 7)

Cache

L1 cache
- closest to the CPU
- fast
- simple
- small (16 - 128kB)
- separate instruction and data caches

L2 cache
- between L1 cache and memory
- larger then L1 cache (128kB - 8MB)
- more complicated
- slower
- unified cache

(page 8)

Cache: how it works

Cache strategies
- cache line
  - data
  - metadata
- mapping
  - direct mapped
  - N-way set associative
  - fully associative
- replacement
  - (direct mapped)
  - LRU
  - semi-random
- write strategy
  - write through
  - write back
  - write around

(page 9)

Cache: dirty memory

When the CPU writes out a value
- we cannot just forget about it
- it has to be written to (slow) memory

Who writes back
- write through
  - CPU writes back the data
- write back
  - cache writes back the data
  - cpu can do something else at the same time

The fastest way of getting a cache line replaced is by not writing \
to it, so we CAN just forget about it when we need to cache something \
else.

(page 10)

Cache: coherency

We must make sure that we won't write stale data to disk or work on \
stale data from another CPU.

Strategies
- selectively flush the cache
- use uncached memory for certain things
- don't allow shared data
- MESI (modified, exclusive, shared, invalid)

(page 11)

Cache: program optimisation

This example is from the Linux kernel source code, from \
kernel/sched.c::reschedule().

When all running processes have used their time slice:

p->counter == 0 for running processes
we recalculate p->counter for all processes
for most processes, p->counter will not change

recalculate:
	for_each_task(p)
		p->counter = (p->counter >> 1) + p->priority;

(page 12)

Cache: program optimisation

The magic trick:

recalculate:
	for_each_task(p)
		p->counter = (p->counter >> 1) + p->priority;

recalculate:
	for_each_task(p) {
		if (p->counter != (p->counter >> 1) + p->priority)
			p->counter = (p->counter >> 1) + p->priority;
	}

(page 13)

RAM

Main memory
- slowest volatile memory
- fairly big (32MB - 8GB)
- management is not done by hardware, but by the OS
- complex memory management is worth it
  - ram is 100 x slower than cpu core
  - ram is 100.000 x faster than disk
- allocation granularity in pages (4 to 64kB)
  - (128 x bigger than a cache line)

(page 14)

RAM: virtual memory

Virtual memory
- can use more address space than available memory
- can run more programs than fit in memory
- every process has its own "virtual" address space
  - processes cannot access each other's memory
  - processes do not have to know about each other
  - page 0 in one process is NOT page 0 in another process
- not all pages of a program have to be in memory
- page faults
  - mapping
  - page replacement
- virtual -> physical mapping
  - MMU
  - page tables
  - TLB

(page 15)

RAM: virtual memory

(page 16)

RAM: virt -> phys mapping

MMU
- Memory Management Unit
- does the translation
TLB
- Translation Lookaside Buffer
- cache for the MMU translations
Page tables
- index of which process (virtual) address is where in physical memory
- often multi-level page tables
If an address is not in the TLB
- hardware lookup by MMU (x86, m68k)
- software lookup by OS (MIPS, Alpha?, PPC)

(page 17)

RAM: page faults

If a page isn't in memory
- processor traps with a PAGE FAULT
- OS gets the error and the job of fixing it
  - grab a page from the free list
  - read in the data from disk (if needed)
  - write the page table entry
  - let the program continue from where it aborted
- the process continues like nothing has happened
  - transparent (the process never sees the page fault)
  - slow (because disk is very very slow)
if the system gets low on free memory
- make more free pages
- chose the right page to evict/replace/page out/...

Because a page fault is very Very VERY slow, we really want to \
minimise the amount of page faults.

(page 18)

RAM: page replacement

If free memory is low, chose a page to free
- perfect page replacement
  - evict the page that will not be used again for the longest time
  - if you do that every time, the number of page faults is minimal
  - optimal
  - violates fundamental law of nature
- Least Recently Used (LRU)
  - evict the page that hasn't been used for the longest time
  - very good for some situations
  - absolute worst-case for some other (common) situations
    - streaming IO
    - working set just over memory size
  - high overhead
- Least Frequently Used (LRU)
  - evict the page that has been used the least often
  - very slow to adopt to changing workload
    - eg. multi-pass compiler
  - high overhead
- LRU approximations
  - evict any page which hasn't been used for a long time
  - simple
  - low overhead
  - good enough

(page 19)

RAM: LRU approximations

NRU
- not recently used
- very rough LRU approximation
- idea: don't evict pages which were just used
NFU
- not frequently used
- keep pages which have been used often in memory
- idea: don't evict pages which are used over and over again
Page aging
- combines good points of NRU and NFU
- relatively low overhead
- used in Linux 2.0, FreeBSD and Linux 2.4
- when a page is used, increase page->age
  - page->age += MAGIC_NUMBER
- when a page is not used, decrease page->age
  - page->age -= ANOTHER_MAGIC_NUMBER
  - page->age = page->age / 2
- remove page from memory when page->age reaches 0

(page 20)

RAM: other replacement tricks

Drop behind
- with streaming IO, data is usually only used once
- deactivate the data behind the readahead window
- page is still in the inactive list if needed again soon
Idle swapout
- for swapout, first look at long sleeping processes
- less chance of evicting a used page from a running process
- we swap out idle processes before eating too much from the cache
- less overhead, you don't find recently accessed pages in a process that has been sleeping for 10 minutes

(page 21)

Disk

Hard disk
- persistant storage
- stored on disk
  - executable code
  - system configuration
  - program and user data
  - metadata (an index of what is where)
  - swapped out memory
- big and cheap (2GB - 100GB)
- 100.000 x slower than memory
- "interesting" performance characteristics
  - high throughput (fast)
  - extremely long seek times (slow)
  - RAID and/or smart controllers make latency unpredictable

(page 22)

Disk: performance characteristics

disk consists of
- spinning metal plates with oxide surface
- disk arms with read/write head
latency
- seek time
- rotational delay
- queue overhead
- 5 - 15 milliseconds (100.000 x slower than RAM)
throughput
- density
- rotational speed
- higher near edge of disk, less near the center
- 5 - 40 MB / second (10 x slower than RAM)

(page 23)

Disk: performance optimisations

How to get disk performance good?

readahead
- we read in the data before it is needed
- the program doesn't have to wait
- we waste some memory, possibly evicting useful data
  - adaptive readahead
reduce the amount of disk seeks
- smart filesystem layout
- metadata caching
- I/O clustering
  - read large blocks of data at once
  - less seek time per MB transfered
don't access the disk
- caching of data
- smart page replacement algorithm (RAM)
- read the data before it is needed
RAID

(page 24)

Disk: RAID

Redundant Array of Inexpensive Disks (RAID)
- data is spread out over multiple disks, often with parity
  - store the data on (for example) 4 disks
  - use an extra disk to store "extra" data
- bigger than a single disk
- more reliable
  - if a disk fails, we recalculate the "missing" data from the data we still have and the data on the extra disk
- faster
  - you have multiple disks to read the data from
- cheaper than a Single Large Expensive Disk
- continues to work if a single disk fails
  - people can continue their work
  - no waiting for (old) data on backup tapes

(page 25)

HSM: beyond disk

Hierarchical Storage Management (HSM)
- data silos (tapes + tape robots)
- disks to cache the data from tape
- used for very large amounts of data
  - > 1 TB of data, up to multiple PB (1Mi GB)
  - backups take too long
    - after a 4-day backup, you have backed up old data
    - waiting 4 days to restore the data is not an option
  - cheaper and more reliable than disk

HSM is used for
- meteorological data
- astronomical observations
- particle accellerator measurements
- big (multimedia) data banks

(page 26)

Too little, too slow

Conclusions of the introduction

faster memory is more expensive and smaller
every type of memory is too little or too slow
good locality of reference improves performance
programs can be optimised for good locality of reference
programmers can easily mess up performance
hardware manages L1 and L2 cache
the OS manages RAM and disk
smart RAM management minimises disk use
disk are fast on throughput and very slow on latency
smart filesystem layout, I/O clustering and RAID maximise disk performance

(page 27)

Linux Memory Management

Linux VMM
- different types of physical pages (zones)
  - ZONE_DMA, ZONE_NORMAL, ZONE_HIGHMEM
- virtual page use
  - page and swap cache
  - shared memory
  - buffer memory
- 2.2 VM
  - do_try_to_free_pages
  - strong and weak points of 2.2 VM
- needed changes in 2.4 VM
  - balanced page aging, multiqueue VM
  - optimised page flushing
- plans for 2.5 / 2.6 VM
- out of memory handling / QoS issues

(page 28)

Linux VM: physical zones

ZONE_DMA
- memory from 0 to 16MB
- can be used for ISA DMA transfers
- can be used for kernel data structures
- can be used for user memory
ZONE_NORMAL
- memory from 16 to 900 MB
- can be used for kernel data structures
- can be used for user memory
ZONE_HIGHMEM
- memory > 900 MB
- can be used for user memory
- highmem pages need to be copied to/from normal memory for \

actions like swapin and swapout

On x86 the zones happen to be "inclusive", but this is not the case for \
some other architectures (eg. with NUMA-like setups).

(page 29)

Linux VM: virtual page use

mapped pages
- mapped in process page tables
- shrunk by swap_out()
- placed in other cache
pagecache & swapcache
- pages here can be mapped too
- caches parts of files and/or swap space
- shrunk by shrink_mmap()
- in 2.2, cannot hold dirty pages
shared memory
- SYSV or POSIX shared memory segments
- processes can and unmap map these segments
- swapped out with shm_swap()
buffer cache
- cached disk data that doesn't fit in the page cache
- buffer heads map the data to a block on disk
- dirty file cache data (in 2.2 only)
- shrunk by shrink_mmap()
kernel memory
- cannot be swapped out
- used for
  - task struct / kernel stack
  - page tables
  - slab cache

(page 30)

Linux VM 2.2: try_to_free_pages

do_try_to_free_pages()
- first, free unused kernel memory
- in a loop, until enough pages are freed
  - shrink_mmap
  - shm_swap
  - swap_out
- one last shrink_mmap, if needed

swap_out()
- walks the virtual memory of all processes
- if it finds a page not accessed since last scan
  - swap out the page and exit

shrink_mmap()
- walks the page and buffer cache pages
- if it finds a page not accessed since last scan
  - if the page is dirty, queue it for IO
  - if the page is clean, free it and exit

You can read the functions in mm/vmscan.c and mm/filemap.c

(page 31)

Linux 2.2 VM: strong / weak points

strong points
- simple
- relatively low overhead for most situations
- works well most of the time

weak points
- doesn't take relative activity of different caches into account
  - eg. unused shared memory segments and heavily used page cache
- simple NRU replacement makes us copy too many pages to/from highmem
- accessed clean pages sometimes get evicted before old dirty pages
- NRU replacement breaks when all pages were accessed (because \

we haven't scanned memory in a long time)

doesn't work well under wildly varying VM load
easily breaks down under very heavy VM load

(page 32)

Linux VM: changes in 2.4

balanced page aging
- page aging
- multiqueue VM
smarter page flushing
- page_launder()
make VM more robust under heavy loads
drop-behind for streaming IO and file writes
- not (yet) supports mmap and friends

Not yet in 2.4 as of september 2000

page->mapping->flush() callback
- journaling, phase tree or soft update filesystems
- page flushing for bufferless (network?) filesystems
pinned buffer reservations
- for journaling, etc. filesystems
write throttling for everything
- currently only works for generic_file_write and friends

(page 33)

Linux VM: 2.4 / 2.5 page aging

balanced page aging
- mapping/unmapping pages takes overhead, so we want few inactive pages
- however, more inactive pages results in better page replacement
- the solution is a dynamic inactive target
separate page aging and page flushing
- page aging is done the same on dirty and clean pages
- the "lazy flush" optimisation is only done for old pages
page pages from all caches on the same queue
- physical page based page aging
- requires physical -> virtual reverse mappings
- fixes balancing issues
- big change, to be done for 2.5
do light background scanning
- no half-hour old referenced bits lying around
- better replacement with load spike after quiet period

(page 34)

Linux VM: multiqueue VM

multiple queues used to balance page aging and flushing
active
- pages are in active use, page->age > 0 or mapped
- page aging happens on these pages
inactive_dirty
- pages not in active use, page->age == 0
- pages that might be(come) reclaimable
  - dirty pages
  - buffer pages
  - pages with extra reference count
- cleaned in page_launder(), moved to inactive_clean list
inactive_clean
- pages not in active use, page->age == 0
- pages which are certainly immediately reclaimable
- can be reclaimed directly by __alloc_pages
- keep data in memory as long as possible

(page 35)

Linux VM: balancing aging and flushing

static free memory target
- free + inactive_clean > freepages.high
- enforced by kswapd
- if free memory is too low, allocations will pause
dynamic inactive memory target
- free + inactive_clean + inactive_dirty > freepages.high + inactive_target
- enforced by kswapd
page flush target
- free + inactive_clean > inactive_dirty
- enforced by bdflush / kflushd
memory_pressure / inactive_target
- one-minute floating average of page replacements
- in __alloc_pages and page_reclaim, memory_pressure++
- in __free_pages_ok, memory_pressure--
- memory_pressure is decayed every second
- inactive_target is one second worth of memory pressure

(page 36)

Linux 2.4 VM: try_to_free_pages

do_try_to_free_pages()
- work towards free target
  - kmem_cache_reap()
  - page_launder()
- work towards free and inactive target
  - refill_inactive()

refill_inactive()
- basically the old try_to_free_pages() with page aging
  - has the same balancing issues as 2.2
  - shrink_mmap() is gone (wooohooo)
  - in 2.5, replace with physical page based aging
- moves pages to the inactive list instead of freeing them

(page 37)

Linux 2.4 VM: page_launder

moves clean, unmapped pages from inactive_dirty to inactive_clean list
flushes dirty pages to disk, but only if needed

pass 1
- move accessed or mapped pages back to the active list
- move clean, unmapped pages to inactive_clean list
- free clean buffercache pages
if there is enough free memory after pass 1, stop
pass 2 (launder_loop, clean dirty pages)
- start async IO on up to MAX_LAUNDER pages
- start synchronous IO on a few pages, but only if we failed to free any page in pass 1

this function is so complex because
- we do not want to start disk IO if it isn't needed
- we do not want to wait for disk IO
- we do not want to start too much disk IO at once

(page 38)

Linux VM: flush callback & pinned pages

page->mapping->flush(page)
- give filesystems more freedom for own optimisations
  - IO clustering
  - allocate on flush / lazy allocate
- advisory call, filesystem can write something else instead
  - write ordering of journaling or phase tree fs
  - not flush the page now because of online fs resizing, etc
- pages no longer depend on the buffer head
  - allocate on flush
  - can use other data structures (kiobuf)
- abstract away page flushing from VM subsystem

pinned reservations for journaling filesystems
- sometimes need extra pages to finish a transaction
  - deadlock potential
- having too many pinned pages screws up page replacement
  - write ordering constraints

(page 39)

Linux VM: plans for 2.5

physical page based aging & equal aging of all pages
improved IO clustering and readahead
anti-thrashing measures (process suspension?)
pagetable sharing
(maybe) page table swapping
large page sizes
NUMA scalability

(page 40)

Linux VM: plans for 2.5

physical page based aging
- virtual scanning based page aging has some "artifacts"
- aging all pages equally fixes the balancing issues
- requires physical -> virtual reverse mappings

improved IO clustering and readahead
- better IO clustering reduces disk seeks a lot
- explicit IO clustering for swapout
- readahead and read clustering on the VMA level
- drop-behind for streaming mmap() and swap

anti-thrashing measures
- the system only supports up to some limit of VM load
- if more processes are running, the system slows to a halt
- solution
  - temporarily suspend one or more of the processes
  - the VM subsystem can keep up with the remaining load
  - suspend / resume the processes in turn, for fairness

(page 41)

Linux VM: more plans for 2.5

pagetable handling improvements
- at the moment, page tables are
  - unswappable
  - unfreeable (until process exit)
  - up to 3MB per process
- we want to reduce this overhead
  - (COW) page table sharing
  - page table destruction / swapping

large page sizes
- some architectures have "variable" page sizes
- Linux needs to have some code changed to support this right

NUMA (and SMP) scalability
- NOT about finer grained locks, that won't work
- move currently global variables and structs to the per-node struct pgdat
- look into using more (per-cpu?) local structures
- look into making code paths lock-less

(page 42)

Linux VM: OOM / QoS

when the system goes OOM (out of memory), kill a process
- don't kill system-critical programs
- lose the minimal amount of work
- recover the maximum amount of memory
- never kill programs with direct hardware access

Linux overcommits memory, processes can reserve more memory than what's available
- this is very efficient
- but sometimes goes wrong
- it might be good to give people a choice about this ;)

Quality of Service issues
- fairness between users and/or groups of users
  - RSS guarantees
  - VM quotas
- different users with different priorities
- in case of overcommit, take user VM usage into account for OOM killing

(page 43)

Acknowledgements

Thanks go out to

DEC, from who I stole the "pyramid" picture
CMU, where I got the other picture
David Olivieri, who helped improve this slides
Linus Torvalds, for giving us lots of work to do

Links

Linux-MM
- http://www.linux.eu.org/Linux-MM/
VM Tutorial
- http://cne.gmu.edu/modules/VM/
Design Elements of the FreeBSD VM System
- http://www.daemonnews.org/200001/freebsd_vm.html
My home page, where you can download these slides
- http://www.surriel.com/

(page 44)