From: Matthew Dillon <dil...@apollo.backplane.com>
Subject: Review and report of linux kernel VM
Date: 1999/01/13
Message-ID: <199901140720.XAA22609@apollo.backplane.com>
X-Deja-AN: 432449302
Approved: n...@news1.mpcs.com
Sender: n...@news1.mpcs.com
Delivered-To: vmailer-hack...@freebsd.org
X-Gateway: Unidirectional mail2news gateway at MPCS
Newsgroups: mpc.lists.freebsd.hackers,muc.lists.freebsd.hackers
X-Loop: FreeBSD.ORG


				General Overview

    I've been looking at the linux kernel VM - mainly just to see what they've
    changed since I last looked at it.  It's quite interesting... not bad at
    all, though it is definitely a bit more memory-resource-intensive than
    FreeBSD's.  However, it needs a *lot* of work when it comes to freeing 
    up pages. 

    I apologize in advance for any mistakes I've made!

    Basically, the linux kernel uses persistent hardware-level page tables
    in a mostly platform-independent fashion.  The function of the persistent
    page tables is roughly equivalent to the function of FreeBSD's vm_objects.
    That is, the page tables are used to manage sharing and copy-on-write
    functions for VM objects.

    For example, when a process fork()'s, pages are duplicated literally by
    copying pte's.  Writeable MAP_PRIVATE pages are write-protected and marked
    for copy-on-write.  A global resident-page array is used to keep track
    of shared reference counts.  
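
    A minimal sketch of the fork-time pte copy described above.  The pte
    layout, the flag names, and fork_copy_ptes() are all hypothetical and
    only illustrate the idea; this is not Linux's actual code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical pte layout, for illustration only. */
#define PTE_PRESENT  0x1u
#define PTE_WRITE    0x2u
#define PTE_PRIVATE  0x4u   /* stands in for "belongs to a MAP_PRIVATE mapping" */
#define PFN_SHIFT    12

#define NPAGES 8            /* size of the toy physical memory */

/* Toy version of the global resident-page array: one share count per page. */
static unsigned page_refcount[NPAGES];

static uint32_t make_pte(uint32_t pfn, uint32_t flags)
{
    return (pfn << PFN_SHIFT) | flags;
}

static uint32_t pte_pfn(uint32_t pte)
{
    return pte >> PFN_SHIFT;
}

/* Duplicate the parent's ptes into the child, as at fork() time. */
static void fork_copy_ptes(uint32_t *parent, uint32_t *child, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint32_t pte = parent[i];

        if (!(pte & PTE_PRESENT)) {     /* nothing resident: copy as-is */
            child[i] = pte;
            continue;
        }
        if ((pte & PTE_PRIVATE) && (pte & PTE_WRITE)) {
            /* Write-protect both copies; the next write faults and copies. */
            pte &= ~PTE_WRITE;
            parent[i] = pte;
        }
        child[i] = pte;

        assert(pte_pfn(pte) < NPAGES);
        page_refcount[pte_pfn(pte)]++;  /* one more sharer of this page */
    }
}
```

    After the copy, parent and child hold identical read-only ptes for each
    private writable page, and the page's count records both sharers.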

    Swapped-out pages are also represented by pte's and also marked for 
    copy-on-write as appropriate.  The swap block is stored in the PFN 
    area of the pte (as far as I can tell).  The swap system keeps a separate
    shared reference count to manage swap usage.  The overhead is around 
    3 bytes per swap page (whether it is in use or not), and another pte-sized
    (int usually) field when storing the swap block in the pagetable.
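
    To make the above concrete, here is a sketch of a swap block stored in
    a non-present pte, plus the overhead arithmetic.  The bit layout and
    all function names are assumptions made for illustration; only the
    "around 3 bytes per swap page" figure comes from the text above.

```c
#include <stdint.h>

#define PTE_PRESENT 0x1u
#define SWP_SHIFT   1    /* hypothetical: swap block lives above the present bit */

/* Encode a swap block as a non-present pte (present bit left clear).
 * Block 0 is assumed reserved so that an all-zero pte still means "empty". */
static uint32_t swp_to_pte(uint32_t swapblk)
{
    return swapblk << SWP_SHIFT;
}

/* A non-present, non-zero pte refers to a swapped-out page. */
static int pte_is_swapped(uint32_t pte)
{
    return !(pte & PTE_PRESENT) && pte != 0;
}

static uint32_t pte_to_swp(uint32_t pte)
{
    return pte >> SWP_SHIFT;
}

/* Bytes of per-swap-page bookkeeping for a swap area of the given size. */
static uint64_t swap_map_overhead(uint64_t swap_bytes, uint64_t page_size,
                                  uint64_t bytes_per_swap_page)
{
    return (swap_bytes / page_size) * bytes_per_swap_page;
}
```

    Assuming 4 KB pages, 128 MB of swap is 32768 swap pages, so at 3 bytes
    each the bookkeeping comes to 96 KB whether or not the swap is in use.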

    Linux cannot swap out its page tables, mainly due to the direct use of
    the page tables in handling VM object sharing.

    In general terms, linux's VM system is much cleaner than FreeBSD's... and
    I mean a *whole lot* cleaner, but at the cost of eating some extra memory.
    It isn't a whole lot of extra memory - maybe a meg or two for a typical
    system managing a lot of processes, and much less for typical 'small'
    systems.  They are able to completely avoid the vm_object stacking
    (and related complexity) that we do, and they are able to completely
    avoid most of the pmap complexity in FreeBSD as well.

    Linux appears to implement a unified buffer cache.  It's pretty
    straightforward, except that the object relationship is stored in
    the memory-map management structures in each process rather than
    in a vm_object type of structure.

    Linux appears to map all of physical memory into KVM.  This avoids
    FreeBSD's (struct buf) complexity at the cost of not being able to 
    deal with huge-memory configurations.  I'm not 100% sure of this, but
    it's my read of the code until someone corrects me.

				Problems

    Swap allocation is terrible.  Linux uses a linear array which it scans
    looking for a free swap block.  It does a relatively simple swap
    cluster cache, but eats the full linear scan if that fails which can be
    terribly nasty.  The swap clustering algorithm is a piece of crap, 
    too -- once swap becomes fragmented, the linux swapper falls on its face.
    It does read-ahead based on the swapblk which wouldn't be bad if it
    clustered writes by object or didn't have a fragmentation problem.
    As it stands, their read clustering is useless.  Swap deallocation is 
    fast since they are using a simple reference count array. 
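
    The allocation pattern criticized above can be sketched roughly as
    follows.  This is a toy model with made-up names and sizes, not the
    Linux code: a per-block reference count array, a small cluster hint,
    and a fall-back linear scan when the hint runs out.

```c
#include <stddef.h>

#define SWAP_PAGES   16   /* toy swap area size, in blocks */
#define CLUSTER_SIZE 4    /* blocks handed out contiguously before rescanning */

static unsigned char swap_map[SWAP_PAGES]; /* refcount per block, 0 = free */
static size_t cluster_next;                /* next block to try in the cluster */
static size_t cluster_left;                /* blocks left in the current cluster */

/* Returns a swap block index, or -1 if swap is full. */
static long swap_alloc(void)
{
    /* Fast path: keep handing out blocks from the current cluster. */
    while (cluster_left > 0 && cluster_next < SWAP_PAGES) {
        size_t b = cluster_next++;
        cluster_left--;
        if (swap_map[b] == 0) {
            swap_map[b] = 1;
            return (long)b;
        }
    }

    /* Slow path: full linear scan from the start of the array. */
    for (size_t b = 0; b < SWAP_PAGES; b++) {
        if (swap_map[b] == 0) {
            swap_map[b] = 1;
            cluster_next = b + 1;
            cluster_left = CLUSTER_SIZE - 1;
            return (long)b;
        }
    }
    return -1;
}

/* Deallocation is cheap: just drop one reference on the block. */
static void swap_free(size_t b)
{
    if (b < SWAP_PAGES && swap_map[b] > 0)
        swap_map[b]--;
}
```

    Once free blocks are scattered, the cluster hint fails more often and
    each allocation pays for the O(n) scan, while freeing stays O(1).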

    File read-ahead is haphazard at best.

    The paging queues ( determining the age of the page and whether to 
    free or clean it ) need to be rewritten... the algorithms being used
    are terrible.

     * For the nominal page scan, it is using a one-hand clock algorithm.  
       All I can say is:  Oh my god!  Are they nuts?  That was abandoned
       a decade ago.  The priority mechanism they've implemented is nearly
       useless.

     * To locate pages to swap out, it takes a pass through the task list. 
       Ostensibly it locates the task with the largest RSS and then tries
       to swap pages out from it, rather than selecting pages that are not
       in use.
       From my read of the code, it also botches this badly.
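
    For readers who have not met it, a one-hand clock scan looks roughly
    like the sketch below.  This is the textbook form of the algorithm
    with hypothetical names, not a transcription of the Linux scan.

```c
#include <stdbool.h>
#include <stddef.h>

struct toy_page {
    bool referenced;   /* set when the page is touched */
    bool in_use;       /* false for free page frames */
};

/*
 * Advance a single hand around the page array.  A referenced page gets a
 * second chance (its bit is cleared); the first unreferenced in-use page
 * is the victim.  Returns the victim's index, or -1 if none was found.
 */
static long clock_pick_victim(struct toy_page *pages, size_t n, size_t *hand)
{
    /* Two sweeps at most: the first may only clear referenced bits. */
    for (size_t step = 0; step < 2 * n; step++) {
        size_t i = *hand;
        *hand = (*hand + 1) % n;

        if (!pages[i].in_use)
            continue;
        if (pages[i].referenced) {
            pages[i].referenced = false;
            continue;
        }
        return (long)i;
    }
    return -1;
}
```

    With only one hand, the time between clearing a page's bit and testing
    it again is a full sweep of memory, so the "age" it measures gets
    coarser as memory grows.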

    Linux does not appear to do any page coloring whatsoever, but it would
    not be hard to add it in.

    Linux cannot swap-out its page tables or page directories.  Thus, idle
    tasks can eat a significant amount of memory.  This isn't a big deal for
    most systems ( small systems: no problem.  Big systems: probably have lots
    of memory anyway ).  But, mmap()'d files can create a significant burden
    if you have a lot of forked processes ( news, sendmail, web server, 
    etc...).  Not only does Linux have to scan the page tables for all the
    processes mapping the file, whether or not they are actively using the
    page being checked for, but Linux's swapout algorithm scans page tables
    and, effectively, makes redundant scans of shared objects.

			     What FreeBSD can learn

    Well, the main thing is that the Linux VM system is very, very clean
    compared to the FreeBSD implementation.  Cleaning up FreeBSD's VM system
    complexity is what I've been concentrating on and will continue to 
    concentrate on.   However, part of the reason that FreeBSD's VM system 
    is more complex is because it does not use the page tables to store 
    reference information.  Instead, it uses the vm_object and pmap modules.
    I actually like this feature of FreeBSD.  A lot. 

    The biggest thing we need to do to clean up our VM system is, basically,
    to completely rewrite the struct buf filesystem buffering mechanism to
    make it much, much less complex - basically it should only be used as
    placeholders for read and write ops and not used to cache block number
    mappings between the files and the VM system, nor should it be used to
    map pages into KVM.  Separating out these three mechanisms into three
    different subsystems would simplify the code enormously, I think.  For
    example, we could implement a simple vm_object KVM mapping mechanism
    using FreeBSD's existing vm_object stacking model to map portions of a
    vm_object (aka filesystem partition) into KVM.

    Linux demarks interrupts from supervisor code much better than we do.
    If we move some of the more sophisticated operational capabilities
    out of our interrupt subsystem, we could get rid of most of the spl*()
    junk we currently have to do.  This is a real sore spot in current
    FreeBSD code.  Interrupts are just too complex.  I'd also get rid of
    FreeBSD's intermediate 'software interrupt' layer, which is able to
    do even more complex things than hard interrupt code.  The latency
    considerations just don't make any sense versus running pending software
    interrupts synchronously in tsleep(), prior to actually sleeping.  We 
    need to do this anyway ( or move softints to kernel threads ) to be able
    to take advantage of SMP mechanisms.  The *only* thing our interrupts
    should be allowed to do is finish I/O on a page or use zalloc().  
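
    A rough sketch of the "run pending software interrupts synchronously
    before sleeping" idea.  Every name here (softint_schedule(),
    softint_run_pending(), toy_tsleep(), net_softint()) is hypothetical;
    this is not FreeBSD's interrupt code.

```c
#include <stdint.h>

#define NSOFTINT 8

typedef void (*softint_fn)(void);

static softint_fn softint_handlers[NSOFTINT];
static volatile uint32_t softint_pending;   /* one bit per soft interrupt */

/* Called from hard interrupt context: just mark the work as pending. */
static void softint_schedule(unsigned n)
{
    if (n < NSOFTINT)
        softint_pending |= 1u << n;
}

/* Run everything pending; returns the number of handlers invoked. */
static unsigned softint_run_pending(void)
{
    unsigned ran = 0;

    while (softint_pending) {
        for (unsigned n = 0; n < NSOFTINT; n++) {
            uint32_t bit = 1u << n;
            if (softint_pending & bit) {
                softint_pending &= ~bit;
                if (softint_handlers[n]) {
                    softint_handlers[n]();
                    ran++;
                }
            }
        }
    }
    return ran;
}

/* Example handler standing in for, say, network protocol processing. */
static int net_ticks;
static void net_softint(void)
{
    net_ticks++;
}

/* Toy tsleep(): drain soft interrupts first, then (not shown) block. */
static void toy_tsleep(void)
{
    softint_run_pending();
    /* ... the real thing would put the process to sleep here ... */
}
```

    The hard interrupt handler only sets a bit, and the heavier work runs
    in ordinary supervisor context at a well-defined point.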

						-Matt

					Matthew Dillon 
					<dil...@backplane.com>


To Unsubscribe: send mail to majord...@FreeBSD.org
with "unsubscribe freebsd-hackers" in the body of the message

From: Stephen Hocking-Senior Programmer PGS Tensor Perth <shock...@prth.pgs.com>
Subject: Re: Review and report of linux kernel VM
Date: 1999/01/15
Message-ID: <199901150137.JAA22504@ariadne.tensor.pgs.com>#1/1
X-Deja-AN: 432773393
Approved: n...@news1.mpcs.com
Sender: n...@news1.mpcs.com
Delivered-To: vmailer-hack...@freebsd.org
Content-Type: text/plain; charset=us-ascii
X-Gateway: Unidirectional mail2news gateway at MPCS
Mime-Version: 1.0
Newsgroups: mpc.lists.freebsd.hackers,muc.lists.freebsd.hackers
X-Loop: FreeBSD.ORG



Looks good, fair & accurate. Now, do you have the cojones to post it to 
linux-kernel? 8^)


	Stephen
-- 
  The views expressed above are not those of PGS Tensor.

    "We've heard that a million monkeys at a million keyboards could produce
     the Complete Works of Shakespeare; now, thanks to the Internet, we know
     this is not true."            Robert Wilensky, University of California




From: "Daniel C. Sobral" <d...@newsguy.com>
Subject: Re: Review and report of linux kernel VM
Date: 1999/01/15
Message-ID: <369E9EF6.179F1A84@newsguy.com>#1/1
X-Deja-AN: 432773394
Approved: n...@news1.mpcs.com
Sender: n...@news1.mpcs.com
Content-Transfer-Encoding: 7bit
References: <199901150137.JAA22504@ariadne.tensor.pgs.com>
Delivered-To: vmailer-hack...@freebsd.org
X-Accept-Language: pt-BR,ja
Content-Type: text/plain; charset=us-ascii
X-Gateway: Unidirectional mail2news gateway at MPCS
MIME-Version: 1.0
Newsgroups: mpc.lists.freebsd.hackers,muc.lists.freebsd.hackers
X-Loop: FreeBSD.ORG


Stephen Hocking-Senior Programmer PGS Tensor Perth wrote:
> 
> Looks good, fair & accurate. Now, do you have the cojones to post it to
> linux-kernel? 8^)

I think Dillon actually worked on Linux VM before seeing the light,
isn't that so?

--
Daniel C. Sobral			(8-DCS)
d...@newsguy.com

	If you sell your soul to the Devil and all you get is an MCSE from
it, you haven't gotten market rate.


From: Matthew Dillon <dil...@apollo.backplane.com>
Subject: Re: Review and report of linux kernel VM
Date: 1999/01/14
Message-ID: <199901150356.TAA28463@apollo.backplane.com>#1/1
X-Deja-AN: 432809071
Approved: n...@news1.mpcs.com
Sender: n...@news1.mpcs.com
Delivered-To: vmailer-hack...@freebsd.org
X-Gateway: Unidirectional mail2news gateway at MPCS
Newsgroups: mpc.lists.freebsd.hackers,muc.lists.freebsd.hackers
X-Loop: FreeBSD.ORG


:Stephen Hocking-Senior Programmer PGS Tensor Perth wrote:
:> 
:> Looks good, fair & accurate. Now, do you have the cojones to post it to
:> linux-kernel? 8^)
:
:I think Dillon actually worked on Linux VM before seeing the light,
:isn't that so?
:
:--
:Daniel C. Sobral			(8-DCS)
:d...@newsguy.com
:
:	If you sell your soul to the Devil and all you get is an MCSE from
:it, you haven't gotten market rate.

    I did some minor work on the linux TCP stack about 3 iterations ago.
    Moving to FreeBSD was more an interest issue than a seeing-the-light
    issue.  I also wrote dcron which was used in Linux for a bit, a long 
    time ago.

    It should also be noted that I also did some work with the 4.2 
    and 4.3 kernels at UCB, long before either FreeBSD or Linux came
    on the scene and that is another reason why I moved over to FreeBSD.
    ( I changed the serial driver on the Perkin Elmer we had to use
    microcoded DMA rather than discrete serial interrupts.  Unfortunately,
    the microcode was still subject to interrupt disablement so the
    improvement to the serial subsystem was only moderate ).

    I do feel that FreeBSD works better as an ISP platform, though,
    which is one reason why BEST went with FreeBSD rather than Linux.

					-Matt

					Matthew Dillon 
					<dil...@backplane.com>


From: Ville-Pertti Keinonen <w...@iki.fi>
Subject: Re: Review and report of linux kernel VM
Date: 1999/01/22
Message-ID: <8690evpkc4.fsf@not.oeno.com>#1/1
X-Deja-AN: 435633909
Approved: n...@news1.mpcs.com
Sender: n...@news1.mpcs.com
References: <199901140720.XAA22609@apollo.backplane.com>
Delivered-To: vmailer-hack...@freebsd.org
X-Gateway: Unidirectional mail2news gateway at MPCS
Newsgroups: mpc.lists.freebsd.hackers,muc.lists.freebsd.hackers
X-Loop: FreeBSD.ORG



dil...@apollo.backplane.com (Matthew Dillon) writes:

>     In general terms, linux's VM system is much cleaner then FreeBSD's... and
>     I mean a *whole lot* cleaner, but at the cost of eating some extra memory.

Whaat?

You appear to be confusing cleanliness (as I understand it, and I'm
afraid that many other readers of your review might understand it)
with simplicity.

I would claim the exact opposite.  The Linux VM system is simpler, but
far *less* clean because of the very inflexible (almost non-existent)
"layers".  Not to mention the code, which shares the (IMHO) poor
source organization and apparently arbitrary dependencies of Linux as
a whole.

The Linux VM system looks like it hasn't been designed at all, just
implemented.  The basic organization (although not that much of the
code) is still based on versions of Linux that only had one page table
for all tasks, was fully i386-specific (complete with hard-coded
constants all over the place), implemented shared libraries by having
a per-task array of shared library addresses, sharing pages by
scanning through all tasks to find one with the same library etc.
Subsystems in Linux don't seem to get re-written, they only evolve.

Perhaps my critique addresses the Linux code more than the
functionality...but they are related.

What's my definition of clean then?  For example, common operations
shouldn't need to resort to brute-forceish approaches (not in many
cases, anyhow).  Which reminds me, your swp_pager_meta_free_all looks
a bit frightening...do you intend to keep it like it is?

> 				Problems

 ...

In addition to the problems you stated, as far as I can tell, swap
backing is not shared for copy-on-write associations (copy-on-write
pages get swapped out multiple times, all but the last don't free any
memory) unless the page was swapped out when the maps were copied, in
which case it ends up copy-on-access...maybe, I'm not sure whether the
swap cache eliminates this.

This (and many of the things you pointed out) is due to the simplistic
approach where pages don't really have an identity (only mappings)
unless they are backed by an inode.  Which is perhaps at the core of
most of the algorithmic differences between Mach/4.4BSD and Linux VM
systems.

IMHO pages need to have an identity even when they are not associated
with files (based on a quick glance, NetBSD's UVM seems to retain
this property while optimizing the management of anonymous pages.  I'm
not convinced in terms of the choice of data structures for the anon
maps in UVM, though).

>     reference information.  Instead, it uses the vm_object and pmap modules.
>     I actually like this feature of FreeBSD.  A lot. 

Additionally, the way FreeBSD does things has better potential for
concurrency (even though the locks have been ripped out) compared to
Linux.

>     Linux demarks interrupts from supervisor code much better then we do.

You seem to consider simpler to mean cleaner/better.  Although in this
case, I'd agree that much of the complexity of FreeBSD is unnecessary.


From: dil...@apollo.backplane.com (Matthew Dillon)
Subject: Re: Review and report of linux kernel VM
Date: 1999/01/22
Message-ID: <199901222036.MAA56617@apollo.backplane.com>
X-Deja-AN: 435757119
Approved: n...@camelot.de
References: <199901140720.XAA22609@apollo.backplane.com>
X-Complaints-To: abuse@camelot.de
X-Trace: lancelot.camelot.de 917037589 24147 195.30.224.3 (22 Jan 1999 20:39:49 GMT)
Organization: Mail2News Gateway at CameloT Online Services
NNTP-Posting-Date: 22 Jan 1999 20:39:49 GMT
Newsgroups: muc.lists.freebsd.hackers,mpc.lists.freebsd.hackers

:Whaat?
:
:You appear to be confusing cleanliness (as I understand it, and I'm
:afraid that many other readers of your review might understand it)
:with simplicity.
:
:I would claim the exact opposite.  The Linux VM system is simpler, but
:far *less* clean because of the very inflexible (almost non-existent)
:"layers".  Not to mention the code, which shares the (IMHO) poor
:source organization and apparently arbitrary dependencies of Linux as
:a whole.

    The Linux VM system implements all the core features that the 
    FreeBSD VM system implements, just not as efficiently.  Its 
    use of a page table paradigm to do VM-specific object layering
    is really not that bad of an idea.  It *does* lock them into a more
    rigid scheme ( for example, the linux scheme starts to break down
    when you share huge objects between processes ), but so far they've been
    able to implement the same core feature set that we have in our VM system.
    Thus, it is not possible to argue that their system is inferior from an 
    algorithmic standpoint, only from an implementation standpoint and a
    flexibility standpoint.

    We can hardly be proud of our VFS/BIO layering which has been so buggy
    these last few years.  The types of bugs I'm finding in FreeBSD have
    nothing to do with the algorithms and everything to do with the code
    being uncommented and virtually unreadable due to the hundreds of 
    badly thought out optimizations and other hacks that have obscured the
    core implementation.

    When I say clean, I mean 'readable, obvious, and functionally layered'.
    I had no trouble following the linux code even going deep into the paging
    and VFS subsystems.  Following FreeBSD code has been like pulling nails.
    It's why we are *still* finding bugs in our VM system, after years of work.

    FreeBSD's VM system is definitely more flexible and efficient.  Given the
    choice, I would much rather keep FreeBSD's VM system.  That flexibility
    has come at the cost of dirtying up the code considerably, though.   What
    use is flexibility if every new feature brings half a dozen bugs to light
    and creates half a dozen more of its own? 

    My current work is to keep the flexibility while cleaning up the code.  
    If we can clean up the code, we will have a clean, flexible, AND kickass
    VM system rather than simply a kickass VM system.

:
:What's my definition of clean then?  For example, common operations
:shouldn't need to resort to brute-forceish approaches (not in many
:cases, anyhow).  Which reminds me, your swp_pager_meta_free_all looks
:a bit frightening...do you intend to keep it like it is?

    'efficiency'.  As I stated, Linux's VM code is not terribly efficient.
    I would disagree with the 'brute force' line, though.  They've stuck
    to their guns pretty well and the core concepts are sound.  Linux simply 
    has not had the long operational history that BSD has and they are having
    to relearn many of the same lessons.  

    It should be noted that linux can still implement inode-based object
    layering underneath their existing VM system.  Their direct use of
    pagetables for bookkeeping does not prevent that.

:In addition to the problems you stated, as far as I can tell, swap
:backing is not shared for copy-on-write associations (copy-on-write
:pages get swapped out multiple times, all but the last don't free any
:memory) unless the page was swapped out when the maps were copied, in
:which case it ends up copy-on-access...maybe, I'm not sure whether the
:swap cache eliminates this.
:
:This (and many of the things you pointed out) is due to the simplistic
:approach where pages don't really have an identity (only mappings)
:unless they are backed by an inode.  Which is perhaps at the core of
:most of the algorithmic differences between Mach/4.4BSD and Linux VM
:systems.
:
:IMHO pages need to have an identity even when they are not associated
:with files (based on a quick glance, NetBSD's UVM seems to retain
:this property while optimizing the management of anonymous pages.  I'm
:not convinced in terms of the choice of data structures for the anon
:maps in UVM, though).

    Pages under linux *DO* have an identity, but you have to look it up
    in the meta objects backing the page tables based on the position of the
    page in the page table.  They do not implement swap as a paging layer as
    we do, but then again our implementation of swap as a paging layer is
    a mostly degenerate case in our vm_object layering system so it amounts
    to pretty much the same thing.

    I don't think COW pages get swapped multiple times, but I could be wrong.
    My read is that when a linux process forks, the swap block associations are
    shared even for COW pages.  The COWed pages are marked read-only and 
    split if a write fault occurs.  Unless it's writing the same shared
    page from different processes to the same swap block over and over again,
    that is.  It shouldn't have to - I was under the impression that the 
    swap had a bunch of per-swap-block flags to keep track of the clean/dirty
    state, so once one process swaps out a page, the others may scan it but
    will not redundantly swap it out.

:>     reference information.  Instead, it uses the vm_object and pmap modules.
:>     I actually like this feature of FreeBSD.  A lot. 
:
:Additionally, the way FreeBSD does things has better potential for
:concurrency (even though the locks have been ripped out) compared to
:Linux.

    I disagree.  FreeBSD still must hold locks through pmap changes and those
    scan all related processes, just as linux does.  The difference is that
    since FreeBSD can delete page tables, it generally winds up scanning many
    FEWER processes to change the pmap state for a page than linux.  Linux
    must scan/adjust the pmap state for every process mmap()ing the page
    whether or not it is using the page.
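
    The difference can be sketched with a toy model in which a process
    whose page table has been thrown away is skipped outright.  The
    structure and function names are hypothetical, and real pmap code
    is considerably more involved.

```c
#include <stddef.h>
#include <stdint.h>

#define NPTES 4   /* ptes per toy page table */

struct toy_pmap {
    uint32_t *ptes;   /* NULL when the page table has been discarded */
};

/*
 * Remove every mapping equal to `pte` from the given pmaps.  Returns how
 * many page tables actually had to be walked.
 */
static size_t remove_all_mappings(struct toy_pmap *pmaps, size_t npmaps,
                                  uint32_t pte)
{
    size_t walked = 0;

    for (size_t p = 0; p < npmaps; p++) {
        if (pmaps[p].ptes == NULL)
            continue;               /* no page table, nothing to scan */
        walked++;
        for (size_t i = 0; i < NPTES; i++) {
            if (pmaps[p].ptes[i] == pte)
                pmaps[p].ptes[i] = 0;
        }
    }
    return walked;
}
```

    If page tables can never be discarded, every entry in pmaps[] has a
    non-NULL table and walked always equals npmaps, which is the linux
    case described above.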

:>     Linux demarks interrupts from supervisor code much better then we do.
:
:You seem to consider simpler to mean cleaner/better.  Although in this
:case, I'd agree that much of the complexity of FreeBSD is unnecessary.
:

    My philosophy is, in general, that (1) one must separate the algorithm
    from the implementation and that (2) any algorithm can be cleanly 
    implemented.  If it isn't, it should be rewritten.  If the programmer
    can't reimplement it, either the programmer is unworthy of the algorithm
    or the programmer isn't experienced enough to do it right, or the
    algorithm is bad.  There is no middle ground in my world view.

    					-Matt
					Matthew Dillon 
					<dil...@backplane.com>

