From: Jean Francois Martinez <j...@sidney.remcomp.fr>
Subject: Bugs and wishes in memory management area
Date: 1996/11/19
Message-ID: <199611192039.VAA00436@sidney.remcomp.fr>#1/1
X-Deja-AN: 197500883
sender: owner-linux-ker...@vger.rutgers.edu
x-hdr-sender: j...@sidney.remcomp.fr
x-env-sender: owner-linux-kernel-outgo...@vger.rutgers.edu
newsgroups: linux.dev.kernel





Well the bug is about some modules failing to load with "unable to get
DMA memory".  That makes modules unreliable.  Of course I don't see
why modules are apparently using GFP_ATOMIC | GFP_DMA.  When loading a
module I think we can swap.

But GFP_ATOMIC allocation could be improved because we could discard
unmodified pages I think.  Correct me if I am wrong.
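
[As an illustration of the difference being pointed at here -- a sketch
only, with a hypothetical driver buffer; which priority lets the allocator
sleep is the whole point:]

    /* What such modules apparently do today: the allocation must not
     * sleep, so it simply fails if no free DMA-able page is lying around. */
    buf = kmalloc(BUF_SIZE, GFP_ATOMIC | GFP_DMA);

    /* What module loading could afford to do instead: allow the allocator
     * to sleep, swap, or discard clean pages to make room. */
    buf = kmalloc(BUF_SIZE, GFP_KERNEL | GFP_DMA);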


Wishes for memory management in 2.1

First: Real swapping.  When two processes are fighting for memory and
memory is tight, the only way to make progress is to freeze one of them
and let the other run unimpeded for a few seconds.

Second: Linux does not write to swap pages that have never been
modified, like code pages.  This way Linux does not lose time writing
them.  But because reading from swap is faster, I think that when a page
is being recalled for the umpteenth time it would be wise to write it to
swap.  We would get it back faster next time.  However it seems it would
greatly complicate the code, so perhaps it is not a good idea.  (Of
course this does not apply to pages we are discarding to make a hole
for DMA.)

-- 

			Jean Francois Martinez

From: "David S. Miller" <da...@jenolan.rutgers.edu>
Subject: Re: Bugs and wishes in memory management area
Date: 1996/11/20
Message-ID: <199611200538.AAA05284@jenolan.caipgeneral>#1/1
X-Deja-AN: 197587531
sender: owner-linux-ker...@vger.rutgers.edu
x-hdr-sender: da...@jenolan.rutgers.edu
references: <199611192039.VAA00436@sidney.remcomp.fr>
x-env-sender: owner-linux-kernel-outgo...@vger.rutgers.edu
newsgroups: linux.dev.kernel



   Date: 	Tue, 19 Nov 1996 21:39:19 +0100
   From: Jean Francois Martinez <j...@sidney.remcomp.fr>

   Well the bug is about some modules failing to load with "unable to get
   DMA memory".  That makes modules unreliable.  Of course I don't see
   why modules are apparently using GFP_ATOMIC | GFP_DMA.  When loading a
   module I think we can swap.

No, you cannot page the memory a module uses.  I leave the reader to
consider the monstrous races associated with a kernel that can page
itself.

   But GFP_ATOMIC allocation could be improved because we could discard
   unmodified pages I think.  Correct me if I am wrong.

It cannot do this reliably, because there is always the danger of
sleeping for some reason.

   First: Real swapping.  When two processes are fighting for memory and
   memory is tight, the only way to make progress is to freeze one of them
   and let the other run unimpeded for a few seconds.

Consider the live-lock situations that could lead to.

   Second: Linux does not write to swap pages that have never been
   modified, like code pages.  This way Linux does not lose time writing
   them.  But because reading from swap is faster, I think that when a page
   is being recalled for the umpteenth time it would be wise to write it to
   swap.  We would get it back faster next time.  However it seems it would
   greatly complicate the code, so perhaps it is not a good idea.  (Of
   course this does not apply to pages we are discarding to make a hole
   for DMA.)

We do this if a page is unmodified and we have an up-to-date copy on
either swap or the binary image from which it was executed.  Code
pages are modified!  The dynamic linker, when it performs relocations
on the program text to use shared libraries, modifies and dirties
them.  So they cannot just be dropped, they must be paged to disk.

--------------------------------------------////
Yow! 10.49 MB/s remote host TCP bandwidth  ////
over 100Mb/s ethernet.  Beat that!        ////
-----------------------------------------////__________  o
David S. Miller, da...@caip.rutgers.edu /_____________/ / // /_/ ><

From: Keith Owens <k...@edison.dialix.com.au>
Subject: Pageable kernel (was Bugs and wishes in memory management area)
Date: 1996/11/20
Message-ID: <Pine.BSF.3.95.961120170907.12479A-100000@edison.dialix.com.au>#1/1
X-Deja-AN: 197596528
sender: owner-linux-ker...@vger.rutgers.edu
references: <199611200538.AAA05284@jenolan.caipgeneral>
content-type: TEXT/PLAIN; charset=US-ASCII
x-hdr-sender: k...@edison.dialix.com.au
mime-version: 1.0
x-env-sender: owner-linux-kernel-outgo...@vger.rutgers.edu
newsgroups: linux.dev.kernel



On Wed, 20 Nov 1996, David S. Miller wrote:
> No, you cannot page the memory a module uses.  I leave the reader to
> consider the monstrous races associated with a kernel that can page
> itself.

It is not impossible in general but the specific design of the Linux
kernel makes it very difficult.  For example, MVS can page most of the O/S
pages.  The MVS nucleus ("kernel") is split into pageable and non-pageable
(fixed) areas; the former even use the MMU to map nucleus addresses to
physical addresses, and the physical address changes every time they are
paged in.

Fixed areas are the paging code itself, the common page tables, data that
is the subject of current I/O and various bits of time critical code that
cannot tolerate the extra time to page-fault into physical memory or run
with interrupts disabled.  Even the page tables for inactive address
spaces ("processes") are paged out. 

Not that I'm suggesting redesigning the Linux kernel this week to make it
pageable.  However it may be worth thinking about as a long term project.
Finer-grained kernel locking is a prerequisite, replacing the current global
kernel lock.
---
*** O . C. Software -- Consulting Services for Storage Management, ***
*** Disaster Recovery, Security, TCP/IP, Intranets and Internet    ***
*** for mainframes (IBM, Fujitsu, Hitachi) and Unix.               ***

From: "David S. Miller" <da...@jenolan.rutgers.edu>
Subject: Re: Pageable kernel (was Bugs and wishes in memory management area)
Date: 1996/11/20
Message-ID: <199611200628.BAA05367@jenolan.caipgeneral>#1/1
X-Deja-AN: 197716355
sender: owner-linux-ker...@vger.rutgers.edu
x-hdr-sender: da...@jenolan.rutgers.edu
references: <Pine.BSF.3.95.961120170907.12479A-100000@edison.dialix.com.au>
x-env-sender: owner-linux-kernel-outgo...@vger.rutgers.edu
newsgroups: linux.dev.kernel



   Date: Wed, 20 Nov 1996 17:23:07 +1100 (EST)
   From: Keith Owens <k...@edison.dialix.com.au>

   On Wed, 20 Nov 1996, David S. Miller wrote:
   > No, you cannot page the memory a module uses.  I leave the reader to
   > consider the monstrous races associated with a kernel that can page
   > itself.

   It is not impossible in general but the specific design of the Linux
   kernel makes it very difficult.  For example, MVS can page most of the O/S
   pages.  The MVS nucleus ("kernel") is split into pageable and non-pageable
   (fixed) areas; the former even use the MMU to map nucleus addresses to
   physical addresses, and the physical address changes every time they are
   paged in.

Just because it is possible doesn't mean it is not a stupid idea.

I think linux will become pageable as soon as it also begins to do
system calls via RPC...

Oh, and on fine grained locking, another example of how to "get there"
but again a stupid way to do it.  The answer is not spin locks all
over the kernel, every textbook mutual exclusion primitive known to
man, and whatnot, to get SMP.  I leave such monstrosities to Solaris
and others of its kin, some of which need programs to scan their code
for race conditions.

I quit the day we need that sort of garbage for Linux.

The answer is with self locking data structures and careful coding.  A
well thought out design, and not something meant to meet a deadline.

--------------------------------------------////
Yow! 10.49 MB/s remote host TCP bandwidth  ////
over 100Mb/s ethernet.  Beat that!        ////
-----------------------------------------////__________  o
David S. Miller, da...@caip.rutgers.edu /_____________/ / // /_/ ><

From: Jean Francois Martinez <j...@sidney.remcomp.fr>
Subject: Re: Bugs and wishes in memory management area
Date: 1996/11/20
Message-ID: <199611202158.WAA01515@sidney.remcomp.fr>#1/1
X-Deja-AN: 197827432
sender: owner-linux-ker...@vger.rutgers.edu
x-hdr-sender: j...@sidney.remcomp.fr
references: <m0vPyaH-0005KgC@lightning.swansea.linux.org.uk>
x-env-sender: owner-linux-kernel-outgo...@vger.rutgers.edu
newsgroups: linux.dev.kernel



   From: a...@lxorguk.ukuu.org.uk (Alan Cox)
   Date: Tue, 19 Nov 1996 22:24:00 +0000 (GMT)
   Cc: linux-ker...@vger.rutgers.edu
   Content-Type: text

   > DMA memory".  That makes modules unreliable.  Of course I don't see
   > why modules are apparently using GFP_ATOMIC | GFP_DMA.  When loading a
   > module I think we can swap.

   GFP_DMA implies GFP_ATOMIC (rightly or wrongly)

When saying we can swap I was of course referring to stealing memory
from user processes, not from modules.  (MVS is pageable, but that is
because it is so big that not even IBM mainframes can hold it.)

About modules I still think that we do not need GFP_ATOMIC when
loading them.  We need contiguous memory, and we can get that either
by swapping or at least by discarding unmodified pages from user
processes.  Swapping is less prone to failing to find enough DMAable
memory, but it is too big a change to put into 2.0.  And I don't
think having unreliable modules can wait until 2.2.

   > Wishes for memory management in 2.1
   > 
   > First: Real swapping.  When two processes are fighting for memory and
   > memory is tight, the only way to make progress is to freeze one of them
   > and let the other run unimpeded for a few seconds.

   Questionable. The shared set of pages in most systems is a lot of the
   application. Also the 'less regularly used' algorithm behaves far more
   stably with heavy loads. I can see a case for trying to pick more on
   processes that have run a lot but the whole VM/scheduling/IO interaction
   is an incredibly complex and subtle beast. Repeating the formulae of the
   early 1980's isn't always progress.

No.  First, I am not talking about swapping out a whole process like BSD
does.  Just freezing a process (of course the running one will steal a
big chunk of memory from it).

Second: You have two programs using big arrays or linked lists.
Each one needs 6 megs of memory to be able to run for 5 ms without
faulting.  A page fault will take 20 ms.  You only have 8 megs of RAM.
I had a situation like this; I used CTRL-Z on one of the processes.

Third: In its advertisements about FreeBSD, Infomagic claims
(implicitly) that FreeBSD can handle heavy loads far better than
Linux.  BSD swaps.  I would really like Linux to erase that stupid
superiority smile from the BSDers.

Compared to proprietary systems we can claim that Linux and a lot of
RAM is cheaper and faster than a proprietary OS with less RAM.  That
claim does not hold when comparing to BSD.

-- 

			Jean Francois Martinez

From: m...@toaster.roan.co.uk (Mike Jagdis)
Subject: Re: Bugs and wishes in memory management area
Date: 1996/11/21
Message-ID: <slrn5988u1.6tu.mike@toaster.roan.co.uk>#1/1
X-Deja-AN: 197827802
x-nntp-posting-host: whthom.demon.co.uk
references: <199611192039.VAA00436@sidney.remcomp.fr>
x-submitted-via: n...@ratatosk.yggdrasil.com (linux.* gateway)
x-hdr-sender: m...@toaster.roan.co.uk 
x-env-sender: m...@toaster.roan.co.uk
newsgroups: linux.dev.kernel


>But GFP_ATOMIC allocation could be improved because we could discard
>unmodified pages I think.  Correct me if I am wrong.

Yes, try_to_free_page almost, but not quite, handles no-wait requests
suitably (the IPC shrinking ignores no-wait, at least). There are one
or two other stunts we can pull to make DMA requests for multiple
pages more reliable.

  I've been working on page_alloc.c and kmalloc.c for a while now
but I got sidetracked with the Cyrix stuff, which turned out to
be more involved than I thought :-(.

				Mike

-- 
.----------------------------------------------------------------------.
|  Mike Jagdis                  |  Internet:  mailto:m...@roan.co.uk   |
|  Roan Technology Ltd.         |                                      |
|  54A Peach Street, Wokingham  |  Telephone:  +44 118 989 0403        |
|  RG40 1XG, ENGLAND            |  Fax:        +44 118 989 1195        |
`----------------------------------------------------------------------'

From: Rob Riggs <rri...@tesser.com>
Subject: Re: Bugs and wishes in memory management area
Date: 1996/11/21
Message-ID: <XFMail.961121024422.rriggs@tesser.com>#1/1
X-Deja-AN: 198026129
sender: owner-linux-ker...@vger.rutgers.edu
references: <m0vPyaH-0005KgC@lightning.swansea.linux.org.uk>
content-type: text/plain; charset=us-ascii
x-hdr-sender: rri...@tesser.com
mime-version: 1.0
x-env-sender: owner-linux-kernel-outgo...@vger.rutgers.edu
newsgroups: linux.dev.kernel




On 20-Nov-96 a...@lxorguk.ukuu.org.uk wrote:
>>
>> DMA memory".  That makes modules unreliable.  Of course I don't see
>> why modules are apparently using GFP_ATOMIC | GFP_DMA.  When loading a
>> module I think we can swap.
>
>GFP_DMA implies GFP_ATOMIC (rightly or wrongly)

From my reading of the source and headers it seems that GFP_DMA implies
GFP_BUFFER, not GFP_ATOMIC...

From mm.h: 

> #define GFP_BUFFER      0x00
> #define GFP_ATOMIC      0x01
> #define GFP_USER        0x02
> #define GFP_KERNEL      0x03
> #define GFP_NOBUFFER    0x04
> #define GFP_NFS         0x05
> 
> /* Flag - indicates that the buffer will be suitable for DMA.
>    Ignored on some platforms, used as appropriate on others */
> 
> #define GFP_DMA         0x80
> 
> #define GFP_LEVEL_MASK 0xf

From kmalloc.c:

>         if (priority & GFP_DMA) {
>                 dma = 1;
>                 type = MF_DMA;
>                 pg = &bucket->dmafree;
>         }
> 
>         priority &= GFP_LEVEL_MASK;

This would seem to imply that one can do:

	ptr = kmalloc(size, GFP_DMA | GFP_KERNEL);

The comments in mm.h clearly state that "GFP_DMA" is a flag, not an
actual priority level. However GFP_DMA & GFP_LEVEL_MASK = GFP_BUFFER.
And GFP_BUFFER does not try very hard to free pages.
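
Spelling out the arithmetic behind that claim, using the values quoted
above from mm.h:

	priority = GFP_DMA;            /* 0x80 */
	priority &= GFP_LEVEL_MASK;    /* 0x80 & 0x0f == 0x00 == GFP_BUFFER */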

I think the original poster is correct in that we could try harder
to free pages if DMAable RAM is not immediately available, even
by paging if necessary.  Can modules sleep during initialization?
If they can't, then the affected modules may need to defer allocation
of their DMA buffers until they can sleep.
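
[For illustration, a sketch of that last suggestion.  The driver, its
names (FOO_MAJOR, foo_fops, foo_buf) and sizes are hypothetical, not
taken from any real module:]

    /* init_module: don't grab the DMA buffer yet, just register. */
    int init_module(void)
    {
            return register_chrdev(FOO_MAJOR, "foo", &foo_fops);
    }

    /* foo_open: runs in process context and may sleep, so the DMA
     * buffer can be allocated with a non-atomic priority here. */
    static int foo_open(struct inode *inode, struct file *file)
    {
            if (!foo_buf)
                    foo_buf = kmalloc(FOO_BUF_SIZE, GFP_KERNEL | GFP_DMA);
            return foo_buf ? 0 : -ENOMEM;
    }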

Rob
(rri...@tesser.com)

From: Jean Francois Martinez <j...@sidney.remcomp.fr>
Subject: Re: Bugs and wishes in memory management area
Date: 1996/11/22
Message-ID: <199611222047.VAA01260@sidney.remcomp.fr>#1/1
X-Deja-AN: 198276088
sender: owner-linux-ker...@vger.rutgers.edu
x-hdr-sender: j...@sidney.remcomp.fr
references: <Pine.LNX.3.91.961122002240.195B-100000@localhost>
x-env-sender: owner-linux-kernel-outgo...@vger.rutgers.edu
newsgroups: linux.dev.kernel



   Date: Fri, 22 Nov 1996 01:07:29 +0000 (GMT)
   From: Gerard Roudier <groud...@club-internet.fr>
   X-Sender: groudier@localhost
   cc: Rob Riggs <rri...@tesser.com>, Alan Cox <a...@lxorguk.ukuu.org.uk>,
	   Jean Francois Martinez <j...@sidney.remcomp.fr>,
	   linux-ker...@vger.rutgers.edu
   MIME-Version: 1.0
   Content-Type: TEXT/PLAIN; charset=US-ASCII


   On Thu, 21 Nov 1996, Jon Tombs wrote:

   > 
   > Rob Riggs said:
   > 
   > > This would seem to imply that one can do:
   > > 
   > > 	ptr = kmalloc(size, GFP_DMA | GFP_KERNEL);
   > > 
   > > The comments in mm.h clearly state that "GFP_DMA" is a flag, not an
   > > actual priority level. However GFP_DMA & GFP_LEVEL_MASK = GFP_BUFFER.
   > > And GFP_BUFFER does not try very hard to free pages.
   > 
   > Yes you can do it, but it doesn't help. The problem is in get_free_pages, it

   1) Opinion(s):
    Doing DMA through the ISA bus is the silliest thing you can do with a 
    PCI system.
    Complexifying the kernel in order to use PCI systems as outdated ones 
    IsSillyAnyway(TM).
    16MB of memory is enough for systems that do not use an EISA or PCI bus.

The problem is all x86 machines have ISA connectors, so we will have to
support them, like it or not.  And there are people using older
machines.  Besides, if you don't use DMA how do you transfer data in a
multitasking OS?

   > returns when there is enough free pages, or if called with the ATOMIC flag.

   2) Agreement:
    Enough to make it happy, but sometimes not enough to satisfy the 
    request.

   > But it doesn't check if the free pages are actually suitable for the DMA
   > request. You can fix it easily with an extra if() that avoids returning NULL
   > when called (GFP_DMA | GFP_KERNEL), but this just results in randomly freeing
   > pages until success. 
   > What is really needed is an algorithm to pick the pages, or you will swap
   > several MB in order to create a single 128k DMA buffer (I know, I've
   > tried...).

   3) Irony:
    No need to try it. Can be guessed easily.

   4) Suggestions:
    A suggestion for >16MB systems is to waste, if necessary, up to 1MB 
    in a pool for ISA/DMA, but to keep the kernel simple.
    Kernel modules/drivers that allocate ISA/DMA memory when it is not 
    needed are to be reworked (or rewritten).

   5) Greetings:

   Gerard.

Interesting idea.  But there are still boxes with less than 16 meg.
Ah, the good old days when Linux was usable with 2 megs and was
advertised as being small and mean.

-- 

			Jean Francois Martinez

-What is the operating system used by Bill Gates?
-Linux of course.  Do you think stupid people become millionaires?

From: b...@stat.wisc.edu (Kevin Buhr)
Subject: Please don't beat me up 
(was Re: Bugs and wishes in memory management area)
Date: 1996/11/26
Message-ID: <vba7mn8bu70.fsf@mozart.stat.wisc.edu>#1/1
X-Deja-AN: 200888738
sender: owner-linux-ker...@vger.rutgers.edu
x-hdr-sender: b...@stat.wisc.edu 
x-env-sender: owner-linux-kernel-outgo...@vger.rutgers.edu
newsgroups: linux.dev.kernel



An amusing anecdote:

One day, right out of the blue, my poor little 8 meg machine went
loco.  It began generating reams and reams of "Couldn't get a free
page" messages.  I was away from the console, and it churned madly
away for several hours before I was able to power cycle it.

"Fortunately", I'd added the priority and size to the "Couldn't get a
free page" message in my kernel (2.0.13 vintage, I believe), and I
immediately realized that I was seeing request after request for a
2-page block at GFP_NFS priority.  Eventually, I traced it back to
this culprit in "fs/nfs/proc.c":

static inline int *nfs_rpc_alloc(int size)
{
	int *i;

	while (!(i = (int *)kmalloc(size+NFS_SLACK_SPACE,GFP_NFS))) {
		schedule();
	}
	return i;
}

Get it?  *All* my runnable processes wanted a 2-page block of memory
for nefarious NFS-read-related purposes, and the kernel was viciously
failing to grant any of them.  Why, you ask?  Well, observe the
following from the "mm/page_alloc.c" code:

	if ((priority==GFP_ATOMIC) || nr_free_pages > reserved_pages) {
		RMQUEUE(order, dma);
		restore_flags(flags);
		return 0;
	}
	restore_flags(flags);
	if (priority != GFP_BUFFER && try_to_free_page(priority, dma, 1))
		goto repeat;
	return 0;

Imagine that kernel memory is sufficiently fragmented that, even
though there are lots of pages available (say more than
"free_pages_high" which is 32 on my little box), there are no 2-page
blocks around.  Note that, provided "nr_free_pages" is large enough,
we never even *get* to "try_to_free_page".  Every multipage memory
request will be flatly refused until the memory becomes magically
"defragmented", which isn't always likely to happen, particularly if
"kswapd" doesn't run.

Since that horrible experience, I've hacked up my kernel so that
"kmalloc" retries multipage, non-GFP_ATOMIC, non-GFP_BUFFER requests
(that can't even be satisfied by the "kmalloc" cache and so would
otherwise result in "Couldn't get a free page" messages) at a magical
"GFP_DEFRAG" priority that will "try_to_free_page" anyway, no matter
how many "nr_free_pages" there may be.

My hack works like a dream: the only "Couldn't get a free page"
messages I get now are at GFP_ATOMIC priority (though, disturbingly
enough, they are multipage requests for 4388 bytes---anyone know what
these are?), and instead I get dozens of "Free list fragmented"
messages associated with 2-page NFS requests whenever the going gets
tough.  On the other hand, I'm well aware that I've merely traded one
race condition for another: for one thing, "try_to_free_page" produces
blocks of consecutive free pages by accident, not on purpose.

			*	*	*

I remember a long time ago, some brave soul was criticizing the
allocation buddy-system; he or she wanted to completely replace it
with a kernel page-table that could produce multipage blocks
automagically, thus eliminating the scourge of memory fragmentation
forever.

(Now you probably know why I chose the "Subject" line I did.)

If I remember correctly, the primary argument against this was the
performance penalty of invalidating the cache after every kernel
memory allocation.  Besides which, it was pretty gross compared to the
superefficient buddy system.

Was this the only argument against the proposed scheme?  Is it a bad
idea to have the kernel use a page-table system to back up the buddy
system?  I'm thinking that a (non-GFP_DMA) multipage request that's in
danger of failing merely because of fragmentation would be satisfied
by RMQUEUEing a bunch of single pages and bundling them together in
the kernel page table up past the physical memory high-water mark.  On
"free", the pages would be "unbundled" and returned to the queue.

I have to claim ignorance here: I don't know how many drivers would
break if there were a virtual/physical address dichotomy in (non-GFP_DMA)
kmalloced blocks.  Perhaps we'd have to add a GFP_VIRTOK flag or
something.
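
One way to picture the GFP_VIRTOK idea (a sketch only; GFP_VIRTOK and the
wrapper name are hypothetical, and it assumes the caller really can live
with virtually contiguous, non-DMA memory):

	/* Try the normal, physically contiguous allocation first; only
	 * fall back to bundling scattered pages through the kernel page
	 * table (vmalloc) if the caller said that is acceptable. */
	void *kmalloc_virtok(unsigned int size, int priority)
	{
		void *p = kmalloc(size, priority & ~GFP_VIRTOK);

		if (!p && (priority & GFP_VIRTOK))
			p = vmalloc(size);
		return p;
	}

Freeing such a block would of course have to know which of the two
allocators it came from, which is part of the ugliness conceded above.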

Comments?  Flames?

Kevin <b...@stat.wisc.edu>

From: m...@toaster.roan.co.uk (Mike Jagdis)
Subject: Re: Please don't beat me up 
(was Re: Bugs and wishes in memory management area)
Date: 1996/11/27
Message-ID: <slrn59o1ug.9hc.mike@toaster.roan.co.uk>#1/1
X-Deja-AN: 201001570
x-nntp-posting-host: roan.demon.co.uk
references: <vba7mn8bu70.fsf@mozart.stat.wisc.edu>
x-submitted-via: n...@ratatosk.yggdrasil.com (linux.* gateway)
x-hdr-sender: m...@toaster.roan.co.uk 
x-env-sender: m...@toaster.roan.co.uk
newsgroups: linux.dev.kernel


>If I remember correctly, the primary argument against this was the
>performance penalty of invalidating the cache after every kernel
>memory allocation.  Besides which, it was pretty gross compared to the
>superefficient buddy system.

The buddy system is pretty gross for what we want. You don't need
to pull gross hacks with page tables or use complex algorithms
to handle memory allocation though!

  I've been working on improved page allocation and kmalloc that
doesn't have nasty power-of-two limits, doesn't thrash any more
than necessary and is optimised for the common case. It's a slow
job because at this stage each little change needs careful thought
and a couple of hours benchmarking to see what effect it has. We are
talking a few microseconds variation but this code is called a lot.

				Mike

-- 
.----------------------------------------------------------------------.
|  Mike Jagdis                  |  Internet:  mailto:m...@roan.co.uk   |
|  Roan Technology Ltd.         |                                      |
|  54A Peach Street, Wokingham  |  Telephone:  +44 118 989 0403        |
|  RG40 1XG, ENGLAND            |  Fax:        +44 118 989 1195        |
`----------------------------------------------------------------------'

From: Ingo Molnar <mi...@pc5829.hil.siemens.at>
Subject: Re: Please don't beat me up 
(was Re: Bugs and wishes in memory management area)
Date: 1996/11/27
Message-ID: <Pine.LNX.3.95.961127103914.13206B-100000@pc5829.hil.siemens.at>#1/1
X-Deja-AN: 201006013
sender: owner-linux-ker...@vger.rutgers.edu
references: <slrn59o1ug.9hc.mike@toaster.roan.co.uk>
content-type: TEXT/PLAIN; charset=US-ASCII
x-hdr-sender: mi...@pc5829.hil.siemens.at
mime-version: 1.0
x-env-sender: owner-linux-kernel-outgo...@vger.rutgers.edu
newsgroups: linux.dev.kernel




On Wed, 27 Nov 1996, Mike Jagdis wrote:

> >If I remember correctly, the primary argument against this was the
> >performance penalty of invalidating the cache after every kernel
> >memory allocation.  Besides which, it was pretty gross compared to the
> >superefficient buddy system.
> 
> The buddy system is pretty gross for what we want. You don't need
> to pull gross hacks with page tables or use complex algorithms
> to handle memory allocation though!

IMHO, the problem is that once a pointer is given out, you cannot
reorganize your logical->physical memory mappings. With the page table
solution you can. It's a CPU hardware feature that is hard to emulate. 

With the buddy system, once you are fragmented, you can do nothing about
it (other than using double indirection pointers [memory handles] which
basically emulate paging at a cost we probably don't want to pay?). 

And the memory handle stuff isn't good for interrupt handlers ... nor
for SMP? TLB invalidates are basically a hardware-implemented
'handle-invalidate' feature ... we cannot really implement this in
software, can we?

-- mingo

From: Mike Jagdis <m...@roan.co.uk>
Subject: Re: Please don't beat me up 
(was Re: Bugs and wishes in memory management area)
Date: 1996/11/27
Message-ID: <Pine.LNX.3.91.961127122646.13538A-100000@toaster.roan.co.uk>#1/1
X-Deja-AN: 201170011
sender: owner-linux-ker...@vger.rutgers.edu
references: <Pine.LNX.3.95.961127103914.13206B-100000@pc5829.hil.siemens.at>
content-type: TEXT/PLAIN; charset=US-ASCII
organisation: Roan Technology Ltd
x-hdr-sender: m...@roan.co.uk
mime-version: 1.0
x-env-sender: owner-linux-kernel-outgo...@vger.rutgers.edu
newsgroups: linux.dev.kernel



> IMHO, the problem is that once a pointer is given out, you cannot
> reorganize your logical->physical memory mappings. With the page table
> solution you can. It's a CPU hardware feature that is hard to emulate. 

But you don't need to reorganize, you don't need paging and you don't
need to emulate it.

  Buddy is expensive because it *always* tries to coalesce pages to
power of two blocks (despite the fact that 99.99% of requests are
for single blocks anyway) and only ever coalesces neighbouring
blocks of the same size (which makes it hideously vulnerable to
fragmentation).

  I use page arrays of arbitrary length (up to a suitably arbitrary
maximum). I use lazy coalescing to optimise the common, single
page case - this imposes an extra overhead on the first miss but
this is not serious for infrequent requests and frequent requests
tend to keep the queues of larger sizes suitably topped up (so
requests show the same overhead as the single page case). Being
able to add a single page to a 4 page array to get 5 is a _lot_
easier than finding two neighbouring 4s to join. When coalescing
pages I check to see if neighbouring pages are in the page cache
but otherwise unused. If so I reclaim them immediately rather
than waiting for shrink_mmap to clock through memory and get
round to releasing them (we could do this at interrupt time too
if I mutex the inode and page cache modifications).

  I do kmalloc in a similar-ish way, treating pages as arrays of
small (16 byte) "paragraphs" (although the optimum solution appears
to be not to coalesce them but allocate new page arrays and let
the current ones fragment away, dropping them when they become
entirely free - 99.99% of kmalloc action is below 1k).
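
[A toy rendering of that kmalloc scheme, to make the idea concrete -- this
is an illustration of the description above, not Mike's code; all names
and the bookkeeping are invented, and requests larger than a page are not
handled:]

    #define PARA_SIZE       16                      /* 32 on an Alpha */
    #define PARAS_PER_PAGE  (PAGE_SIZE / PARA_SIZE)

    struct frag_page {
            struct frag_page *next;
            unsigned short next_para;   /* next unused paragraph */
            unsigned short live;        /* fragments still allocated */
    };

    /* Carve the request off the current page; freed fragments are never
     * coalesced -- a page is handed back only once every fragment on it
     * has been freed again (freeing not shown). */
    static void *para_alloc(struct frag_page **list, int paras)
    {
            struct frag_page *fp = *list;

            if (paras >= PARAS_PER_PAGE)
                    return NULL;
            if (!fp || fp->next_para + paras > PARAS_PER_PAGE) {
                    fp = (struct frag_page *) __get_free_page(GFP_KERNEL);
                    if (!fp)
                            return NULL;
                    fp->next_para = 1;  /* paragraph 0 holds this header */
                    fp->live = 0;
                    fp->next = *list;
                    *list = fp;
            }
            fp->live++;
            fp->next_para += paras;
            return (char *) fp + (fp->next_para - paras) * PARA_SIZE;
    }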

  These schemes allow good handling of requests for arbitrary
amounts of memory but the simplicity of the algorithm gives around
zero cost. My current benchmarking shows the difference in raw
performance (latency and bandwidth) between this and the old buddy
code to be down in the noise - if anything I am a small percentage
faster. I can still trim a few more instructions as well - plus I
have a fair amount of statistics gathering and the like in there :-).

  Hopefully it will be ready for the harsh light of day in a couple
of weeks so you lot can break it :-).

				Mike

-- 
.----------------------------------------------------------------------.
|  Mike Jagdis                  |  Internet:  mailto:m...@roan.co.uk   |
|  Roan Technology Ltd.         |                                      |
|  54A Peach Street, Wokingham  |  Telephone:  +44 118 989 0403        |
|  RG40 1XG, ENGLAND            |  Fax:        +44 118 989 1195        |
`----------------------------------------------------------------------'

From: Ingo Molnar <mi...@pc5829.hil.siemens.at>
Subject: Re: Please don't beat me up 
(was Re: Bugs and wishes in memory management area)
Date: 1996/11/27
Message-ID: <Pine.LNX.3.95.961127160222.2744A-100000@pc5829.hil.siemens.at>#1/1
X-Deja-AN: 201173693
sender: owner-linux-ker...@vger.rutgers.edu
references: <Pine.LNX.3.91.961127122646.13538A-100000@toaster.roan.co.uk>
content-type: TEXT/PLAIN; charset=US-ASCII
x-hdr-sender: mi...@pc5829.hil.siemens.at
mime-version: 1.0
x-env-sender: owner-linux-kernel-outgo...@vger.rutgers.edu
newsgroups: linux.dev.kernel



On Wed, 27 Nov 1996, Mike Jagdis wrote:

> > IMHO, the problem is that once a pointer is given out, you cannot
> > reorganize your logical->physical memory mappings. With the page table
> > solution you can. It's a CPU hardware feature that is hard to emulate. 
> 
> But you don't need to reorganize, you don't need paging and you don't
> need to emulate it.

I don't get it [it must be me]. How would you allocate 2 consecutive pages
in this 5-page pool which has 3 free pages?

   page 1 -------> free
   page 2 -------> used by the Ethernet driver
   page 3 -------> free
   page 4 -------> used by the SCSI driver
   page 5 -------> free

[ Sure, we are not at page granularity with the buddy system, but the
  original poster proposed an >additional< mechanism for multipage
  vmalloc() calls, maybe by using paging. ]

Now I need 2 pages in the above situation ... from an IRQ handler, to get
8192 bytes off the networking card, for example. How would you do it?

-- mingo

From: Mike Jagdis <m...@roan.co.uk>
Subject: Re: Please don't beat me up 
(was Re: Bugs and wishes in memory management area)
Date: 1996/11/27
Message-ID: <Pine.LNX.3.91.961127170819.17768A-100000@toaster.roan.co.uk>#1/1
X-Deja-AN: 201453123
sender: owner-linux-ker...@vger.rutgers.edu
references: <Pine.LNX.3.95.961127160222.2744A-100000@pc5829.hil.siemens.at>
content-type: TEXT/PLAIN; charset=US-ASCII
organisation: Roan Technology Ltd
x-hdr-sender: m...@roan.co.uk
mime-version: 1.0
x-env-sender: owner-linux-kernel-outgo...@vger.rutgers.edu
newsgroups: linux.dev.kernel



On Wed, 27 Nov 1996, Ingo Molnar wrote:

> I don't get it [it must be me]. How would you allocate 2 consecutive pages
> in this 5-page pool which has 3 free pages?
> [...]
> Now I need 2 pages in the above situation ... from an IRQ handler, to get
> 8192 bytes off the networking card, for example. How would you do it?

You don't. If the allocated pages were swappable _and_ you weren't
in an interrupt you could page one or more out. If you can't do
that you have a problem, sure. However this is a worst-case scenario.
To get to that state you are probably short of memory and have
a high memory load. Normally there are many pages easily reclaimable
even from interrupts. If you seriously want to handle the worst
case by making things worse it would be possible to fall back to
shuffling pages using the MMU (at a price) but I don't think this
is a good idea in the general case.

  (On the other hand the MMU fiddling required is along the lines
of the sort of thing we need to do to support large, 1GB or more,
memory systems...)

				Mike

-- 
.----------------------------------------------------------------------.
|  Mike Jagdis                  |  Internet:  mailto:m...@roan.co.uk   |
|  Roan Technology Ltd.         |                                      |
|  54A Peach Street, Wokingham  |  Telephone:  +44 118 989 0403        |
|  RG40 1XG, ENGLAND            |  Fax:        +44 118 989 1195        |
`----------------------------------------------------------------------'

From: b...@stat.wisc.edu (Kevin Buhr)
Subject: Re: Please don't beat me up 
(was Re: Bugs and wishes in memory management area)
Date: 1996/11/27
Message-ID: <vba917nl2ur.fsf@mozart.stat.wisc.edu>#1/1
X-Deja-AN: 201126172
sender: owner-linux-ker...@vger.rutgers.edu
x-hdr-sender: b...@stat.wisc.edu 
references: <Pine.LNX.3.91.961127122646.13538A-100000@toaster.roan.co.uk>
x-env-sender: owner-linux-kernel-outgo...@vger.rutgers.edu
newsgroups: linux.dev.kernel



Mike Jagdis <m...@roan.co.uk> writes:
| 
| But you don't need to reorganize, you don't need paging and you don't
| need to emulate it.

If I understand correctly, the primary strength of the system you are
working on is that it wastes less memory in satisfying allocation
requests.  As the user of a tiny little 8 meg machine, you can't
imagine how much I look forward to getting my filthy hands on this
code.  ;)

However, the buddy system already does a good job of actually
satisfying small-allocation requests, even if it wastes lots of the
space it has available.  That is to say, your new code has the effect
of magically boosting the size of physical memory, but the actual
failure rates of a "kmalloc(1024,GFP_KERNEL)" won't differ, even if
that call is more likely to cause a page fault under the buddy system
than under yours.

My problem is that the buddy system is very prone to fragmentation:
when memory becomes fragmented, it *often* completely fails to satisfy
large allocations (and worse yet, the current kernel mechanisms for
freeing pages make no attempt to address fragmentation problems---they
try to free "any old page" not "a page next to another free page").

It sounds like your system will be better at satisfying large
allocation requests, simply because it will be able to find, for
example, 9000 bytes starting *anywhere* (or almost anywhere) in memory
instead of requiring four consecutive pages.  This seems to be a happy
side effect of your real goals: avoiding big chunks of wasted memory
and optimising the small-allocation case.

The proof is in the pudding.  If your system makes my fragmentation
problem vanish, I will click my heels together with glee and do a
dance of joy.  ;)

However, even after your code is in the kernel, it will still be
possible for a multipage allocation request to fail even though there
are plenty of pages all over the place.  If this situation is very
unlikely, then it's not so bad.  We can still imagine deadlock
scenarios (where every runnable process needs a 4-page block to
continue and yet allocated memory looks like a big checkerboard), but
if they are incredibly improbable, it's no big deal.

My proposed scheme is ugly, but it's for emergencies.  Click, click,
click, click and four nonconsecutive pages have been mapped into the
land of make believe.  One cache invalidate, and we've prevented
deadlock.  Since the virtual address space is so damn big (ha ha), we
can make fragmentation of the "imaginary" memory impossible, simply by
mapping on 32-page boundaries (the size of the maximum allocation
currently allowed by "kmalloc"), and we never have to deal with the
ugliness (that Ingo was talking about) of reorganizing
already-allocated mappings.

Its cost?  (1) Even if an emergency never occurs, every "kfree" will
need another check, but depending on the actual code your new
allocator uses, perhaps we'd be able to get this check done for free;
(2) It might break nasty virtual/physical address dependencies deep
inside some device driver somewhere despite the existence of a
GFP_VIRTOK flag; (3) It's ugly.

I guess we'll know whether these costs outweigh potential benefits
when we get some idea of how likely bad fragmentation is with your new
code in place.  I certainly look forward to giving it a test run!

Kevin <b...@stat.wisc.edu>

From: Ingo Molnar <mi...@pc5829.hil.siemens.at>
Subject: Re: Please don't beat me up 
(was Re: Bugs and wishes in memory management area)
Date: 1996/11/27
Message-ID: <Pine.LNX.3.95.961127213022.10159C-100000@pc5829.hil.siemens.at>#1/1
X-Deja-AN: 201403769
sender: owner-linux-ker...@vger.rutgers.edu
references: <vba917nl2ur.fsf@mozart.stat.wisc.edu>
content-type: TEXT/PLAIN; charset=US-ASCII
x-hdr-sender: mi...@pc5829.hil.siemens.at
mime-version: 1.0
x-env-sender: owner-linux-kernel-outgo...@vger.rutgers.edu
newsgroups: linux.dev.kernel




On 27 Nov 1996, Kevin Buhr wrote:

> Its cost?  (1) Even if an emergency never occurs, every "kfree" will
> need another check, but depending on the actual code your new
> allocator uses, perhaps we'd be able to get this check done for free;
> (2) It might break nasty virtual/physical address dependencies deep
> inside some device driver somewhere despite the existence of a
> GFP_VIRTOK flag; (3) It's ugly.

with double indirection pointers we could implement some kind of garbage
collector kernel thread(s). Kernel code has to 'get' and 'put' such
resources, to avoid reentrancy problems.

The benefit would be soft relocatable buffers. [This one could be made
caching-aware too, based on access pattern detection in the get/put
logic.] And this could coexist with no-MMU features like the 4MB pages or
the Sparc bigpages or the same Cyrix stuff?

The operations on these buffers would be: allocate(), get(), put(), free().
For once-only buffers it won't do any good, but for persistent buffers it is
an option?
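
[The interface being proposed might look something like this -- a sketch
only; the names are invented and nothing like this exists in the kernel.
Locking against the garbage collector thread is omitted here:]

    struct mem_handle {
            void *addr;         /* current location; the gc may move it */
            int   pincount;     /* non-zero while someone holds a get() */
    };

    struct mem_handle *handle_alloc(unsigned int size, int priority);
    void handle_free(struct mem_handle *h);

    /* Pin the buffer and return its current address; between get() and
     * put() the garbage collector must not relocate it. */
    static inline void *handle_get(struct mem_handle *h)
    {
            h->pincount++;
            return h->addr;
    }

    static inline void handle_put(struct mem_handle *h)
    {
            h->pincount--;      /* gc may now clusterise/move the buffer */
    }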

-- mingo

From: Mike Jagdis <m...@roan.co.uk>
Subject: Re: Please don't beat me up 
(was Re: Bugs and wishes in memory management area)
Date: 1996/11/28
Message-ID: <Pine.LNX.3.91.961128094919.24362B-100000@toaster.roan.co.uk>#1/1
X-Deja-AN: 201456250
sender: owner-linux-ker...@vger.rutgers.edu
references: <vba917nl2ur.fsf@mozart.stat.wisc.edu>
content-type: TEXT/PLAIN; charset=US-ASCII
organisation: Roan Technology Ltd
x-hdr-sender: m...@roan.co.uk
mime-version: 1.0
x-env-sender: owner-linux-kernel-outgo...@vger.rutgers.edu
newsgroups: linux.dev.kernel



On 27 Nov 1996, Kevin Buhr wrote:

> If I understand correctly, the primary strength of the system you are
> working on is that it wastes less memory in satisfying allocation
> requests.  As the user of a tiny little 8 meg machine, you can't
> imagine how much I look forward to getting my filthy hands on this
> code.  ;)

Until a couple of months ago I was using a 386/33 with 4MB :-).

> However, the buddy system already does a good job of actually
> satisfying small-allocation requests, even if it wastes lots of the
> space it has available.  That is to say, your new code has the effect
> of magically boosting the size of physical memory, but the actual
> failure rates of a "kmalloc(1024,GFP_KERNEL)" won't differ, even if
> that call is more likely to cause a page fault under the buddy system
> than under yours.

Small allocations, yes. The problem with buddy is that to make, say,
a block of 8 pages it needs to combine two neighbouring blocks of
4 pages and so on. This leads it to only allow certain alignments
of blocks which reduces the possibilities as you go to larger sizes
(and increases the wastage).

  Note that I talked about page allocation above. For kmalloc it
seems better to forget about coalescing fragments. The dominant
kmalloc activity is at relatively small sizes so collecting fragments
is a good thing (when we get a completely free page in the free
lists we pull all the fragments and release the page - modulo a
small delay to avoid thrashing). For larger requests we just grab
more pages and peel off the fragment we want. With the lower size
queues being kept stuffed with fragments we will have a tendency
to keep a few blocks on the active larger queues. It's like a
cascade through a meat grinder - pages go in the top, get fragmented
as needed, and when the little bits drop out the bottom we just
sweep them up and recycle them. It's pretty much self balancing.
 
> However, even after your code is in the kernel, it will still be
> possible for a multipage allocation request to fail even though there
> are plenty of pages all over the place.

Even single page requests fail eventually :-). Code should be prepared
for things to fail. The likelihood of something failing is inversely
proportional to the cost of making it work under extreme conditions.
As with everything we just have to do our best within the budget
available :-).

  Of course, you can *already* use paging to "solve" the problem.
Simply ask for memory with GFP_ATOMIC (to avoid a thrash if the
request cannot trivially be satisfied) and if it fails use vmalloc
to map pages into kernel memory. But remember that remapped memory
is not always suitable - DMA...
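
In code, the fallback described here amounts to something like (a sketch;
the buffer name is arbitrary and, as noted, the result is not DMA-safe):

	/* Try an atomic allocation first so the allocator doesn't start
	 * thrashing trying to free pages for us... */
	buf = kmalloc(size, GFP_ATOMIC);
	if (!buf)
		/* ...then settle for virtually contiguous memory. */
		buf = vmalloc(size);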

				Mike

-- 
.----------------------------------------------------------------------.
|  Mike Jagdis                  |  Internet:  mailto:m...@roan.co.uk   |
|  Roan Technology Ltd.         |                                      |
|  54A Peach Street, Wokingham  |  Telephone:  +44 118 989 0403        |
|  RG40 1XG, ENGLAND            |  Fax:        +44 118 989 1195        |
`----------------------------------------------------------------------'

From: Mike Jagdis <m...@roan.co.uk>
Subject: Re: Please don't beat me up 
(was Re: Bugs and wishes in memory management area)
Date: 1996/11/28
Message-ID: <Pine.LNX.3.91.961128094331.24362A-100000@toaster.roan.co.uk>#1/1
X-Deja-AN: 201453149
sender: owner-linux-ker...@vger.rutgers.edu
references: <Pine.LNX.3.95.961127213022.10159C-100000@pc5829.hil.siemens.at>
content-type: TEXT/PLAIN; charset=US-ASCII
organisation: Roan Technology Ltd
x-hdr-sender: m...@roan.co.uk
mime-version: 1.0
x-env-sender: owner-linux-kernel-outgo...@vger.rutgers.edu
newsgroups: linux.dev.kernel



On Wed, 27 Nov 1996, Ingo Molnar wrote:

> with double indirection pointers we could implement some kind of garbage
> collector kernel thread(s). Kernel code has to 'get' and 'put' such
> resources, to avoid reentrancy problems.

This means mutexing each access through the pointer so it can't
change while we are using the object at the other end. It might
help the worst case but at the expense of extra overhead on every
usage of an allocated object - expensive.

				Mike

-- 
.----------------------------------------------------------------------.
|  Mike Jagdis                  |  Internet:  mailto:m...@roan.co.uk   |
|  Roan Technology Ltd.         |                                      |
|  54A Peach Street, Wokingham  |  Telephone:  +44 118 989 0403        |
|  RG40 1XG, ENGLAND            |  Fax:        +44 118 989 1195        |
`----------------------------------------------------------------------'

From: Ingo Molnar <mi...@pc5829.hil.siemens.at>
Subject: Re: Please don't beat me up 
(was Re: Bugs and wishes in memory management area)
Date: 1996/11/28
Message-ID: <Pine.LNX.3.95.961128131827.13995A-100000@pc5829.hil.siemens.at>#1/1
X-Deja-AN: 201453126
sender: owner-linux-ker...@vger.rutgers.edu
references: <Pine.LNX.3.91.961128094331.24362A-100000@toaster.roan.co.uk>
content-type: TEXT/PLAIN; charset=US-ASCII
x-hdr-sender: mi...@pc5829.hil.siemens.at
mime-version: 1.0
x-env-sender: owner-linux-kernel-outgo...@vger.rutgers.edu
newsgroups: linux.dev.kernel



On Thu, 28 Nov 1996, Mike Jagdis wrote:

> On Wed, 27 Nov 1996, Ingo Molnar wrote:
> 
> > with double indirection pointers we could implement some kind of garbage
> > collector kernel thread(s). Kernel code has to 'get' and 'put' such
> > resources, to avoid reentrancy problems.
> 
> This means mutexing each access through the pointer so it can't
> change while we are using the object at the other end. It might
> help the worst case but at the expense of extra overhead on every
> usage of an allocated object - expensive.

yup ... but note that this is basically a software-implemented paging/TLB
mechanism. With all the advantages of byte granularity and arbitrary size
buffers. The 'mutex' is just for the garbage collector thread, not
for normal usage. It could be signalled with a high bit in the address or
something like that, and could be macroed heavily.

once you have got() a buffer, you can code along as if it were a normal
buffer, no speed penalty. The garbage collector thread then would
clusterise buffers that aren't in use. [they are allocated, but no pointer
is out somewhere in a context]

but anyway, this was just a gedankenexperiment to avoid TLB invalidation.

-- mingo

From: Mike Jagdis <m...@roan.co.uk>
Subject: Re: Please don't beat me up 
(was Re: Bugs and wishes in memory management area)
Date: 1996/11/28
Message-ID: <Pine.LNX.3.91.961128101411.24362C-100000@toaster.roan.co.uk>#1/1
X-Deja-AN: 201452850
sender: owner-linux-ker...@vger.rutgers.edu
references: <m0vSqZb-0005FbC@lightning.swansea.linux.org.uk>
content-type: TEXT/PLAIN; charset=US-ASCII
organisation: Roan Technology Ltd
x-hdr-sender: m...@roan.co.uk
mime-version: 1.0
x-env-sender: owner-linux-kernel-outgo...@vger.rutgers.edu
newsgroups: linux.dev.kernel



On Wed, 27 Nov 1996, Alan Cox wrote:

> There is an even worse problem with the buddy system, especially on CPU's
> without 4 way caches. All our structures tend to have the used stuff at
> the top and data below. Our kmalloc spends all day landing all the 
> critical structures on the same cache line..

That's why I started with the premise that kmalloc should deal in
terms of 16 byte "paragraphs" (32 on an Alpha) and that it should
not blindly pair groups of similar size. (As I've mentioned before,
I gave up coalescing kmalloc fragments completely). After gathering
some stats it seems useful to round requests to about 4*16 but just
add one.
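
[One plausible reading of that rounding rule, as a sketch -- the exact
constants and the interpretation of "round to about 4*16 but just add one"
are the editor's guess, not taken from the patch:]

	size = (size + 4*16 - 1) & ~(4*16 - 1); /* round up to 64-byte units */
	size += 16;                             /* plus one extra paragraph, so
						 * equal-sized objects don't all
						 * start on the same cache line */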

> On some CPU's the TLB is software. On all CPU's I know an SMP TLB invalidation
> needs software and is relatively expensive.

The principle of keep-it-simple applies in buckets. There is a big
price to pay for complexity whether algorithmic or hardware. Unless,
possibly you have absolute control of both the software *and*
hardware design. We wish... :-)

				Mike

-- 
.----------------------------------------------------------------------.
|  Mike Jagdis                  |  Internet:  mailto:m...@roan.co.uk   |
|  Roan Technology Ltd.         |                                      |
|  54A Peach Street, Wokingham  |  Telephone:  +44 118 989 0403        |
|  RG40 1XG, ENGLAND            |  Fax:        +44 118 989 1195        |
`----------------------------------------------------------------------'

From: Mike Jagdis <m...@roan.co.uk>
Subject: Re: Please don't beat me up 
(was Re: Bugs and wishes in memory management area)
Date: 1996/11/29
Message-ID: <Pine.LNX.3.91.961129092948.3799B@toaster.roan.co.uk>#1/1
X-Deja-AN: 201456293
sender: owner-linux-ker...@vger.rutgers.edu
references: <Pine.LNX.3.95.961128131827.13995A-100000@pc5829.hil.siemens.at>
content-type: TEXT/PLAIN; charset=US-ASCII
organisation: Roan Technology Ltd
x-hdr-sender: m...@roan.co.uk
mime-version: 1.0
x-env-sender: owner-linux-kernel-outgo...@vger.rutgers.edu
newsgroups: linux.dev.kernel



On Thu, 28 Nov 1996, Ingo Molnar wrote:

> once you have got() a buffer, you can code along as if it were a normal
> buffer, no speed penalty. The garbage collector thread then would
> clusterise buffers that aren't in use. [they are allocated, but no pointer
> is out somewhere in a context]

This is pretty much what we have. Kmalloc gets the buffer, you use it,
kfree releases it. The only thing it doesn't currently do is to shuffle
allocated things around to make larger chunks. My experimentation
suggests that coalescing in kmalloc is not worthwhile anyway.

				Mike

-- 
.----------------------------------------------------------------------.
|  Mike Jagdis                  |  Internet:  mailto:m...@roan.co.uk   |
|  Roan Technology Ltd.         |                                      |
|  54A Peach Street, Wokingham  |  Telephone:  +44 118 989 0403        |
|  RG40 1XG, ENGLAND            |  Fax:        +44 118 989 1195        |
`----------------------------------------------------------------------'

From: Mark Hemment <mar...@nextd.demon.co.uk>
Subject: Re: Please don't beat me up 
(was Re: Bugs and wishes in memory management area)
Date: 1996/11/29
Message-ID: <Pine.LNX.3.91.961129202137.142A-100000@nextd.demon.co.uk>
X-Deja-AN: 201565405
sender: owner-linux-ker...@vger.rutgers.edu
references: <Pine.LNX.3.91.961129093724.3799C-100000@toaster.roan.co.uk>
content-type: TEXT/PLAIN; charset=US-ASCII
x-hdr-sender: mar...@nextd.demon.co.uk
mime-version: 1.0
x-env-sender: owner-linux-kernel-outgo...@vger.rutgers.edu
newsgroups: linux.dev.kernel




On Fri, 29 Nov 1996, Mike Jagdis wrote: 
> On Thu, 28 Nov 1996, Mark Hemment wrote:
> > Unfortunately, when searching for a page to move into the free memory 
> > pool no weight is given to whether the released page will improve the 
> > quality of the memory-pool (such as allowing coalescing).

> 99.99% of the time we want a single page.

Yep, but...there are those occasions when more than one page is 
needed.  These are the hardest requests to handle (or prepare for).
I was suggesting 'priming' the free memory pool for these cases (without 
introducing a large overhead - less restrictions on selecting a page to
reap == more chance of getting the page 'type' you need).
 
> > But they are not freed immediately.  Instead, update the necessary pte's 
> > (and anon structs), place the pages on a reclaim list, and tell the 
> > paging out daemon to start. 
> That is more or less what we have. Pages are kept in the page cache
> or buffer pool as long as possible even when unused. Kswapd keeps
> the actual free lists topped up to minimise overhead on allocation
> requests plus requests fall back to calling try_to_free_page
> themselves if necessary.

Think you might have missed my point (or I didn't make it v. well), or 
I'm misunderstanding your reply.  What you say above is correct.  My point 
was extending the 'swap-cache', to make it more general (into a 
"reclaim-list").
The current swap-cache is used to prevent a thrash case on the swap 
device after an anonymous page has been unmapped from an address space and 
the process references it v. soon after. (Remember, Linux has asynchronous 
writes to swap).
 
> > OK, I've simplified the above quite a bit.  But do you get the idea?
> > If you think you're seen the above before, you're correct.  It's what 
> > some other UNIX OSes do, and for good reasons.
> 
> Linux is not entirely dissimilar but it is optimised more for the
> "sufficient memory" case - there is no way to go from a page to
> the page tables it is used in.

Where the lack of physical-page to pte(s) mapping causes the most 
problems (for me anyway) is in a "make -j zImage" on the kernel
source.
When Linux identifies a named page to be reaped, it _only_ removes it 
from the address space of the process under which it found the page.  There 
may be other processes that still have a ref to the page.  Hence, all the 
searching work has been done for zero gain - OK, not totally zero.  At 
least when it stumbles over the other mappings for the page it will be 
released.
I'm being a bit harsh here.  It's not very often that a single text 
page has many mappings, and needs to be unmapped.  Plus the text page 
usage patterns in my example probably mean releasing such a page is not 
good - it would soon be faulted in again.

Also note the line "if (page_map->count != 1) return 0;" in 
try_to_swap_out().  This prevents the paging out of shared anonymous 
pages.  Only affects some pages, but I'd like to get rid of it (hence 
the need for anon structs).

> > 		     and improves the performance of swap (when reading
> > 		     back, we read several pages back - locality of
> > 		     reference on a none page granularity). 
> Linux also does page ahead.

Unless I've missed something, there is no read-ahead on a swap device!  
(No point as pages are not guaranteed any order on swap, as I was 
proposing).
Clustering of swap pages, yes, but that's something different. Have a look 
at scan_swap_map() for the clustering (been some time since I looked at 
this func, but I seem to remember the idea is to prevent a disk head from 
moving all over a swap device writing single pages in the holes left by 
fault-ins from the device).
 
> > 		iii) Identify pages that allow coalescing in the free
> > 		     memory pool.
> 
> I reclaim unused pages from the page cache when coalescing and it
> seems worthwhile. It needs work to mutex things so we can do it
> at interrupt time as well though.

Good.  I thought it would be messy (is it?).  I was thinking about using 
the age of pages in the cache when deciding what would coalesce with 
pages in the free memory pool.  Perhaps using a random selection policy 
for a page is good enough (is this what you're doing?).  By random I 
mean pseudo-random, ie. the first page we find, from the previous 
starting point, which allows a coalesce.
 
> > See my slab allocator on http://www.nextd.demon.co.uk/ 
> Yes, I've been looking at it. There are some interesting ideas. But it
> is relatively slow by my current standards :-). PCs have all too
> little L1 cache, all but the oldest processors have enough pipelining
> to be able to init a structure pretty quickly and conditional branches
> are a killer - even with the Cyrix branch prediction. It may be
> possible to optimise slab though...

Slow? Slow!  Maybe in the v. _worst_ case.  The common, critical, path is 
short, with few branches (haven't checked the assembly - I should do! - 
just relied on how I believe gcc produces code. eg. "if() goto ???" when 
the  'if' is rarely taken, and "if () {common-case}" when the 'if' will 
normally be taken.  But I should check what it produces...).
If some of the handlers to allow different types of packing are removed, 
then the paths can be made even shorter.  A compromise is always needed; 
using memory efficiently, against the instruction count.  Yes, it needs a 
bit of tuning (already being done), but not much.
As for the L1 cache?  A structure is usually referenced many times before 
it is freed, so helping whatever L1 cache is available is well worthwhile.
(Note, my critical paths have no extra instructions to allow for cache 
alignment).  Init-ing a struct inline may be quick, but it will 
still place instructions into both types of caches (instruction and 
data).  If a constructed state can be maintained for a structure (and 
hence, some code out of the 'hot' path), then it should be.
 
>   (It still does the silly power-of-two sizing though, doesn't it?)

Most definitely not, walk the code (or ask me questions)!
In my patch the 'general purpose caches' are 2^n because I was rushed in 
getting it out and couldn't be bothered to tune the sizes (not a good 
recommendation on my part...).  The only powers of two are: the num of 
pages per slab (due to get_free_page()), bytes per page (h/w constraint), 
and the size of cache lines (because that's the way nature intended them).

BTW, let's not knock powers of two.  They are nice numbers (often your 
best friend where efficiency is concerned).  Be careful of using 
Fibonacci sequences, or some other sets, for allocator sizes.  Their 
internal fragmentation tends to become high over time (not good for an 
OS).  Non-coalescing, splitting allocators tend to show the same 
behaviour.  If you know the usage patterns, it is possible to design 
efficient allocators for specific applications.  Difficult to do in a 
kernel.

Regards,

   markhe

From: Mike Jagdis <m...@roan.co.uk>
Subject: Re: Please don't beat me up 
(was Re: Bugs and wishes in memory management area)
Date: 1996/12/02
Message-ID: <Pine.LNX.3.91.961202120038.1602A-100000@toaster.roan.co.uk>
X-Deja-AN: 202046718
sender: owner-linux-ker...@vger.rutgers.edu
references: <Pine.LNX.3.91.961129202137.142A-100000@nextd.demon.co.uk>
content-type: TEXT/PLAIN; charset=US-ASCII
organisation: Roan Technology Ltd
x-hdr-sender: m...@roan.co.uk
mime-version: 1.0
x-env-sender: owner-linux-kernel-outgo...@vger.rutgers.edu
newsgroups: linux.dev.kernel



> Yep, but...there are those occasions when more than one page is 
> needed.  These are the hardest requests to handle (or prepare for).
> I was suggesting 'priming' the free memory pool for these cases (without 
> introducing a large overhead - less restrictions on selecting a page to
> reap == more chance of getting the page 'type' you need).

My problem with preparing for multi-page allocations in advance is
that you incur overheads coalescing pages into blocks and then splitting
them again. This costs on things like network latency and bandwidth
for instance. My preferred approach at the moment is to take the
hit on first allocation of a particular size but to tend to keep
requested sizes around for a while so that, if this is other than
a one off, further requests can be satisfied with due speed. I am
also considering the wisdom of doing some coalescing during idle
time to keep a few suitably large blocks around in order to minimize
the miss on the first request in particular. I haven't implemented
this however...

> Think you might have missed my point (or I didn't make it v. well), or 
> I'm misunderstanding your reply.  What you say above is correct.  My point 
> was extending the 'swap-cache', to make it more general (into a 
> "reclaim-list").

Ah, I was thinking more of the page cache. However, aren't pages in
the swap cache also in the page cache? I seem to remember that reclaiming
from the page cache needs a delete_from_swap_cache. Are we getting the
swap and page caches confused?

> When Linux identifies a named page to be reaped, it _only_ removes it 
> from the address space of the process under which it found the page.  There 
> may be other processes that still have a ref to the page.  Hence, all the 
> searching work has been done for zero gain - OK, not totally zero.  At 
> least when it stumbles over the other mappings for the page it will be 
> released.

Yes, this could get expensive (the extra invalidates are probably
not good for SMP either). I tend to look on this as a separate
problem for now however.

> Good.  I thought it would be messy (is it?).  I was thinking about using 
> the age of pages in the cache when deciding what would coalesce with 
> pages in the free memory pool.  Perhaps using a random selection policy 
> for a page is good enough (is this what you're doing?).  By random I 
> mean pseudo-random, ie. the first page we find, from the previous 
> starting point, which allows a coalesce.

Essentially random, yes, since there is no way to predict the current
distribution of pages and their uses. It's simplistic in that it
doesn't look ahead to see if it will ultimately be able to build
a block big enough but seems to balance out well since over-coalescing
on one request increases the chance of future requests trivially
succeeding - so the spread of latencies widens but the average stays
reasonably good. Of course, a one off request for 1MB of contiguous
DMA-able memory is always going to be somewhat anti-social :-).

> Slow? Slow!  Maybe in the v. _worst_ case.  The common, critical, path is 
> short, with few branches (haven't checked the assembly - I should do! - 
> just relied on how I believe gcc produces code. eg. "if() goto ???" when 
> the  'if' is rarely taken, and "if () {common-case}" when the 'if' will 
> normally be taken.  But I should check what it produces...).

My timings show it often slower than the current stock kernel for the
simple, benchmarkable, cases whereas I can go faster fairly consistently
(without modifying anything other than kmalloc.c and page_alloc.c).
I haven't looked closely enough to say where the time goes though.

> As for the L1 cache?  A structure is usually referenced many times before 
> it is freed, so helping whatever L1 cache is available is well worthwhile.
> (Note, my critical paths have no extra instructions to allow for cache 
> alignment).  Init-ing a struct inline may be quick, but it will 
> still place instructions into both types of caches (instruction and 
> data).  If a constructed state can be maintained for a structure (and 
> hence, some code out of the 'hot' path), then it should be.

True enough but the instructions to init a structure are generally
pretty fast, easily pipelined and inline so in-cache. The structure
they are writing to is almost certain to be either in cache or likely
to be accessed in the very near future anyway. The assumption is
that a "recently" freed structure will be reused in the "near" future
while it is still in cache. I don't know the precise details of other
chips but on the 6x86 the L1 is "only" 16KB organised as 512 lines
by 32 bytes. How long will an L1 line remain resident? I went for
the "fast reuse of the right size block" rather than the instruction
saving. Unfortunately I don't know any way to track cache hits. (I seem
to remember that the Alpha has some cache statistics capabilities
but that's out of my reach :-( ).

> >   (It still does the silly power-of-two sizing though, doesn't it?)
> 
> Most definitely not, walk the code (or ask me questions)!
> In my patch the 'general purpose caches' are 2^n because I was rushed in 
> getting it out and couldn't be bothered to tune the sizes (not a good 
> recommendation on my part...).

Ah, I saw the table and jumped to the conclusion that it was significant.

				Mike

-- 
.----------------------------------------------------------------------.
|  Mike Jagdis                  |  Internet:  mailto:m...@roan.co.uk   |
|  Roan Technology Ltd.         |                                      |
|  54A Peach Street, Wokingham  |  Telephone:  +44 118 989 0403        |
|  RG40 1XG, ENGLAND            |  Fax:        +44 118 989 1195        |
`----------------------------------------------------------------------'