TODO list for new VM

From: Rik van Riel <r...@conectiva.com.br>
Subject: TODO list for new VM
Date: 2000/09/16
Message-ID: <linux.kernel.Pine.LNX.4.21.0009160544000.1519-100000@duckman.distro.conectiva>#1/1
X-Deja-AN: 670474882
Approved: n...@nntp-server.caltech.edu
X-To: linux...@kvack.org
Content-Type: TEXT/PLAIN; charset=US-ASCII
MIME-Version: 1.0
X-cc: linux-ker...@vger.kernel.org, Linus Torvalds 
<torva...@transmeta.com>, Matthew Dillon <dil...@apollo.backplane.com>
Newsgroups: mlist.linux.kernel

Hi,

Here is the TODO list for the new VM. The only thing
really needed for 2.4 is the OOM handler; the
page->mapping->flush() callback is really wanted by
the journaling filesystem folks.

The rest are mostly extras that would be nice; these
things won't be pushed for inclusion unless they turn
out to be really trivial to implement, perform well
in the cases they're supposed to affect, and have a
highly localised influence...

(sorry folks, but for 2.4 I'll be really conservative)

---> TODO list for the new VM <---

for kernel 2.4, necessary:
- out of memory handling
	[integrate the OOM killer, 10 minutes work]

for kernel 2.4, really wanted:
- page->mapping->flush() callback in page_launder(),
  for easier integration with journaling filesystems
  and maybe the network filesystems
	[about 30 minutes of work on the VM side]
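
A minimal userspace sketch of how such a callback could be consulted from a
page_launder()-style loop. The struct layouts, the return convention of
flush(), and the helpers default_writepage(), launder_one_page() and
journal_flush_pinned() are assumptions made for illustration; they are not
the actual 2.4 interface.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct page;

struct address_space {
	/* Assumed hook: the owner of the mapping (e.g. a journaling
	 * filesystem) decides whether a dirty page may be written now.
	 * Returns 0 if the page was cleaned, nonzero if it must stay dirty. */
	int (*flush)(struct page *page);
};

struct page {
	struct address_space *mapping;
	int dirty;
};

/* Fallback used when the mapping provides no flush() hook. */
static int default_writepage(struct page *page)
{
	page->dirty = 0;	/* stand-in for a real write */
	return 0;
}

/* Example hook: a journaling filesystem refusing to write a page
 * whose transaction has not committed yet. */
static int journal_flush_pinned(struct page *page)
{
	(void)page;
	return -1;
}

/* Per-page step of the laundering loop.
 * Returns 1 if the page is clean and reclaimable, 0 otherwise. */
static int launder_one_page(struct page *page)
{
	int err;

	assert(page != NULL);
	if (!page->dirty)
		return 1;

	if (page->mapping && page->mapping->flush)
		err = page->mapping->flush(page);
	else
		err = default_writepage(page);

	return err == 0 && !page->dirty;
}
```

In this sketch the VM side only has to call the hook and honour its answer,
which is consistent with the small amount of VM-side work estimated above.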

for kernel 2.4, wanted:
- include Ben LaHaise's code, which moves readahead
  to the VMA level; this way we can do streaming swap
  IO, complete with drop_behind()
- code to make the "knee" smoother: currently the system
  keeps eating memory from the cache up to a certain point
  and then starts to swap a lot; it would be nice to smooth
  this curve a bit
- thrashing control, maybe process suspension with some
  forced swapping ?

for kernel 2.5:
- physical->virtual reverse mapping, so we can do much
  better page aging with less CPU usage spikes
- better IO clustering for swap (and filesystem) IO
- move all the global VM variables, lists, etc. into
  the pgdat struct for better NUMA scalability
- (maybe) some QoS things, as far as they are major
  improvements with minor intrusion
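
The reverse mapping item can be sketched as a per-page list of the page table
entries that map it, so that aging one page only touches its own mappings
instead of walking every process's page tables. All names here (pte_chain,
page_add_rmap(), page_referenced(), age_page()) and the aging constants are
assumptions made for this sketch, not an existing kernel API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical page table entry: only the bit this sketch needs. */
struct pte {
	unsigned referenced : 1;
};

/* One link in the reverse map: "this pte maps that page". */
struct pte_chain {
	struct pte *pte;
	struct pte_chain *next;
};

struct page {
	struct pte_chain *rmap;	/* all ptes mapping this physical page */
	int age;
};

/* Record that `pte` maps `page`; the caller supplies the link storage. */
static void page_add_rmap(struct page *page, struct pte_chain *link,
			  struct pte *pte)
{
	assert(page && link && pte);
	link->pte = pte;
	link->next = page->rmap;
	page->rmap = link;
}

/* Test and clear the referenced bit in every mapping of one page.
 * Returns how many mappings had the bit set. */
static int page_referenced(struct page *page)
{
	int referenced = 0;

	for (struct pte_chain *pc = page->rmap; pc; pc = pc->next) {
		if (pc->pte->referenced) {
			pc->pte->referenced = 0;
			referenced++;
		}
	}
	return referenced;
}

/* Age one page: up when recently used, halved otherwise.
 * The +3 and /2 steps are arbitrary values chosen for the sketch. */
static void age_page(struct page *page)
{
	if (page_referenced(page))
		page->age += 3;
	else
		page->age /= 2;
}
```

The cost of aging a page is then proportional to the number of mappings of
that page, which is where the hoped-for reduction in CPU usage spikes would
come from.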

regards,

Rik
--
"What you're running that piece of s*** Gnome?!?!"
       -- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/		http://www.surriel.com/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

From: Rik van Riel <r...@conectiva.com.br>
Subject: TODO list for new VM  (oct 2000)
Date: 2000/10/02
Message-ID: <linux.kernel.Pine.LNX.4.21.0010021447430.22539-100000@duckman.distro.conectiva>#1/1
X-Deja-AN: 676743861
Approved: n...@nntp-server.caltech.edu
X-To: linux-ker...@vger.kernel.org
Content-Type: TEXT/PLAIN; charset=US-ASCII
MIME-Version: 1.0
X-cc: linux...@kvack.org, Matthew Dillon <dil...@apollo.backplane.com>, 
Linus Torvalds <torva...@transmeta.com>
Newsgroups: mlist.linux.kernel

[MM TODO list, updated for October 2000]

---
Here is the TODO list for the new VM. The only things
really needed for 2.4 are the OOM handler and a fix
for the highmem deadlock.

The page->mapping->flush() callback is really wanted
by the journaling filesystem folks.

The rest are mostly extras that would be nice; these
things won't be pushed for inclusion unless they turn
out to be really trivial to implement, perform well
in the cases they're supposed to affect, and have a
highly localised influence...

(sorry folks, but for 2.4 I'll be really conservative)

---> TODO list for the new VM <---

for kernel 2.4, necessary:
- out of memory handling
	[integrate the OOM killer, 10 minutes work]
- fix the highmem deadlock, where the swapper cannot create
  low memory bounce buffers OR swap out low memory because
  it has consumed all resources
	[old bug, already reported with 2.4.0-test6, probably before]

for kernel 2.4, really wanted:
- page->mapping->flush() callback in page_launder(),
  for easier integration with journaling filesystems
  and maybe the network filesystems
	[about 30 minutes of work on the VM side]

for kernel 2.4, wanted:
- maybe rebalance the swapper a bit ... we do page aging
  now so maybe refill_inactive_scan() / shm_swap() and
  swap_out() need to be rebalanced a bit

for kernel 2.5:    (maybe available as patch for 2.4 ???)
- physical->virtual reverse mapping, so we can do much
  better page aging with less CPU usage spikes
- better IO clustering for swap (and filesystem) IO
- move all the global VM variables, lists, etc. into
  the pgdat struct for better NUMA scalability
- (maybe) some QoS things, as far as they are major
  improvements with minor intrusion
- thrashing control, maybe process suspension with some
  forced swapping ?
- include Ben LaHaise's code, which moves readahead
  to the VMA level; this way we can do streaming swap
  IO, complete with drop_behind()

regards,

Rik


From: Linus Torvalds <torva...@transmeta.com>
Subject: Re: TODO list for new VM  (oct 2000)
Date: 2000/10/02
Message-ID: <linux.kernel.Pine.LNX.4.10.10010021117540.828-100000@penguin.transmeta.com>#1/1
X-Deja-AN: 676743872
Approved: n...@nntp-server.caltech.edu
X-To: Rik van Riel <r...@conectiva.com.br>
Content-Type: TEXT/PLAIN; charset=US-ASCII
MIME-Version: 1.0
X-cc: linux-ker...@vger.kernel.org, linux...@kvack.org, 
Matthew Dillon <dil...@apollo.backplane.com>
Newsgroups: mlist.linux.kernel

Why do you apparently ignore the fact that page-out write-back performance
is horribly crappy because it always starts out doing synchronous writes?

I pointed out previously in a private email that page_launder() must be
buggy as it stands now, you seem to have ignored that part (and the
test-program that shows 1MB/s writeout speeds due to it) completely.

The whole _point_ of the new VM was performance. Without that, the new VM
is pointless, and discussing TODO features is equally pointless.

		Linus

From: Rik van Riel <r...@conectiva.com.br>
Subject: Re: TODO list for new VM  (oct 2000)
Date: 2000/10/02
Message-ID: <linux.kernel.Pine.LNX.4.21.0010021524140.22539-100000@duckman.distro.conectiva>#1/1
X-Deja-AN: 676743875
Approved: n...@nntp-server.caltech.edu
X-To: Linus Torvalds <torva...@transmeta.com>
Content-Type: TEXT/PLAIN; charset=US-ASCII
MIME-Version: 1.0
X-cc: linux-ker...@vger.kernel.org, linux...@kvack.org, 
Matthew Dillon <dil...@apollo.backplane.com>
Newsgroups: mlist.linux.kernel

On Mon, 2 Oct 2000, Linus Torvalds wrote:

> Why do you apparently ignore the fact that page-out write-back
> performance is horribly crappy because it always starts out
> doing synchronous writes?

Because it is fixed in the patch I mailed yesterday?

regards,

Rik

From: Rik van Riel <r...@conectiva.com.br>
Subject: Re: TODO list for new VM  (oct 2000)
Date: 2000/10/02
Message-ID: <linux.kernel.Pine.LNX.4.21.0010021531360.22539-100000@duckman.distro.conectiva>#1/1
X-Deja-AN: 676743870
Approved: n...@nntp-server.caltech.edu
X-To: Linus Torvalds <torva...@transmeta.com>
Content-Type: TEXT/PLAIN; charset=US-ASCII
MIME-Version: 1.0
X-cc: linux-ker...@vger.kernel.org, linux...@kvack.org, 
Matthew Dillon <dil...@apollo.backplane.com>
Newsgroups: mlist.linux.kernel

On Mon, 2 Oct 2000, Rik van Riel wrote:
> On Mon, 2 Oct 2000, Linus Torvalds wrote:
> 
> > Why do you apparently ignore the fact that page-out write-back
> > performance is horribly crappy because it always starts out
> > doing synchronous writes?
> 
> Because it is fixed in the patch I mailed yesterday?

One small warning though. Please don't apply that patch
yet because I fixed 3 more small problems today. I'll
send you an updated patch...

- the compile warnings are fixed
- in try_to_free_pages(), we forgot to set
  PF_MEMALLOC in current->flags  (oops)
- in grow_buffers(), in case we cannot get a
  buffer head, we must unlock the page
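
To illustrate why the missing PF_MEMALLOC matters, here is a minimal
userspace mock, assuming the flag's role is to mark a task that is already
reclaiming memory so that its own allocations do not re-enter reclaim. The
flag value, struct task, alloc_page_slow() and reclaim_depth are invented
for this sketch; only the names PF_MEMALLOC, current->flags and
try_to_free_pages() come from the message.

```c
#include <assert.h>

#define PF_MEMALLOC 0x0800UL	/* illustrative value for this sketch */

struct task {
	unsigned long flags;
};

static struct task the_task;
static struct task *current = &the_task;	/* stand-in for the kernel's current */
static int reclaim_depth;

static int try_to_free_pages(void);

/* Slow path of a hypothetical allocator: reclaim only if this task
 * is not already inside the reclaim code. */
static int alloc_page_slow(void)
{
	if (current->flags & PF_MEMALLOC)
		return 0;	/* already reclaiming: do not recurse */
	return try_to_free_pages();
}

/* Returns the number of pages "freed" in this mock. */
static int try_to_free_pages(void)
{
	unsigned long saved = current->flags & PF_MEMALLOC;
	int freed = 1;	/* pretend this pass frees one page */

	current->flags |= PF_MEMALLOC;
	reclaim_depth++;
	assert(reclaim_depth == 1);	/* the flag keeps reclaim from nesting */

	/* Reclaim may itself allocate, e.g. for writeout bookkeeping. */
	freed += alloc_page_slow();

	reclaim_depth--;
	/* Restore the caller's view of the flag. */
	current->flags = (current->flags & ~PF_MEMALLOC) | saved;
	return freed;
}
```

Without the `current->flags |= PF_MEMALLOC` line, the nested
alloc_page_slow() call in this mock would re-enter try_to_free_pages()
without bound.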

A patch against 2.4.0-test9-pre8 with these 3 changes will
be on its way once I've tested it a bit...

regards,

Rik

From: Matthew Dillon <dil...@apollo.backplane.com>
Subject: Re: TODO list for new VM  (oct 2000)
Date: 2000/10/04
Message-ID: <linux.kernel.200010050108.SAA83892@apollo.backplane.com>#1/1
X-Deja-AN: 677710359
Approved: n...@nntp-server.caltech.edu
X-To: Rik van Riel <r...@conectiva.com.br>
X-Cc: Linus Torvalds <torva...@transmeta.com>, linux-ker...@vger.kernel.org, 
linux...@kvack.org, Matthew Dillon <dil...@apollo.backplane.com>
Newsgroups: mlist.linux.kernel

:On Mon, 2 Oct 2000, Rik van Riel wrote:
:> On Mon, 2 Oct 2000, Linus Torvalds wrote:
:> 
:> > Why do you apparently ignore the fact that page-out write-back
:> > performance is horribly crappy because it always starts out
:> > doing synchronous writes?
:> 
:> Because it is fixed in the patch I mailed yesterday?
:
:One small warning though. Please don't apply that patch
:yet because I fixed 3 more small problems today. I'll
:send you an updated patch...
:...
:regards,
:
:Rik

    My experience with FreeBSD's asynchronous paging
    is that you have to carefully limit the number of
    I/Os you queue at once.  Or, more specifically, you
    have to limit the seeking load the async pageouts
    place on the system.

    The performance curve from the point of view of user
    processes in the system looks like a bell, while the paging
    performance looks like a log curve (increased performance
    with diminishing returns)... if you queue too few
    pages (degenerating into synchronous paging), you have low
    paging performance and high user process performance,
    but you can't clean pages fast enough in a heavily loaded
    system.  If you queue too many pages at once, you have
    high paging performance (but with diminishing returns)
    and low user process performance due to the seeking
    load you've placed on the disk.  Excessive seeking
    from pageouts will ruin the disk's performance from
    the point of view of other processes in the system.

    FreeBSD has a sysctl variable called vm.max_page_launder
    which limits the number of pages the pageout daemon
    will queue to I/O at once.  The default is 32.   Numbers
    between 16 and 32 were found to fit the sweet spot of
    the curve the best.  Numbers lower than 16 reduced
    system performance because potentially contiguous pageouts
    would get split (causing more seeking rather than less when
    mixed with I/O initiated from user processes), and numbers
    higher than 32 reduced user process performance due to the
    additional seeking from the queued pageouts.
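
The limit described above can be sketched as a cap on how many asynchronous
writes a single pageout pass may start. This is a userspace illustration
only: the max_page_launder variable stands in for the sysctl, and struct
page and launder_pass() are invented for the sketch rather than taken from
FreeBSD.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the vm.max_page_launder sysctl; 32 is the default
 * quoted in the message. */
static int max_page_launder = 32;

struct page {
	int dirty;
	int queued;	/* an async write has been started for this page */
};

/* One pass of a hypothetical pageout scan: start async writes for
 * dirty pages, but stop once max_page_launder writes are queued.
 * Returns the number of writes queued in this pass. */
static int launder_pass(struct page *pages, size_t npages)
{
	int queued = 0;

	assert(max_page_launder > 0);
	for (size_t i = 0; i < npages && queued < max_page_launder; i++) {
		if (pages[i].dirty && !pages[i].queued) {
			pages[i].queued = 1;	/* stand-in for starting the I/O */
			queued++;
		}
	}
	return queued;
}
```

With 100 dirty pages and the default of 32, successive passes in this sketch
queue 32, 32, 32 and then 4 writes, so the disk never sees more than 32
pageout requests from one pass.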

    The sysadmin can adjust the value to effectively give
    paging more or less priority.  A smaller number reduces
    paging performance but increases system performance
    for other processes (though anything less than 4 will
    reduce performance for everyone).  A higher number
    increases paging performance at the cost of system
    performance for other processes.  Virtually all FreeBSD
    installations that I know about leave the sysctl variable
    alone.

    Note that the performance bell holds true whether you
    sort disk requests or not, the whole bell simply moves up
    or down on the graph.

    There are a number of things that can be done to mitigate
    the seeking issue, which I discussed with Rik a few months
    ago.  The gist of it, though, is that there is a trade-off
    between page-in and page-out performance based on how you
    try to cluster swap allocation.  FreeBSD clusters swap
    allocations to optimize page-out performance at the cost
    of page-in performance and that seems to work very
    well under heavy system loads.
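
One way to picture that trade-off is a swap allocator that hands out
consecutive slots to whatever pages happen to be paged out together. This is
a deliberately simplified sketch, not FreeBSD's allocator: SWAP_SLOTS, the
bump allocator and swap_alloc_cluster() are all assumptions made for
illustration.

```c
#include <assert.h>
#include <stddef.h>

#define SWAP_SLOTS 1024	/* size of the mock swap area */

static size_t next_free_slot;	/* simple bump allocator over the swap area */

/* Assign consecutive swap slots to a batch of n pages that are being
 * paged out together, regardless of their virtual addresses.
 * Returns 0 on success, -1 if the batch does not fit. */
static int swap_alloc_cluster(size_t n, size_t *slots)
{
	assert(slots != NULL);
	if (n > SWAP_SLOTS - next_free_slot)
		return -1;
	for (size_t i = 0; i < n; i++)
		slots[i] = next_free_slot++;
	return 0;
}
```

Because a batch lands in adjacent slots, the page-out is a single sequential
write. The cost shows up later: pages that are neighbours in a process's
address space may sit in distant slots, so paging them back in can require
more seeks.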

					-Matt
					Matthew Dillon 
					<dil...@backplane.com>