2.4.10 VM vs. 2.4.9-ac14

Path: archiver1.google.com!newsfeed.google.com!newsfeed.stanford.edu!
news.tele.dk!small.news.tele.dk!129.240.148.23!uio.no!nntp.uio.no!
ifi.uio.no!internet-mailinglist
Newsgroups: fa.linux.kernel
Return-Path: <linux-kernel-ow...@vger.kernel.org>
X-Authentication-Warning: loke.as.arizona.edu: ckulesa owned process doing -bs
Original-Date: 	Mon, 24 Sep 2001 05:08:49 -0700 (MST)
From: Craig Kulesa <ckul...@as.arizona.edu>
To: <linux-ker...@vger.kernel.org>
Subject: 2.4.10 VM vs. 2.4.9-ac14 (+ ac14-aging)
Original-Message-ID: <Pine.LNX.4.33.0109232255250.14107-100000@loke.as.arizona.edu>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: linux-kernel-ow...@vger.kernel.org
Precedence: bulk
X-Mailing-List: 	linux-kernel@vger.kernel.org
Organization: Internet mailing list
Date: Mon, 24 Sep 2001 12:13:00 GMT
Message-ID: <fa.k18fuuv.1uksjgg@ifi.uio.no>
Lines: 146

Well, things are looking up.  Both of the split 2.4 VM trees seem to be
performing "pretty well" here.  Following are some tests and comments
about recent kernels that I hope will be vaguely illuminating toward
further improvement.

Description of tests:

- Streaming IO test:
  dbench, 'dd if=/dev/zero of=dummy.dat bs=1024k count=512' and
  'cat dummy.dat > /dev/null' while performing streaming tasks like
  mp3 playback and general interactive use.  This is obscene, but dirty
  page overloading needs to be handled at least *acceptably*, without
  resorting to low-latency or preemptible patches.

- Common user application test: the idea is to load a mix of applications
  to drive the system into different kinds of memory loads.  [sequential]

	a) fill dentry/inode caches with slocate
	b) create lots of anonymous pages w/ a large blank image in GIMP
	   [make sure GIMP's tile cache is set to a high value to test
	     kernel VM and not GIMP's 'temp' file handling]
	c) loading StarOffice w/file creates lots of disk i/o,
	   stretches VM cache and then allocates lots of user
	   memory... [report loading time]
	d) load suite of apps to drive the system into mild
	   swap *activity* (not just swap allocation in 2.4.9-ac)
	e) Now that some pages have aged a bit, try to rotate that GIMP
	   image (major use of "older" anon pages and creation of many
	   more)   [time the rotation]
	f) note WHO's paged out w/ ps, log to file
	g) close all apps sequentially, sorta LIFO, note swap-ins

	vmstat & periodic dumps from /proc/meminfo and /proc/slabinfo log
	all statistics throughout the tests
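
The periodic /proc/meminfo dumps can be reduced to a single "swap in use"
number per snapshot.  What follows is a minimal sketch, not the script used
for these tests: the helper names and the sample values are made up for
illustration, and it only assumes that the dump contains "Name: N kB" lines
such as SwapTotal and SwapFree.

```python
import re

# Matches lines such as "SwapFree:   250000 kB"; any other line is ignored.
_FIELD = re.compile(r"^(\w+):\s+(\d+)\s*kB\s*$")


def parse_meminfo(text):
    """Return {field: kB} for every 'Name: N kB' line in a meminfo dump."""
    fields = {}
    for line in text.splitlines():
        match = _FIELD.match(line)
        if match:
            fields[match.group(1)] = int(match.group(2))
    return fields


def swap_used_kb(fields):
    """Swap currently allocated, in kB."""
    return fields["SwapTotal"] - fields["SwapFree"]


# Hypothetical snapshot; the numbers are illustrative, not measured.
SAMPLE = """\
MemTotal:       191000 kB
MemFree:          3200 kB
Cached:          81000 kB
SwapTotal:      262136 kB
SwapFree:       250000 kB
"""

if __name__ == "__main__":
    with open("/proc/meminfo") as handle:
        print(swap_used_kb(parse_meminfo(handle.read())), "kB of swap in use")
```

Logging this value next to each vmstat sample makes it easier to tell swap
allocation apart from actual swap activity.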

Summary of Results:

- Test machines ranged from 32 MB to 192 MB, the latter is described
  here.

- 2.4.8 and 2.4.9 were poor, degenerating to _awful_ somewhere in
  2.4.10-pre. Example: it was darn near impossible to evict dentry and
  inode caches in 2.4.8.  Also, freshly loaded apps were paged out under
  load, then repeatedly paged back in, then back out... (poor interaction
  and/or balancing between the various inactive lists, coupled presumably
  w/ broken aging).

2.4.8 streaming IO test: failed (stutters, huge gaps in playback)
2.4.8 app test:  45524 kB swapped out; 29638 kB swapped in (cumulative)
		 28 second StarOffice load time; 10 sec GIMP img rotate

- 2.4.10-pre11 changed the nature of the VM problems, but most major
  issues seem to have been fixed by pre14 (certainly 2.4.10 final).  pre11
  would spin in kswapd & 'somewhere else' (balance classzone?) --
  sometimes loading StarOffice 5.2 would take 50% longer due partly to
  kswapd; no pages were actually swapped out.  Fixed by/before pre14.
  Even in 2.4.10 final, choice of evicted pages is not always good (many
  more cumulative swapins than ac14 when apps are closed).  Performance is
  otherwise pretty impressive.

2.4.10 streaming IO test: failed (stutters, frequent gaps in playback)
2.4.10 app test: 30020 kB swapped out; 22308 kB swapped in (cumulative)
		 22 second StarOffice load time; 6-7 sec GIMP img rotate

- 2.4.9-ac1* has pretty consistent, functioning VM.  Looks like aging is
  still mildly broken. Performance however is quite excellent for the most
  part; cache contains the "right pages" and what is paged out is "mostly
  the right pages".  Recent 2.4.9-ac (ac14 tested) had the best streaming
  I/O interactivity; it also outperforms everything else until lots of
  anonymous pages have to be allocated in swapcache (esp. when you're
  talking about a large scientific simulation on a HIGHMEM box; see Dirk
  Wetter's posts from around 12 July 2001 and Marcelo's comments).

2.4.9-ac14 streaming IO test:  passed, skip-less playback
			       (ac14-aging patch results identical)
2.4.9-ac14 app test: 30968 kB swapped out; 12900 kB swapped back in
		 18 second StarOffice load time; 8 sec GIMP img rotate
ac14+aging, app test: 31664 kB swapped out; 14604 kB swapped back in
		 18 second StarOffice load time; 8 sec GIMP img rotate

As above, Rik's latest ac14-aging was tested.  It, like ac12-aging, has
performed pretty well.  I'm not sure that it's doing all the right things
in detail.  For example, plain ac14 swapped out just as many pages,
but swapped fewer of them back in when the apps were closed.  Inactive
daemons loaded at boot time are among the oldest pages on the system;
ac14 swapped them entirely out.  2.4.10 and ac14+aging had similar
behavior and only paged them out a little (e.g. out of SIZE = 2 MB, 0.5 MB
was still RSS) and hit 'younger' loaded apps (with big RSS)
somewhat harder instead.  Not sure if that's right; pure aging should
presumably page the unused daemons first, but drawing from big, idle hogs
might be more fruitful?  The aging patch simplifies the code a bit, and I
think that's a good thing.
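
One way to compare how well each kernel chose its victims is the fraction of
swapped-out data that later had to come back in.  The sketch below only
rearranges the app-test figures quoted above; the "swap-in ratio" metric and
the function name are an illustrative choice, not something the kernels
report.

```python
# (kernel, kB swapped out, kB swapped back in), copied from the app-test
# results quoted above.
RESULTS = [
    ("2.4.8", 45524, 29638),
    ("2.4.10", 30020, 22308),
    ("2.4.9-ac14", 30968, 12900),
    ("2.4.9-ac14+aging", 31664, 14604),
]


def swapin_ratio(swapped_out_kb, swapped_in_kb):
    """Fraction of swapped-out data that was later swapped back in."""
    return swapped_in_kb / swapped_out_kb


ratios = {name: swapin_ratio(out, back) for name, out, back in RESULTS}

# Lower is better: fewer of the evicted pages were needed again.
for name, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{name:18s} {ratio:.2f}")
```

A lower ratio suggests that more of the evicted pages really were idle,
which is the sense in which plain ac14 picked "mostly the right pages".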

ac14-aging easily collapses the dentry and inode caches under load.  This
works well here, but others might want to check to see if it's _too_
aggressive.  Suspect it's okay...

Rik's page launder patch for ac12 was also applied to ac14; it failed
the streaming IO test.  ac14 and ac14+aging were the only tested
kernels to pass.  No preemptive kernel patches were applied.

Comments:

I dunno what to think about the split VM trees.  The traditional
2.4 VM looks quite good in the latest 2.4.9-ac, but could stand additional
careful analysis & pruning.  I suspect most of the problems relate to inactive
lists interacting/balancing badly with each other, but the overall design
seems sensible.  Much of it is pretty well documented (even *I* can
follow it in some kind of coarse sense) & that effort is deeply
appreciated.  Andrea's classzone approach reduces inactive list
complexity, but I remain confused about the classzone design itself.
[Have to look at it more; rather new at this.]

I mean, I look at 'traditional' 2.4 VM and wonder why it sometimes
doesn't work like it should; in contrast, I look at classzone and wonder
how/why it manages to work so well. :)

Totally IMHO, my VM wishlist for 2.5 would be to see the return of some
aspects of 2.4 VM that got nixed.  I liked the overall design, although
implementation of inactive-lists/anon-pages needs to be made more
maintainable.  In particular, so-called 'anonymous' pages *have* to be
handled in a more sensible way.  Dump them in the active list (?),
allocate them in a separate fs from what-will-hopefully-become-swapfs-in-
2.5, or *something*.  Improved get_swap_page(), swap_out() & associates
probably should be on that list somewhere.

But things are looking *much* better now -- a real huge 'thank you' is in
order. :)  And looking forward to testing patches, and 2.5...

Best regards to all,

Craig Kulesa
Univ. of Arizona, Steward Observatory

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Newsgroups: fa.linux.kernel
Return-Path: <linux-kernel-ow...@vger.kernel.org>
Original-Date: 	Tue, 25 Sep 2001 20:51:43 +0200
From: Andrea Arcangeli <and...@suse.de>
To: Craig Kulesa <ckul...@as.arizona.edu>
Cc: linux-ker...@vger.kernel.org
Subject: Re: 2.4.10 VM vs. 2.4.9-ac14 (+ ac14-aging)
Original-Message-ID: <20010925205143.C8350@athlon.random>
Original-References: <Pine.LNX.4.33.0109232255250.14107-100...@loke.as.arizona.edu>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <Pine.LNX.4.33.0109232255250.14107-100000@loke.as.arizona.edu>; 
from ckulesa@as.arizona.edu on Mon, Sep 24, 2001 at 05:08:49AM -0700
X-GnuPG-Key-URL: http://e-mind.com/~andrea/aa.gnupg.asc
X-PGP-Key-URL: http://e-mind.com/~andrea/aa.asc
Sender: linux-kernel-ow...@vger.kernel.org
Precedence: bulk
X-Mailing-List: 	linux-kernel@vger.kernel.org
Organization: Internet mailing list
Date: Tue, 25 Sep 2001 18:53:06 GMT
Message-ID: <fa.go19h3v.66i7pb@ifi.uio.no>
References: <fa.k18fuuv.1uksjgg@ifi.uio.no>
Lines: 14

On Mon, Sep 24, 2001 at 05:08:49AM -0700, Craig Kulesa wrote:
> 2.4.10 streaming IO test: failed (stutters, frequent gaps in playback)
> 2.4.10 app test: 30020 kB swapped out; 22308 kB swapped in (cumulative)
> 		 22 second StarOffice load time; 6-7 sec GIMP img rotate

I'd appreciate if you could repeat the test with vm-tweaks-1 applied to
see the difference.

Andrea

Newsgroups: fa.linux.kernel
Return-Path: <linux-kernel-ow...@vger.kernel.org>
X-Authentication-Warning: loke.as.arizona.edu: ckulesa owned process doing -bs
Original-Date: 	Wed, 26 Sep 2001 06:38:48 -0700 (MST)
From: Craig Kulesa <ckul...@as.arizona.edu>
To: <linux-ker...@vger.kernel.org>
Subject: VM in 2.4.10(+tweaks) vs. 2.4.9-ac14/15(+stuff)
Original-Message-ID: <Pine.LNX.4.33.0109260617450.3929-100000@loke.as.arizona.edu>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: linux-kernel-ow...@vger.kernel.org
Precedence: bulk
X-Mailing-List: 	linux-kernel@vger.kernel.org
Organization: Internet mailing list
Date: Wed, 26 Sep 2001 13:41:27 GMT
Message-ID: <fa.l1grl8v.fkk201@ifi.uio.no>
Lines: 124

As requested, here are a number of tests of the latest VM patches.  Tests
are described in a previous post, archived here:

    http://www.uwsg.indiana.edu/hypermail/linux/kernel/0109.3/0033.html

Results:

2.4.10 performance is great compared to 2.4.[7-9], but these tests
still seem to point out some room for improvement in the 2.4.10 VM tree.
2.4.10 and 2.4.10(+00_vm-tweaks-1) performed similarly.  The vm-tweaks
patch improved the swap smoothness, but the number of pages swapped out
didn't change measurably, nor did the large number of swap-ins.  Clogging
the system with dirty pages via 'dd' still causes XMMS to skip badly.

Let's push the aging/list-order code more by driving the system a bit
harder in step d), namely adding mozilla to the common user application
test.  We will also stream mp3 audio throughout the entire test.

2.4.10(+00_vm-tweaks-1)
	48 sec StarOffice load time
	28 sec 2560x2560 GIMP image rotation
	82400 KB swapped out, 92148 KB swapped back in

2.4.9-ac14 + aging
	33 sec StarOffice load time
	25 sec GIMP image rotation
	30072 KB swapped out, 22252 KB swapped back in

2.4.9-ac15 + aging + launder
	33 sec StarOffice load time
	24 sec GIMP image rotation
	57556 KB swapped out, 25900 KB swapped back in

'vmstat 1' sessions for these three cases are available at:
	http://loke.as.arizona.edu/~ckulesa/kernel/

2.4.10+ is clearly working a LOT harder to keep dentry and inode caches
in memory, and is swapping out harder to compensate.  The ac14/ac15 tree
frees those caches more readily, and doesn't page application working sets
out so easily.

Let's test this statement by not pre-filling the inode and dentry caches
with 'slocate' and performing the same test:

2.4.10(+00_vm-tweaks)
	26 sec StarOffice load time
	24 sec GIMP image rotation
	48332 KB swapped out, 33521 KB swapped back in

2.4.9-ac14 + aging
	32 sec StarOffice load time
	26 sec GIMP image rotation
	37392 KB swapped out, 11952 KB swapped back in

2.4.9-ac15 + aging + launder
	32 sec StarOffice load time
	22 sec GIMP image rotation
	23884 KB swapped out, 10828 KB swapped back in

2.4.10 does much better this time; in particular the StarOffice loading
that was so plagued by swapouts, pressured by dentry/inode caching last
time, went smoothly.  But there's still more paging than with
2.4.9-ac1[4-5].

Let's try one more aging/list-order experiment.  Instead of creating a
2560x2560 GIMP image first, then loading StarOffice and many other
applications after (to start swapping, and cause GIMP pages to be
candidates for reaping) -- this time let's load StarOffice first and then
create the GIMP image.  This should keep the GIMP image at a 'younger' age
and presumably shouldn't page back into memory (rotation should be
faster).  StarOffice may swap itself entirely out however.

2.4.10(+00_vm-tweaks)
	25 sec StarOffice load time
	29 sec GIMP image rotation
	64427 KB swapped out, 77422 KB swapped back in

2.4.9-ac14 + aging
	30 sec StarOffice load time
	24 sec GIMP image rotation
	22147 KB swapped out, 8922 KB swapped back in

2.4.9-ac15 + aging + launder
	31 sec StarOffice load time
	21 sec GIMP image rotation
	17204 KB swapped out, 8224 KB swapped back in

The 2.4.10 behavior surprised me.  The GIMP pages are younger in memory,
yet the rotation was slowed by swapin & swapout activity -- slower than
before.  Plus more StarOffice pages were swapped out, so they had to be paged
back in just to close the application.  I'm puzzled.  The ac14/ac15
behavior was closer to what I expected; the GIMP pages were young and
unswapped, only the earliest StarOffice pages had to be recalled.

These are samples of rather 'ordinary' loads which 2.4.10 needs some work
handling; the ac15 tree is doing a better job with this particular set
right now (ac15 tree also doesn't skip XMMS with the creation of lots of
dirty pages via 'dd').  But all three kernels tested kept the user
interface relatively responsive, which is an improvement over previous
2.4 releases.  Very cool.

A note on page_launder().  ac14 has the smoothest swapping, with small
chunks laundered at a time.  ac14+aging and ac15+aging+launder both swap
out huge (10-20 MB) chunks at a time.  Admittedly, the user interface is
responsive and XMMS doesn't skip a beat, but most of the 60 MB of
actual swapout in the first test in ac15+stuff came from only THREE
lines of 'vmstat 1' output.  Otherwise there was no swapout activity.
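
The "chunkiness" of swapout can be put into a number by asking what share of
the total swapout came from the few busiest one-second samples.  This is a
minimal sketch under stated assumptions: burst_share is a hypothetical
helper, and both sample lists are invented to show the two shapes; they are
not taken from the vmstat logs linked above.

```python
def burst_share(so_kb_per_sec, top_n=3):
    """Fraction of all swapout contributed by the top_n largest samples."""
    total = sum(so_kb_per_sec)
    if total == 0:
        return 0.0
    biggest = sorted(so_kb_per_sec, reverse=True)[:top_n]
    return sum(biggest) / total


# Invented 'so' column samples in kB/s, for illustration only.
bursty = [0, 0, 18000, 0, 400, 20000, 0, 16000, 600, 0]
smooth = [5500] * 10

print(f"bursty: {burst_share(bursty):.2f}  smooth: {burst_share(smooth):.2f}")
```

A share close to 1.0 for top_n=3 corresponds to the behavior described above,
where nearly all swapout lands in three lines of output; evenly laundered
swapout gives a share close to top_n divided by the number of samples.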

Best regards, and thanks for the excellent work!

Craig Kulesa
Steward Observatory, Univ. of Arizona


Newsgroups: fa.linux.kernel
Return-Path: <linux-kernel-ow...@vger.kernel.org>
Original-Date: 	Wed, 26 Sep 2001 16:03:47 +0200
From: Andrea Arcangeli <and...@suse.de>
To: Craig Kulesa <ckul...@as.arizona.edu>
Cc: linux-ker...@vger.kernel.org
Subject: Re: VM in 2.4.10(+tweaks) vs. 2.4.9-ac14/15(+stuff)
Original-Message-ID: <20010926160347.F27945@athlon.random>
Original-References: <Pine.LNX.4.33.0109260617450.3929-100...@loke.as.arizona.edu>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <Pine.LNX.4.33.0109260617450.3929-100000@loke.as.arizona.edu>; 
from ckulesa@as.arizona.edu on Wed, Sep 26, 2001 at 06:38:48AM -0700
X-GnuPG-Key-URL: http://e-mind.com/~andrea/aa.gnupg.asc
X-PGP-Key-URL: http://e-mind.com/~andrea/aa.asc
Sender: linux-kernel-ow...@vger.kernel.org
Precedence: bulk
X-Mailing-List: 	linux-kernel@vger.kernel.org
Organization: Internet mailing list
Date: Wed, 26 Sep 2001 14:05:38 GMT
Message-ID: <fa.ebjq3mv.1mgutbl@ifi.uio.no>
References: <fa.l1grl8v.fkk201@ifi.uio.no>
Lines: 14

On Wed, Sep 26, 2001 at 06:38:48AM -0700, Craig Kulesa wrote:
> in memory, and is swapping out harder to compensate.  The ac14/ac15 tree

2.4.10 is swapping out more also because I don't keep track of which
pages are just uptodate on the swap space. This will be fixed as soon as
I teach get_swap_page to collect away from the swapcache mapped
exclusive swap pages.

Andrea

Newsgroups: fa.linux.kernel
Return-Path: <linux-kernel-ow...@vger.kernel.org>
Original-Date: 	Wed, 26 Sep 2001 11:23:44 -0300 (BRST)
From: Rik van Riel <r...@conectiva.com.br>
X-X-Sender:  <r...@imladris.rielhome.conectiva>
To: Andrea Arcangeli <and...@suse.de>
Cc: Craig Kulesa <ckul...@as.arizona.edu>, <linux-ker...@vger.kernel.org>
Subject: Re: VM in 2.4.10(+tweaks) vs. 2.4.9-ac14/15(+stuff)
In-Reply-To: <20010926160347.F27945@athlon.random>
Original-Message-ID: <Pine.LNX.4.33L.0109261123070.19147-100000@imladris.rielhome.conectiva>
X-spambait: aardv...@kernelnewbies.org
X-spammeplease: 	aardv...@nl.linux.org
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: linux-kernel-ow...@vger.kernel.org
Precedence: bulk
X-Mailing-List: 	linux-kernel@vger.kernel.org
Organization: Internet mailing list
Date: Wed, 26 Sep 2001 14:25:28 GMT
Message-ID: <fa.q4q8otv.1ulom87@ifi.uio.no>
References: <fa.ebjq3mv.1mgutbl@ifi.uio.no>
Lines: 26

On Wed, 26 Sep 2001, Andrea Arcangeli wrote:
> On Wed, Sep 26, 2001 at 06:38:48AM -0700, Craig Kulesa wrote:
> > in memory, and is swapping out harder to compensate.  The ac14/ac15 tree
>
> 2.4.10 is swapping out more also because I don't keep track of which
> pages are just uptodate on the swap space. This will be fixed as soon
> as I teach get_swap_page to collect away from the swapcache mapped
> exclusive swap pages.

Wouldn't that be easier to do from try_to_swap_out() ?

cheers,

Rik
-- 
IA64: a worthy successor to i860.

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardv...@nl.linux.org (spam digging piggy)


Newsgroups: fa.linux.kernel
Return-Path: <linux-kernel-ow...@vger.kernel.org>
Original-Date: 	Wed, 26 Sep 2001 16:49:35 +0200
From: Andrea Arcangeli <and...@suse.de>
To: Rik van Riel <r...@conectiva.com.br>
Cc: Craig Kulesa <ckul...@as.arizona.edu>, linux-ker...@vger.kernel.org
Subject: Re: VM in 2.4.10(+tweaks) vs. 2.4.9-ac14/15(+stuff)
Original-Message-ID: <20010926164935.J27945@athlon.random>
Original-References: <20010926160347.F27...@athlon.random> 
<Pine.LNX.4.33L.0109261123070.19147-100...@imladris.rielhome.conectiva>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: 
<Pine.LNX.4.33L.0109261123070.19147-100000@imladris.rielhome.conectiva>; 
from riel@conectiva.com.br on Wed, Sep 26, 2001 at 11:23:44AM -0300
X-GnuPG-Key-URL: http://e-mind.com/~andrea/aa.gnupg.asc
X-PGP-Key-URL: http://e-mind.com/~andrea/aa.asc
Sender: linux-kernel-ow...@vger.kernel.org
Precedence: bulk
X-Mailing-List: 	linux-kernel@vger.kernel.org
Organization: Internet mailing list
Date: Wed, 26 Sep 2001 14:52:43 GMT
Message-ID: <fa.eb3m4nv.1l0qsbj@ifi.uio.no>
References: <fa.q4q8otv.1ulom87@ifi.uio.no>
Lines: 26

On Wed, Sep 26, 2001 at 11:23:44AM -0300, Rik van Riel wrote:
> On Wed, 26 Sep 2001, Andrea Arcangeli wrote:
> > On Wed, Sep 26, 2001 at 06:38:48AM -0700, Craig Kulesa wrote:
> > > in memory, and is swapping out harder to compensate.  The ac14/ac15 tree
> >
> > 2.4.10 is swapping out more also because I don't keep track of which
> > pages are just uptodate on the swap space. This will be fixed as soon
> > as I teach get_swap_page to collect away from the swapcache mapped
> > exclusive swap pages.
> 
> Wouldn't that be easier to do from try_to_swap_out() ?

Of course that's a possibility but then we'd have to duplicate it in all
other get_swap_page callers, see?

And I think it fits much better hidden in get_swap_page: the semantics of
get_swap_page() are "give the caller a newly allocated swap entry".
So IMHO it is its own business to discard our "optimizations" to
generate a free swap entry in case all swap was already allocated.
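
The argument can be illustrated with a toy user-space model.  This is not
kernel code and not the planned patch: the class, its three slot states and
the mark_cache_only() helper are all hypothetical, and only the idea that the
allocator itself falls back to reclaiming swap-cache-only entries when swap
is full comes from the discussion above.

```python
class ToySwapArea:
    """Toy model of a swap area; each slot is 'free', 'in_use' or 'cache_only'.

    'cache_only' stands for a slot kept only as an optimization: the page is
    still up to date in RAM, so the on-disk copy could be dropped.
    """

    def __init__(self, nslots):
        self.state = ["free"] * nslots

    def mark_cache_only(self, slot):
        self.state[slot] = "cache_only"

    def get_swap_page(self):
        """Return a newly allocated slot index, or None if swap is exhausted."""
        # Normal case: hand out a free slot.
        for slot, state in enumerate(self.state):
            if state == "free":
                self.state[slot] = "in_use"
                return slot
        # Swap is full: give up the optimization and reuse a cache-only slot.
        for slot, state in enumerate(self.state):
            if state == "cache_only":
                self.state[slot] = "in_use"
                return slot
        return None


area = ToySwapArea(2)
first, second = area.get_swap_page(), area.get_swap_page()
area.mark_cache_only(first)        # page was swapped back in and stayed clean
print(area.get_swap_page())        # reuses the cache-only slot
```

Because the fallback lives inside get_swap_page(), every caller gets it
without having to duplicate the logic.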

Andrea

Newsgroups: fa.linux.kernel
Return-Path: <linux-kernel-ow...@vger.kernel.org>
X-Authentication-Warning: ping.us.dell.com: robert owned process doing -bs
Original-Date: 	Wed, 26 Sep 2001 13:17:29 -0500 (CDT)
From: Robert Macaulay <robert_macau...@dell.com>
X-X-Sender:  <rob...@ping.us.dell.com>
Reply-To: Robert Macaulay <robert_macau...@dell.com>
To: Andrea Arcangeli <and...@suse.de>
cc: Rik van Riel <r...@conectiva.com.br>,
        Craig Kulesa <ckul...@as.arizona.edu>, <linux-ker...@vger.kernel.org>
Subject: Re: VM in 2.4.10(+tweaks) vs. 2.4.9-ac14/15(+stuff)
In-Reply-To: <20010926164935.J27945@athlon.random>
Original-Message-ID: <Pine.LNX.4.33.0109261310340.23259-100000@ping.us.dell.com>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: linux-kernel-ow...@vger.kernel.org
Precedence: bulk
X-Mailing-List: 	linux-kernel@vger.kernel.org
Organization: Internet mailing list
Date: Wed, 26 Sep 2001 18:19:09 GMT
Message-ID: <fa.kqf6llv.o7oqpt@ifi.uio.no>
References: <fa.eb3m4nv.1l0qsbj@ifi.uio.no>
Lines: 46

We've tried 2.4.10 with vmtweaks2 on our machine with 8GB RAM. It was
looking good for a while, until it just stopped. Here is what was
happening on the machine.

I was ftping files into the box at a rate of about 8MB/sec. This continued
until all the RAM was in the cache column. This was earlier in the
included vmstat output. Then I started a dd if=/dev/sde of=/dev/null in a
new window.

All was looking good until it just stopped. I captured the vmstat below. 
vmstat continued running for about 1 minute, then it died too. What other 
info can I provide?

 2  0  0   4148   3612  36088 7946652   0   0 15488    64 10216 23346   0  11  88
 2  0  1   4148   6424  36100 7944288   0   0 11526    40  7107 15848   0  18  82
 1  1  1   4132   5452  36112 7945444   0   0 11642  6208  7531 16983   0  17  83
 2  1  1   4132   4972  36128 7946100   0   0 14272 11904 10651 24330   0  13  87
 3  0  0   4132   4480  36144 7946588   0   0 13120  6760 11007 25144   0  12  88
 0  1  0   4132   5312  36160 7944964   0   0 15616     0  9935 22793   0  10  89
 0  3  1   4132   2924  36168 7947052   0   0  6727 11010  5049 11226   0  26  74
 0  2  2   4132   2668  36168 7946396   0   0  1666  8598  2230  4598   0  11  89
 0  2  2   4132   3776  36168 7946396   0   0     0     0   159     5   0   0 100
 0  2  2   4132   3768  36168 7946396   0   0     0     0   121     5   0   0 100
 0  2  2   4132   3760  36168 7946396   0   0     0     0   126     4   0   0 100
 0  2  2   4132   3756  36168 7946396   0   0     0     0   139     4   0   0 100
 0  2  2   4132   3756  36168 7946396   0   0     0     0   148     5   0   0 100
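
The point where the box stops can be picked out of such a log mechanically.
The sketch below is a hedged illustration: the column names are an
assumption (the post gives no header line; the sixteen fields are taken to be
the usual r b w swpd free buff cache si so bi bo in cs us sy id of vmstat
from that era), and "stalled" is defined here, arbitrarily, as blocked
processes with no block I/O in either direction.  The four sample rows are
copied from the output above.

```python
COLUMNS = "r b w swpd free buff cache si so bi bo in cs us sy id".split()


def parse_vmstat(text):
    """Split vmstat data into dict rows of len(COLUMNS) fields each.

    Splitting on all whitespace means rows that a mail client wrapped
    across two lines still parse correctly.
    """
    tokens = [int(token) for token in text.split()]
    width = len(COLUMNS)
    return [dict(zip(COLUMNS, tokens[i:i + width]))
            for i in range(0, len(tokens) - width + 1, width)]


def stalled(row):
    """Assumed definition: something is blocked, yet no block I/O happens."""
    return row["b"] > 0 and row["bi"] == 0 and row["bo"] == 0


SAMPLE = """
 0  3  1   4132   2924  36168 7947052   0   0  6727 11010 5049 11226   0  26  74
 0  2  2   4132   2668  36168 7946396   0   0  1666  8598 2230  4598   0  11  89
 0  2  2   4132   3776  36168 7946396   0   0     0     0  159     5   0   0 100
 0  2  2   4132   3768  36168 7946396   0   0     0     0  121     5   0   0 100
"""

rows = parse_vmstat(SAMPLE)
first_stall = next(i for i, row in enumerate(rows) if stalled(row))
print("first stalled sample:", first_stall)
```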



Newsgroups: fa.linux.kernel
Return-Path: <linux-kernel-ow...@vger.kernel.org>
Original-Date: 	Wed, 26 Sep 2001 20:36:51 +0200
From: Andrea Arcangeli <and...@suse.de>
To: Robert Macaulay <robert_macau...@dell.com>
Cc: Rik van Riel <r...@conectiva.com.br>,
        Craig Kulesa <ckul...@as.arizona.edu>, linux-ker...@vger.kernel.org
Subject: Re: VM in 2.4.10(+tweaks) vs. 2.4.9-ac14/15(+stuff)
Original-Message-ID: <20010926203651.Q27945@athlon.random>
Original-References: <20010926164935.J27...@athlon.random> 
<Pine.LNX.4.33.0109261310340.23259-100...@ping.us.dell.com>
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="qMm9M+Fa2AknHoGS"
Content-Disposition: inline
In-Reply-To: <Pine.LNX.4.33.0109261310340.23259-100000@ping.us.dell.com>; 
from robert_macaulay@dell.com on Wed, Sep 26, 2001 at 01:17:29PM -0500
X-GnuPG-Key-URL: http://e-mind.com/~andrea/aa.gnupg.asc
X-PGP-Key-URL: http://e-mind.com/~andrea/aa.asc
Sender: linux-kernel-ow...@vger.kernel.org
Precedence: bulk
X-Mailing-List: 	linux-kernel@vger.kernel.org
Organization: Internet mailing list
Date: Wed, 26 Sep 2001 18:40:32 GMT
Message-ID: <fa.ecj24fv.1ngutj7@ifi.uio.no>
References: <fa.kqf6llv.o7oqpt@ifi.uio.no>
Lines: 167


--qMm9M+Fa2AknHoGS
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

On Wed, Sep 26, 2001 at 01:17:29PM -0500, Robert Macaulay wrote:
> We've tried 2.4.10 with vmtweaks2 on our machine with 8GB RAM. It was
> looking good for a while, until it just stopped. Here is what was
> happening on the machine.
>
> I was ftping files into the box at a rate of about 8MB/sec. This continued
> until all the RAM was in the cache column. This was earlier in the
> included vmstat output. Then I started a dd if=/dev/sde of=/dev/null in a
> new window.
> 
> All was looking good until it just stopped. I captured the vmstat below. 
> vmstat continued running for about 1 minute, then it died too. What other 
> info can I provide?

the best/first info in this case would be sysrq+T along with the system.map.

You may want to give a spin also to the patch in the attached email.

thanks,
Andrea

--qMm9M+Fa2AknHoGS
Content-Type: message/rfc822
Content-Disposition: inline

Date: Wed, 26 Sep 2001 16:45:42 +0200
From: Andrea Arcangeli <and...@suse.de>
To: "Oleg A. Yurlov" <k...@spylog.com>
Cc: linux-ker...@vger.kernel.org, Bob Matthews <bmatth...@redhat.com>,
	Linus Torvalds <torva...@transmeta.com>,
	Marcelo Tosatti <marc...@conectiva.com.br>,
	Rik van Riel <r...@conectiva.com.br>
Subject: Re: 2.4.10aa1 - 0-order allocation failed.
Message-ID: <20010926164542.I27...@athlon.random>
References: <1601012257268.20010926180...@spylog.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <1601012257268.20010926180...@spylog.com>; 
from k...@spylog.com on Wed, Sep 26, 2001 at 06:07:48PM +0400
X-GnuPG-Key-URL: http://e-mind.com/~andrea/aa.gnupg.asc
X-PGP-Key-URL: http://e-mind.com/~andrea/aa.asc

On Wed, Sep 26, 2001 at 06:07:48PM +0400, Oleg A. Yurlov wrote:
> 
>         Hi, Andrea,
> 
>         We have next problem on our servers:
> 
> Sep 26 11:22:39 sol kernel: __alloc_pages: 0-order allocation failed (gfp=0x20/0)
> Sep 26 11:22:39 sol kernel: f048dd94 e02ab000 00000000 00000020 00000000 00000020 00000020 e298f820 
> Sep 26 11:22:39 sol kernel:        e298f844 00000001 e030a56c e030a6c4 00000020 00000000 e01382be 00000000 
> Sep 26 11:22:39 sol kernel:        e013874a e013488c 00000000 e298f820 00000202 e298f898 00000202 00000246 
> Sep 26 11:22:39 sol kernel: Call Trace: [put_dirty_page+122/132] [flush_old_exec+234/572] [sys_ustat+212/268] [kill_super+232/352] [unix_gc+394/748] 
> Sep 26 11:22:39 sol kernel:    [Unused_offset+27374/99203] [Unused_offset+12842/99203] [call_spurious_interrupt+14521/27705] [Unused_offset+43342/99203] [call_spurious_interrupt+14615/27705] [call_spurious_interrupt+16483/27705] 
> Sep 26 11:22:39 sol kernel:    [Unused_offset+90704/99203] [ipgre_rcv+233/636] [ipgre_rcv+503/636] [fcntl_getlk+327/624] [do_invalid_TSS+43/96] 
> Sep 26 11:22:39 sol kernel: __alloc_pages: 0-order allocation failed (gfp=0x20/0)
> Sep 26 11:22:39 sol kernel: f048ddd4 e02ab000 00000000 00000020 00000000 00000020 00000020 e298f820 
> Sep 26 11:22:39 sol kernel:        e298f844 00000001 e030a56c e030a6c4 00000020 00000000 e01382be 00000000 
> Sep 26 11:22:39 sol kernel:        e013874a e013488c 00000000 e298f820 00000202 e298f898 00000202 00000246 
> Sep 26 11:22:39 sol kernel: Call Trace: [put_dirty_page+122/132] [flush_old_exec+234/572] [sys_ustat+212/268] [kill_super+232/352] [unix_gc+394/748] 
> Sep 26 11:22:39 sol kernel:    [Unused_offset+27374/99203] [call_spurious_interrupt+13905/27705] [call_spurious_interrupt+17048/27705] [Unused_offset+90704/99203] [ipgre_rcv+233/636] [ipgre_rcv+503/636] 
> Sep 26 11:22:39 sol kernel:    [fcntl_getlk+327/624] [do_invalid_TSS+43/96] 

the system.map is wrong but this should be harmless, just a notice (if
you do the reverse lookup to find the address and you resolve the right
symbols we could make sure of that).

For driver writers (since it could be on topic with those GFP_ATOMIC
failures): as I suggested to the SG folks, make sure to never use
GFP_ATOMIC in normal kernel context; if you want low latency, use GFP_NOIO
instead.  GFP_NOIO can schedule (so you must release all the spinlocks
first) but it will never block on I/O, so it also provides low latency,
_but_ it is able to shrink the clean cache, so it is very unlikely
to fail unless you have lots of dirty or mapped cache in RAM.
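
The difference between the two masks can be modeled in a few lines of
user-space code.  This is a sketch under stated assumptions, not the kernel
headers: the bit values and the two composite masks are written from memory
of 2.4-era include/linux/mm.h and should be checked against your own tree
(the only value confirmed by this thread is gfp=0x20 in the failure log
quoted above, which is consistent with GFP_ATOMIC).  The describe() helper is
hypothetical.

```python
# Assumed bit values; verify against include/linux/mm.h in your tree.
WAIT_BIT = 0x10   # models __GFP_WAIT: the allocation may sleep
HIGH_BIT = 0x20   # models __GFP_HIGH: may dip into emergency reserves
IO_BIT = 0x40     # models __GFP_IO: may start physical I/O

GFP_ATOMIC = HIGH_BIT              # assumed definition
GFP_NOIO = HIGH_BIT | WAIT_BIT     # assumed definition


def describe(gfp_mask):
    """Summarize what an allocation with this mask is allowed to do."""
    may_sleep = bool(gfp_mask & WAIT_BIT)
    return {
        "may_sleep": may_sleep,
        # An allocation that may sleep must not be made under a spinlock.
        "must_drop_spinlocks": may_sleep,
        "may_start_io": bool(gfp_mask & IO_BIT),
    }


for name, mask in (("GFP_ATOMIC", GFP_ATOMIC), ("GFP_NOIO", GFP_NOIO)):
    print(name, hex(mask), describe(mask))
```

In this model neither mask can start I/O, which is why both stay low latency;
only GFP_NOIO may sleep, which is what gives the allocator a chance to shrink
clean cache instead of failing.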

>         Also, we see next in process status:
> 
> USER       PID %CPU %MEM   VSZ  RSS TTY      STAT START   TIME COMMAND
> vz         927  0.0 625.1 43900 4267034752 ? S    08:10   0:00 hits
> vz        1030  0.0 625.1 43900 4267034752 ? S    08:11   0:00 hits
> vz        4561  1.3 625.1 45948 4267034724 ? S    10:48   0:00 hits
> root      4564  0.0  0.0  1460  548 pts/2    S    10:48   0:00 grep hits
> vz        4566  0.0 625.1 45948 4267034724 ? S    10:48   0:00 hits

Ben sent the fix for this one [Linus, you can find it on l-k if you
weren't cc'ed] (was a missing check in the tlb shootdown smp fixes) but
it's only a beauty issue, so really don't worry about it :)

>         After these errors we see some uninterruptible processes (with flag D in
> process status); gdb says that the function "fdatasync" was called and never
> returned.  A soft reboot does not work.
> 
>         Server   has   2  CPUs (Pentium III Katmai), 2Gb RAM, 2Gb swap, Hardware
> RAID (Mylex DAC960PTL1 PCI RAID Controller).
> 
>         Any ideas ?

Yes you have highmem.

Last night I spent one hour on the traces from Bob (btw, many thanks for
the helpful report Bob!) and the first suspect is the recent
GFP_NOHIGHIO logic.

Despite Bob's traces not obviously showing this, I think I can see a
potential problem with writepage with regard to the GFP_NOHIGHIO logic
(I just checked that 2.4.9ac15 has the same issue too, see the CAN_DO_FS
definition, so this shouldn't have been introduced recently).

This should fix it, and please also apply vm-tweaks-2 posted to l-k a
few minutes ago.

--- 2.4.10aa1/mm/vmscan.c	Sun Sep 23 22:16:22 2001
+++ vm/mm/vmscan.c	Wed Sep 26 16:34:30 2001
@@ -392,7 +384,7 @@
 			int (*writepage)(struct page *);
 
 			writepage = page->mapping->a_ops->writepage;
-			if ((gfp_mask & __GFP_FS) && writepage) {
+			if ((gfp_mask & __GFP_FS) && ((gfp_mask & __GFP_HIGHIO) || !PageHighMem(page)) && writepage) {
 				ClearPageDirty(page);
 				page_cache_get(page);
 				spin_unlock(&pagemap_lru_lock);
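
The condition added by the hunk above can be read as a small predicate.
The sketch below is a userspace model only: struct m_page, the M_ flag
names and m_noop_writepage are hypothetical stand-ins, not kernel
definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define M_GFP_HIGHIO 0x80u   /* can start high memory physical I/O */
#define M_GFP_FS     0x100u  /* can call down to the low-level FS */

struct m_page {
    bool highmem;                        /* stands in for PageHighMem(page) */
    int (*writepage)(struct m_page *);   /* stands in for a_ops->writepage */
};

/* Placeholder writepage used only to populate the model. */
static int m_noop_writepage(struct m_page *page)
{
    (void)page;
    return 0;
}

/* Mirrors the patched test: writepage is allowed only if the caller may
 * enter the FS, and a highmem page additionally requires HIGHIO. */
static bool may_call_writepage(unsigned int gfp_mask, const struct m_page *page)
{
    assert(page != NULL);
    if (!(gfp_mask & M_GFP_FS))
        return false;
    if (!(gfp_mask & M_GFP_HIGHIO) && page->highmem)
        return false;
    return page->writepage != NULL;
}
```

Before the patch only the first and last tests existed, so a caller
without HIGHIO could still start writepage on a highmem page.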


And if the above patch still doesn't help, can you just apply the patch
below to disable the NOHIGHIO logic altogether, just to make sure
we're looking in the right place?

--- 2.4.10aa1/mm/highmem.c.~1~	Sun Sep 23 21:11:43 2001
+++ 2.4.10aa1/mm/highmem.c	Wed Sep 26 16:38:34 2001
@@ -328,7 +328,7 @@
 	struct page *page;
 
 repeat_alloc:
-	page = alloc_page(GFP_NOHIGHIO);
+	page = alloc_page(GFP_NOIO);
 	if (page)
 		return page;
 	/*
@@ -366,7 +366,7 @@
 	struct buffer_head *bh;
 
 repeat_alloc:
-	bh = kmem_cache_alloc(bh_cachep, SLAB_NOHIGHIO);
+	bh = kmem_cache_alloc(bh_cachep, SLAB_NOIO);
 	if (bh)
 		return bh;
 	/*

Of course also make sure that a SYSRQ+e or SYSRQ+i doesn't relieve the
machine and allow you to kill the D tasks :).

thanks!

Andrea

--qMm9M+Fa2AknHoGS--
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Path: archiver1.google.com!newsfeed.google.com!newsfeed.stanford.edu!
news.tele.dk!small.news.tele.dk!129.240.148.23!uio.no!nntp.uio.no!
ifi.uio.no!internet-mailinglist
Newsgroups: fa.linux.kernel
Return-Path: <linux-kernel-ow...@vger.kernel.org>
Original-Date: 	Fri, 28 Sep 2001 00:13:21 +0200
From: Andrea Arcangeli <and...@suse.de>
To: Robert Macaulay <robert_macau...@dell.com>
Cc: Rik van Riel <r...@conectiva.com.br>,
        Craig Kulesa <ckul...@as.arizona.edu>, linux-ker...@vger.kernel.org,
        Bob Matthews <bmatth...@redhat.com>,
        Marcelo Tosatti <marc...@conectiva.com.br>,
        Linus Torvalds <torva...@transmeta.com>
Subject: highmem deadlock fix [was Re: VM in 2.4.10(+tweaks) vs. 2.4.9-ac14/15(+stuff)]
Original-Message-ID: <20010928001321.L14277@athlon.random>
Original-References: <20010926164935.J27...@athlon.random> 
<Pine.LNX.4.33.0109261310340.23259-100...@ping.us.dell.com> <20010926203651.Q27...@athlon.random>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20010926203651.Q27945@athlon.random>; 
from andrea@suse.de on Wed, Sep 26, 2001 at 08:36:51PM +0200
X-GnuPG-Key-URL: http://e-mind.com/~andrea/aa.gnupg.asc
X-PGP-Key-URL: http://e-mind.com/~andrea/aa.asc
Sender: linux-kernel-ow...@vger.kernel.org
Precedence: bulk
X-Mailing-List: 	linux-kernel@vger.kernel.org
Organization: Internet mailing list
Date: Thu, 27 Sep 2001 22:15:08 GMT
Message-ID: <fa.eais27v.1lgovri@ifi.uio.no>
References: <fa.ecj24fv.1ngutj7@ifi.uio.no>
Lines: 104

On Wed, Sep 26, 2001 at 08:36:51PM +0200, Andrea Arcangeli wrote:
> On Wed, Sep 26, 2001 at 01:17:29PM -0500, Robert Macaulay wrote:
> > We've tried the 2.4.10 with vmtweaks2 on our machine with 8GB RAM. It was 
> > looking good for a while, until it just stopped. Here is what was 
> > happening on the machine.
> > 
> > I was ftping files into the box at a rate of about 8MB/sec. This continued 
> > until all the RAM was in the  cache column. This was earlier in the 
> > included vmstat output. Then I started a dd if=/dev/sde of=/dev/null in a 
> > new window.
> > 
> > All was looking good until it just stopped. I captured the vmstat below. 
> > vmstat continued running for about 1 minute, then it died too. What other 
> > info can I provide?
> 
> the best/first info in this case would be sysrq+T along with the system.map.

Ok, both your trace and Bob's trace show the problem clearly. Thanks
to both for the helpful feedback, btw.

The deadlock happens because of a collision between write_some_buffers()
and the GFP_NOHIGHIO logic. The deadlock was not introduced in the vm
rewrite but it was introduced with the nohighio logic.

The problem is that we are locking a couple of buffers, and later - after
they're all locked - we start writing them via write_locked_buffers.

The deadlock happens in the middle of write_locked_buffers when we hit
a highmem buffer: while allocating with GFP_NOHIGHIO we end up doing
sync_page_buffers on any page that isn't highmem, and such a page may
incidentally hold one of the next buffers in the array, which we
previously locked in write_some_buffers but which isn't in the I/O queue
yet (so we'll wait forever, since it depends on us to be written).
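
As a worst-case userspace model of the sequence just described (not
kernel code; first_possible_hang is a hypothetical helper, and it assumes
reclaim happens to pick exactly the unlucky buffer): a batch can hang at
position i if buffer i is highmem and some later buffer in the same batch
is lowmem, because that later buffer is locked but not yet queued.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* highmem[i] says whether buffer i of the locked batch sits in highmem.
 * Returns the index of the first buffer whose submission could hang in
 * this model, or -1 if no such buffer exists. */
static int first_possible_hang(const bool *highmem, int count)
{
    assert(count >= 0);
    assert(count == 0 || highmem != NULL);

    for (int i = 0; i < count; i++) {
        if (!highmem[i])
            continue;   /* lowmem buffer: modelled as needing no allocation */

        /* Submitting a highmem buffer allocates with GFP_NOHIGHIO, and
         * reclaim may then wait on any lowmem buffer, including one we
         * locked ourselves and have not submitted yet. */
        for (int j = i + 1; j < count; j++) {
            if (!highmem[j])
                return i;
        }
    }
    return -1;
}
```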

Robert just confirmed that dropping the NOHIGHIO logic fixes the
problem.

So the fix is either:

1) to drop the NOHIGHIO logic like my test patch did
2) or to keep track of which buffers we must not wait on while releasing
   ram

I'll try approach 2) in the untested patch below (the nohighio logic makes
sense so I'd prefer not to drop it). Robert and Bob, can you give it a
spin on the highmem boxes and check if it helps?
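
The idea behind approach 2) can be sketched in userspace as a state bit
plus a three-way decision. Everything here is illustrative: the bit
numbers, the m_ helpers (non-atomic, unlike the kernel's bit operations)
and sync_decision are assumptions of this sketch, not the kernel's code.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical bit numbers; only the idea of a separate bit matters. */
enum m_bh_bits { M_BH_Dirty, M_BH_Lock, M_BH_Pending_IO };

struct m_bh { unsigned long b_state; };

/* Non-atomic stand-ins for set_bit(), clear_bit() and test_bit(). */
static void m_set_bit(int nr, unsigned long *word)
{
    assert(nr >= 0 && nr < (int)(sizeof(*word) * CHAR_BIT));
    *word |= 1UL << nr;
}

static void m_clear_bit(int nr, unsigned long *word)
{
    assert(nr >= 0 && nr < (int)(sizeof(*word) * CHAR_BIT));
    *word &= ~(1UL << nr);
}

static bool m_test_bit(int nr, const unsigned long *word)
{
    assert(nr >= 0 && nr < (int)(sizeof(*word) * CHAR_BIT));
    return ((*word >> nr) & 1UL) != 0;
}

enum m_action { M_LEAVE_ALONE, M_SKIP_PENDING, M_THROTTLE };

/* What memory reclaim would do with one buffer under approach 2). */
static enum m_action sync_decision(const struct m_bh *p)
{
    if (m_test_bit(M_BH_Pending_IO, &p->b_state))
        return M_SKIP_PENDING;  /* locked by a batch, not queued: never wait */
    if (m_test_bit(M_BH_Dirty, &p->b_state) ||
        m_test_bit(M_BH_Lock, &p->b_state))
        return M_THROTTLE;      /* write it out or wait, as before */
    return M_LEAVE_ALONE;
}
```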

I suggest testing it on top of 2.4.10+vm-tweaks-2.

--- 2.4.10aa2/fs/buffer.c.~1~	Wed Sep 26 18:45:29 2001
+++ 2.4.10aa2/fs/buffer.c	Fri Sep 28 00:04:44 2001
@@ -194,6 +194,7 @@
 		struct buffer_head * bh = *array++;
 		bh->b_end_io = end_buffer_io_sync;
 		submit_bh(WRITE, bh);
+		clear_bit(BH_Pending_IO, &bh->b_state);
 	} while (--count);
 }
 
@@ -225,6 +226,7 @@
 		if (atomic_set_buffer_clean(bh)) {
 			__refile_buffer(bh);
 			get_bh(bh);
+			set_bit(BH_Pending_IO, &bh->b_state);
 			array[count++] = bh;
 			if (count < NRSYNC)
 				continue;
@@ -2519,7 +2521,9 @@
 	int tryagain = 1;
 
 	do {
-		if (buffer_dirty(p) || buffer_locked(p)) {
+		if (unlikely(buffer_pending_IO(p)))
+			tryagain = 0;
+		else if (buffer_dirty(p) || buffer_locked(p)) {
 			if (test_and_set_bit(BH_Wait_IO, &p->b_state)) {
 				if (buffer_dirty(p)) {
 					ll_rw_block(WRITE, 1, &p);
--- 2.4.10aa2/include/linux/fs.h.~1~	Wed Sep 26 18:51:25 2001
+++ 2.4.10aa2/include/linux/fs.h	Fri Sep 28 00:01:54 2001
@@ -217,6 +217,7 @@
 	BH_New,		/* 1 if the buffer is new and not yet written out */
 	BH_Async,	/* 1 if the buffer is under end_buffer_io_async I/O */
 	BH_Wait_IO,	/* 1 if we should throttle on this buffer */
+	BH_Pending_IO,	/* 1 if the buffer is locked but not in the I/O queue yet */
 
 	BH_PrivateStart,/* not a state bit, but the first bit available
 			 * for private allocation by other entities
@@ -277,6 +278,7 @@
 #define buffer_mapped(bh)	__buffer_state(bh,Mapped)
 #define buffer_new(bh)		__buffer_state(bh,New)
 #define buffer_async(bh)	__buffer_state(bh,Async)
+#define buffer_pending_IO(bh)	__buffer_state(bh,Pending_IO)
 
 #define bh_offset(bh)		((unsigned long)(bh)->b_data & ~PAGE_MASK)
 

Thanks,
Andrea

Path: archiver1.google.com!newsfeed.google.com!newsfeed.stanford.edu!
news.tele.dk!small.news.tele.dk!129.240.148.23!uio.no!nntp.uio.no!
ifi.uio.no!internet-mailinglist
Newsgroups: fa.linux.kernel
Return-Path: <linux-kernel-ow...@vger.kernel.org>
X-Authentication-Warning: penguin.transmeta.com: torvalds owned process doing -bs
Original-Date: 	Thu, 27 Sep 2001 16:16:11 -0700 (PDT)
From: Linus Torvalds <torva...@transmeta.com>
To: Andrea Arcangeli <and...@suse.de>
cc: Robert Macaulay <robert_macau...@dell.com>,
        Rik van Riel <r...@conectiva.com.br>,
        Craig Kulesa <ckul...@as.arizona.edu>, <linux-ker...@vger.kernel.org>,
        Bob Matthews <bmatth...@redhat.com>,
        Marcelo Tosatti <marc...@conectiva.com.br>
Subject: Re: highmem deadlock fix [was Re: VM in 2.4.10(+tweaks) vs.
 2.4.9-ac14/15(+stuff)]
In-Reply-To: <20010928001321.L14277@athlon.random>
Original-Message-ID: <Pine.LNX.4.33.0109271605550.25667-100000@penguin.transmeta.com>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: linux-kernel-ow...@vger.kernel.org
Precedence: bulk
X-Mailing-List: 	linux-kernel@vger.kernel.org
Organization: Internet mailing list
Date: Thu, 27 Sep 2001 23:17:51 GMT
Message-ID: <fa.ncm16mv.fn2g8d@ifi.uio.no>
References: <fa.eais27v.1lgovri@ifi.uio.no>
Lines: 45


On Fri, 28 Sep 2001, Andrea Arcangeli wrote:
>
> The deadlock happens in the middle of write_locked_buffers when we hit
> a highmem buffer: while allocating with GFP_NOHIGHIO we end up doing
> sync_page_buffers on any page that isn't highmem, and such a page may
> incidentally hold one of the next buffers in the array, which we
> previously locked in write_some_buffers but which isn't in the I/O queue
> yet (so we'll wait forever, since it depends on us to be written).

Interesting, indeed..

However, your patch is racy:

> --- 2.4.10aa2/fs/buffer.c.~1~	Wed Sep 26 18:45:29 2001
> +++ 2.4.10aa2/fs/buffer.c	Fri Sep 28 00:04:44 2001
> @@ -194,6 +194,7 @@
>  		struct buffer_head * bh = *array++;
>  		bh->b_end_io = end_buffer_io_sync;
>  		submit_bh(WRITE, bh);
> +		clear_bit(BH_Pending_IO, &bh->b_state);

No way can we clear the bit here, because the submit_bh() may have caused
the buffer to be unlocked and IO to have completed, and it is no longer
"owned" by us - somebody else might have started IO on it and we'd be
clearing the bit for the wrong user.
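
The race can be replayed deterministically in userspace. The sketch below
fixes one interleaving of two hypothetical tasks A and B on a single
buffer; struct m_bh, mark_pending and b_mark_survives are names invented
for this model. It also replays the ordering proposed later in the thread
(clearing before submit_bh), but only to show that this one interleaving
goes away, not that the ordering is correct in general.

```c
#include <assert.h>
#include <stdbool.h>

struct m_bh {
    bool locked;
    bool pending_io;    /* stands in for BH_Pending_IO */
};

/* A task may only mark a buffer that it currently holds locked. */
static void mark_pending(struct m_bh *bh)
{
    assert(bh->locked);
    bh->pending_io = true;
}

/* Replays one fixed interleaving and reports whether B's mark survives. */
static bool b_mark_survives(bool a_clears_before_submit)
{
    struct m_bh bh = { .locked = true, .pending_io = false };

    mark_pending(&bh);              /* A: buffer joins A's batch */

    if (a_clears_before_submit)
        bh.pending_io = false;      /* A still owns the buffer here */

    /* A: submit_bh(); in this interleaving the I/O completes at once
     * and the completion handler unlocks the buffer. */
    bh.locked = false;

    /* B: locks the same buffer for its own batch and marks it. */
    bh.locked = true;
    mark_pending(&bh);

    if (!a_clears_before_submit)
        bh.pending_io = false;      /* A's late clear hits B's mark */

    return bh.pending_io;
}
```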

I would suggest a totally different approach: make the "can we wait for
existing buffer heads" condition a GFP bit the same way the HIGHIO thing
is a GFP bit, and just not set it for GFP_NOHIGHIO.

Thinking about it, I think GFP_NOIO also implies "we must not wait for
other buffers", because that could deadlock for _other_ things too, like
loop and NBD (which use NOIO to make sure that they don't recurse - but
that should also imply not waiting for themselves). The GFP_xxx approach
should fix those deadlocks too.

		Linus



Path: archiver1.google.com!newsfeed.google.com!newsfeed.stanford.edu!
news.tele.dk!small.news.tele.dk!129.240.148.23!uio.no!nntp.uio.no!
ifi.uio.no!internet-mailinglist
Newsgroups: fa.linux.kernel
Return-Path: <linux-kernel-ow...@vger.kernel.org>
X-Authentication-Warning: penguin.transmeta.com: torvalds owned process doing -bs
Original-Date: 	Thu, 27 Sep 2001 16:18:58 -0700 (PDT)
From: Linus Torvalds <torva...@transmeta.com>
To: Andrea Arcangeli <and...@suse.de>
cc: Robert Macaulay <robert_macau...@dell.com>,
        Rik van Riel <r...@conectiva.com.br>,
        Craig Kulesa <ckul...@as.arizona.edu>, <linux-ker...@vger.kernel.org>,
        Bob Matthews <bmatth...@redhat.com>,
        Marcelo Tosatti <marc...@conectiva.com.br>
Subject: Re: highmem deadlock fix [was Re: VM in 2.4.10(+tweaks) vs.
 2.4.9-ac14/15(+stuff)]
In-Reply-To: <Pine.LNX.4.33.0109271605550.25667-100000@penguin.transmeta.com>
Original-Message-ID: <Pine.LNX.4.33.0109271618120.25667-100000@penguin.transmeta.com>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: linux-kernel-ow...@vger.kernel.org
Precedence: bulk
X-Mailing-List: 	linux-kernel@vger.kernel.org
Organization: Internet mailing list
Date: Thu, 27 Sep 2001 23:20:51 GMT
Message-ID: <fa.nalr6uv.dncg00@ifi.uio.no>
References: <fa.ncm16mv.fn2g8d@ifi.uio.no>
Lines: 59


On Thu, 27 Sep 2001, Linus Torvalds wrote:
>
> Thinking about it, I think GFP_NOIO also implies "we must not wait for
> other buffers", because that could deadlock for _other_ things too, like
> loop and NBD (which use NOIO to make sure that they don't recurse - but
> that should also imply not waiting for themselves). The GFP_xxx approach
> should fix those deadlocks too.

Ie the patch would be something like the attached..

		Linus

------
diff -u --recursive --new-file v2.4.10/linux/fs/buffer.c linux/fs/buffer.c
--- v2.4.10/linux/fs/buffer.c	Wed Sep 26 11:53:42 2001
+++ linux/fs/buffer.c	Thu Sep 27 16:13:47 2001
@@ -2522,7 +2373,7 @@
 					ll_rw_block(WRITE, 1, &p);
 					tryagain = 0;
 				} else if (buffer_locked(p)) {
-					if (gfp_mask & __GFP_WAIT) {
+					if (gfp_mask & __GFP_WAITBUF) {
 						wait_on_buffer(p);
 						tryagain = 1;
 					} else
diff -u --recursive --new-file v2.4.10/linux/include/linux/mm.h linux/include/linux/mm.h
--- v2.4.10/linux/include/linux/mm.h	Sun Sep 23 11:41:01 2001
+++ linux/include/linux/mm.h	Thu Sep 27 16:13:35 2001
@@ -550,16 +550,17 @@
 #define __GFP_IO	0x40	/* Can start low memory physical IO? */
 #define __GFP_HIGHIO	0x80	/* Can start high mem physical IO? */
 #define __GFP_FS	0x100	/* Can call down to low-level FS? */
+#define __GFP_WAITBUF	0x200	/* Can we wait for buffers to complete? */

 #define GFP_NOHIGHIO	(__GFP_HIGH | __GFP_WAIT | __GFP_IO)
 #define GFP_NOIO	(__GFP_HIGH | __GFP_WAIT)
-#define GFP_NOFS	(__GFP_HIGH | __GFP_WAIT | __GFP_IO | __GFP_HIGHIO)
+#define GFP_NOFS	(__GFP_HIGH | __GFP_WAIT | __GFP_IO | __GFP_HIGHIO | __GFP_WAITBUF)
 #define GFP_ATOMIC	(__GFP_HIGH)
-#define GFP_USER	(             __GFP_WAIT | __GFP_IO | __GFP_HIGHIO | __GFP_FS)
-#define GFP_HIGHUSER	(             __GFP_WAIT | __GFP_IO | __GFP_HIGHIO | __GFP_FS | __GFP_HIGHMEM)
-#define GFP_KERNEL	(__GFP_HIGH | __GFP_WAIT | __GFP_IO | __GFP_HIGHIO | __GFP_FS)
-#define GFP_NFS		(__GFP_HIGH | __GFP_WAIT | __GFP_IO | __GFP_HIGHIO | __GFP_FS)
-#define GFP_KSWAPD	(             __GFP_WAIT | __GFP_IO | __GFP_HIGHIO | __GFP_FS)
+#define GFP_USER	(             __GFP_WAIT | __GFP_IO | __GFP_HIGHIO | __GFP_WAITBUF | __GFP_FS)
+#define GFP_HIGHUSER	(             __GFP_WAIT | __GFP_IO | __GFP_HIGHIO | __GFP_WAITBUF | __GFP_FS | __GFP_HIGHMEM)
+#define GFP_KERNEL	(__GFP_HIGH | __GFP_WAIT | __GFP_IO | __GFP_HIGHIO | __GFP_WAITBUF | __GFP_FS)
+#define GFP_NFS		(__GFP_HIGH | __GFP_WAIT | __GFP_IO | __GFP_HIGHIO | __GFP_WAITBUF | __GFP_FS)
+#define GFP_KSWAPD	(             __GFP_WAIT | __GFP_IO | __GFP_HIGHIO | __GFP_WAITBUF | __GFP_FS)

 /* Flag - indicates that the buffer will be suitable for DMA.  Ignored on some
    platforms, used as appropriate on others */
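
To see which callers stop waiting on locked buffers under this patch, the
masks can be modelled in userspace. The M_ names are stand-ins for the
kernel macros; the values of M_GFP_WAIT and M_GFP_HIGH are assumptions,
since the hunk does not show them.

```c
#include <assert.h>
#include <stdbool.h>

#define M_GFP_WAIT    0x10u   /* assumed value, not shown in the hunk */
#define M_GFP_HIGH    0x20u   /* assumed value, not shown in the hunk */
#define M_GFP_IO      0x40u
#define M_GFP_HIGHIO  0x80u
#define M_GFP_FS      0x100u
#define M_GFP_WAITBUF 0x200u

#define M_GFP_NOHIGHIO (M_GFP_HIGH | M_GFP_WAIT | M_GFP_IO)
#define M_GFP_NOIO     (M_GFP_HIGH | M_GFP_WAIT)
#define M_GFP_NOFS     (M_GFP_HIGH | M_GFP_WAIT | M_GFP_IO | M_GFP_HIGHIO | \
                        M_GFP_WAITBUF)
#define M_GFP_KERNEL   (M_GFP_NOFS | M_GFP_FS)

/* The new bit must not collide with any of the bits modelled here. */
static_assert((M_GFP_WAITBUF & (M_GFP_WAIT | M_GFP_HIGH | M_GFP_IO |
                                M_GFP_HIGHIO | M_GFP_FS)) == 0,
              "WAITBUF overlaps an existing flag");

/* Before the patch: any mask that may sleep waits on a locked buffer. */
static bool waits_on_locked_buffer_old(unsigned int gfp_mask)
{
    return (gfp_mask & M_GFP_WAIT) != 0;
}

/* After the patch: only masks carrying the new bit do. */
static bool waits_on_locked_buffer_new(unsigned int gfp_mask)
{
    return (gfp_mask & M_GFP_WAITBUF) != 0;
}
```

In this model GFP_NOHIGHIO and GFP_NOIO both waited before the change and
no longer wait after it, while GFP_NOFS and GFP_KERNEL keep waiting.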


Path: archiver1.google.com!newsfeed.google.com!newsfeed.stanford.edu!
news.tele.dk!small.news.tele.dk!129.240.148.23!uio.no!nntp.uio.no!
ifi.uio.no!internet-mailinglist
Newsgroups: fa.linux.kernel
Return-Path: <linux-kernel-ow...@vger.kernel.org>
Original-Date: 	Fri, 28 Sep 2001 01:37:30 +0200
From: Andrea Arcangeli <and...@suse.de>
To: Linus Torvalds <torva...@transmeta.com>
Cc: Robert Macaulay <robert_macau...@dell.com>,
        Rik van Riel <r...@conectiva.com.br>,
        Craig Kulesa <ckul...@as.arizona.edu>, linux-ker...@vger.kernel.org,
        Bob Matthews <bmatth...@redhat.com>,
        Marcelo Tosatti <marc...@conectiva.com.br>
Subject: Re: highmem deadlock fix [was Re: VM in 2.4.10(+tweaks) vs. 2.4.9-ac14/15(+stuff)]
Original-Message-ID: <20010928013730.Y14277@athlon.random>
Original-References: 
<Pine.LNX.4.33.0109271605550.25667-100...@penguin.transmeta.com> 
<Pine.LNX.4.33.0109271618120.25667-100...@penguin.transmeta.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <Pine.LNX.4.33.0109271618120.25667-100000@penguin.transmeta.com>; 
from torvalds@transmeta.com on Thu, Sep 27, 2001 at 04:18:58PM -0700
X-GnuPG-Key-URL: http://e-mind.com/~andrea/aa.gnupg.asc
X-PGP-Key-URL: http://e-mind.com/~andrea/aa.asc
Sender: linux-kernel-ow...@vger.kernel.org
Precedence: bulk
X-Mailing-List: 	linux-kernel@vger.kernel.org
Organization: Internet mailing list
Date: Thu, 27 Sep 2001 23:39:25 GMT
Message-ID: <fa.eb2s2nv.1l0ovb3@ifi.uio.no>
References: <fa.nalr6uv.dncg00@ifi.uio.no>
Lines: 21

On Thu, Sep 27, 2001 at 04:18:58PM -0700, Linus Torvalds wrote:
> 
> On Thu, 27 Sep 2001, Linus Torvalds wrote:
> >
> > Thinking about it, I think GFP_NOIO also implies "we must not wait for
> > other buffers", because that could deadlock for _other_ things too, like
> > loop and NBD (which use NOIO to make sure that they don't recurse - but
> > that should also imply not waiting for themselves). The GFP_xxx approach
> > should fix those deadlocks too.
> 
> Ie the patch would be something like the attached..

Well, this approach is much less fine-grained... but yes, it would fix the
deadlock.

Andrea

Path: archiver1.google.com!newsfeed.google.com!newsfeed.stanford.edu!
news.tele.dk!small.news.tele.dk!148.122.208.68!news2.oke.nextra.no!
nextra.com!uninett.no!uio.no!nntp.uio.no!ifi.uio.no!internet-mailinglist
Newsgroups: fa.linux.kernel
Return-Path: <linux-kernel-ow...@vger.kernel.org>
Original-Date: 	Fri, 28 Sep 2001 01:47:20 +0200
From: Andrea Arcangeli <and...@suse.de>
To: Linus Torvalds <torva...@transmeta.com>
Cc: Robert Macaulay <robert_macau...@dell.com>,
        Rik van Riel <r...@conectiva.com.br>,
        Craig Kulesa <ckul...@as.arizona.edu>, linux-ker...@vger.kernel.org,
        Bob Matthews <bmatth...@redhat.com>,
        Marcelo Tosatti <marc...@conectiva.com.br>
Subject: Re: highmem deadlock fix [was Re: VM in 2.4.10(+tweaks) vs. 2.4.9-ac14/15(+stuff)]
Original-Message-ID: <20010928014720.Z14277@athlon.random>
Original-References: <20010928001321.L14...@athlon.random> 
<Pine.LNX.4.33.0109271605550.25667-100...@penguin.transmeta.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <Pine.LNX.4.33.0109271605550.25667-100000@penguin.transmeta.com>; 
from torvalds@transmeta.com on Thu, Sep 27, 2001 at 04:16:11PM -0700
X-GnuPG-Key-URL: http://e-mind.com/~andrea/aa.gnupg.asc
X-PGP-Key-URL: http://e-mind.com/~andrea/aa.asc
Sender: linux-kernel-ow...@vger.kernel.org
Precedence: bulk
X-Mailing-List: 	linux-kernel@vger.kernel.org
Organization: Internet mailing list
Date: Thu, 27 Sep 2001 23:48:55 GMT
Message-ID: <fa.eais2vv.1lgouj0@ifi.uio.no>
References: <fa.ncm16mv.fn2g8d@ifi.uio.no>
Lines: 34

On Thu, Sep 27, 2001 at 04:16:11PM -0700, Linus Torvalds wrote:
> 
> On Fri, 28 Sep 2001, Andrea Arcangeli wrote:
> However, your patch is racy:
> 
> > --- 2.4.10aa2/fs/buffer.c.~1~	Wed Sep 26 18:45:29 2001
> > +++ 2.4.10aa2/fs/buffer.c	Fri Sep 28 00:04:44 2001
> > @@ -194,6 +194,7 @@
> >  		struct buffer_head * bh = *array++;
> >  		bh->b_end_io = end_buffer_io_sync;
> >  		submit_bh(WRITE, bh);
> > +		clear_bit(BH_Pending_IO, &bh->b_state);
> 
> No way can we clear the bit here, because the submit_bh() may have caused
> the buffer to be unlocked and IO to have completed, and it is no longer
> "owned" by us - somebody else might have started IO on it and we'd be
> clearing the bit for the wrong user.

Moving clear_bit just above submit_bh will fix it (Robert, please make
this change before testing it), because if we block in submit_bh in the
bounce, then we won't deadlock on ourselves because of the PageHighMem
check, and all previous non-pending bh are ok too (only the next ones are
problematic, and they're still marked pending_IO so we can't deadlock on
them).

So you can reconsider my approach: the design of the fix was ok, it was
just a silly implementation error.

Andrea

Path: archiver1.google.com!newsfeed.google.com!sn-xit-02!supernews.com!
news.tele.dk!small.news.tele.dk!129.240.148.23!uio.no!nntp.uio.no!
ifi.uio.no!internet-mailinglist
Newsgroups: fa.linux.kernel
Return-Path: <linux-kernel-ow...@vger.kernel.org>
X-Authentication-Warning: penguin.transmeta.com: torvalds owned process doing -bs
Original-Date: 	Thu, 27 Sep 2001 17:03:49 -0700 (PDT)
From: Linus Torvalds <torva...@transmeta.com>
To: Andrea Arcangeli <and...@suse.de>
cc: Robert Macaulay <robert_macau...@dell.com>,
        Rik van Riel <r...@conectiva.com.br>,
        Craig Kulesa <ckul...@as.arizona.edu>, <linux-ker...@vger.kernel.org>,
        Bob Matthews <bmatth...@redhat.com>,
        Marcelo Tosatti <marc...@conectiva.com.br>
Subject: Re: highmem deadlock fix [was Re: VM in 2.4.10(+tweaks) vs.
 2.4.9-ac14/15(+stuff)]
In-Reply-To: <20010928014720.Z14277@athlon.random>
Original-Message-ID: <Pine.LNX.4.33.0109271700001.32086-100000@penguin.transmeta.com>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: linux-kernel-ow...@vger.kernel.org
Precedence: bulk
X-Mailing-List: 	linux-kernel@vger.kernel.org
Organization: Internet mailing list
Date: Fri, 28 Sep 2001 00:05:53 GMT
Message-ID: <fa.na5j5dv.d74hg6@ifi.uio.no>
References: <fa.eais2vv.1lgouj0@ifi.uio.no>
Lines: 27


On Fri, 28 Sep 2001, Andrea Arcangeli wrote:
>
> Moving clear_bit just above submit_bh will fix it (Robert, please make
> this change before testing it), because if we block in submit_bh in the
> bounce, then we won't deadlock on ourselves because of the PageHighMem
> check

We won't block on _ourselves_, but we can block on _two_ people doing it,
and blocking on each others requests that are blocked waiting on a bounce
buffer. Both will have one locked buffer, both will be waiting for the
other person unlocking that buffer, and neither will ever make progress.
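
The two-party scenario described here is a cycle in a wait-for graph. A
minimal userspace sketch follows; has_wait_cycle and M_NO_WAIT are
hypothetical names, and the model assumes each task waits on at most one
other task. Whether such a cycle can actually arise in the real code is
what the rest of the thread debates.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define M_NO_WAIT (-1)

/* waits_for[i] is the task that task i is blocked on, or M_NO_WAIT.
 * Returns true if some task ends up waiting, directly or through a
 * chain of other tasks, on itself. */
static bool has_wait_cycle(const int *waits_for, int ntasks)
{
    assert(ntasks >= 0);
    assert(ntasks == 0 || waits_for != NULL);

    for (int start = 0; start < ntasks; start++) {
        int cur = start;

        /* A chain of ntasks hops either ends or has closed any cycle
         * that passes through 'start'. */
        for (int steps = 0; steps < ntasks; steps++) {
            cur = waits_for[cur];
            if (cur == M_NO_WAIT)
                break;
            assert(cur >= 0 && cur < ntasks);
            if (cur == start)
                return true;
        }
    }
    return false;
}
```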

You could clear that bit _after_ the bounce buffer allocation, I suspect.

But I also suspect that it doesn't matter much, and as I can imagine
similar problems with GFP_NOIO and loopback etc (do you see any reason why
loopback couldn't deadlock on waiting for itself?), I think the GFP_XXX
thing is the proper fix.

		Linus


Path: archiver1.google.com!newsfeed.google.com!sn-xit-02!supernews.com!
news.tele.dk!small.news.tele.dk!129.240.148.23!uio.no!nntp.uio.no!
ifi.uio.no!internet-mailinglist
Newsgroups: fa.linux.kernel
Return-Path: <linux-kernel-ow...@vger.kernel.org>
Original-Date: 	Fri, 28 Sep 2001 02:08:10 +0200
From: Andrea Arcangeli <and...@suse.de>
To: Linus Torvalds <torva...@transmeta.com>
Cc: Robert Macaulay <robert_macau...@dell.com>,
        Rik van Riel <r...@conectiva.com.br>,
        Craig Kulesa <ckul...@as.arizona.edu>, linux-ker...@vger.kernel.org,
        Bob Matthews <bmatth...@redhat.com>,
        Marcelo Tosatti <marc...@conectiva.com.br>
Subject: Re: highmem deadlock fix [was Re: VM in 2.4.10(+tweaks) vs. 2.4.9-ac14/15(+stuff)]
Original-Message-ID: <20010928020810.C14277@athlon.random>
Original-References: <20010928001321.L14...@athlon.random> 
<Pine.LNX.4.33.0109271605550.25667-100...@penguin.transmeta.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <Pine.LNX.4.33.0109271605550.25667-100000@penguin.transmeta.com>; 
from torvalds@transmeta.com on Thu, Sep 27, 2001 at 04:16:11PM -0700
X-GnuPG-Key-URL: http://e-mind.com/~andrea/aa.gnupg.asc
X-PGP-Key-URL: http://e-mind.com/~andrea/aa.asc
Sender: linux-kernel-ow...@vger.kernel.org
Precedence: bulk
X-Mailing-List: 	linux-kernel@vger.kernel.org
Organization: Internet mailing list
Date: Fri, 28 Sep 2001 00:09:50 GMT
Message-ID: <fa.ea2u1vv.1k0uvjm@ifi.uio.no>
References: <fa.ncm16mv.fn2g8d@ifi.uio.no>
Lines: 27

On Thu, Sep 27, 2001 at 04:16:11PM -0700, Linus Torvalds wrote:
> Thinking about it, I think GFP_NOIO also implies "we must not wait for
> other buffers", because that could deadlock for _other_ things too, like
> loop and NBD (which use NOIO to make sure that they don't recurse - but
> that should also imply not waiting for themselves). The GFP_xxx approach
> should fix those deadlocks too.

I don't understand your point about GFP_NOIO very well. GFP_NOIO is a
no-brainer: loop/NBD etc. are all safe, since GFP_NOIO forbids arriving
in sync_page_buffers in the first place.

The only tricky case is GFP_NOHIGHIO, which can arrive there on lowmem
pages, since the PageHighMem logic only protects it against itself among
all the callers. So only callers that lock down highmem and then lowmem
buffers, and then start the I/O on the highmem ones, are subject to the
highmem deadlock. The only place that does this seems to be
write_some_buffers. I could agree if you're worried that other places
do it too, but if they do, we could teach them to use the pending_IO
information as well, so we could be more fine-grained with my approach.
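
The claim about which masks can reach sync_page_buffers can be written
down as a predicate. This is a userspace sketch of the argument as stated
in this message; the exact condition used by the kernel is an assumption
here, as are the M_ names and the values of M_GFP_WAIT and M_GFP_HIGH.

```c
#include <assert.h>
#include <stdbool.h>

#define M_GFP_WAIT   0x10u   /* assumed value */
#define M_GFP_HIGH   0x20u   /* assumed value */
#define M_GFP_IO     0x40u
#define M_GFP_HIGHIO 0x80u
#define M_GFP_FS     0x100u

#define M_GFP_NOIO     (M_GFP_HIGH | M_GFP_WAIT)
#define M_GFP_NOHIGHIO (M_GFP_HIGH | M_GFP_WAIT | M_GFP_IO)
#define M_GFP_KERNEL   (M_GFP_HIGH | M_GFP_WAIT | M_GFP_IO | M_GFP_HIGHIO | \
                        M_GFP_FS)

static_assert((M_GFP_NOHIGHIO & M_GFP_HIGHIO) == 0,
              "GFP_NOHIGHIO must not allow high memory I/O");

/* Assumed gate: no I/O bit means never; without HIGHIO only lowmem
 * pages are eligible. */
static bool may_reach_sync_page_buffers(unsigned int gfp_mask,
                                        bool page_highmem)
{
    if (!(gfp_mask & M_GFP_IO))
        return false;
    return (gfp_mask & M_GFP_HIGHIO) != 0 || !page_highmem;
}
```

Under this reading GFP_NOIO never gets there, and GFP_NOHIGHIO gets there
only for lowmem pages, which is the case the deadlock needs.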

Andrea

Path: archiver1.google.com!newsfeed.google.com!sn-xit-02!supernews.com!
news.tele.dk!small.news.tele.dk!129.240.148.23!uio.no!nntp.uio.no!
ifi.uio.no!internet-mailinglist
Newsgroups: fa.linux.kernel
Return-Path: <linux-kernel-ow...@vger.kernel.org>
Original-Date: 	Fri, 28 Sep 2001 02:11:15 +0200
From: Andrea Arcangeli <and...@suse.de>
To: Linus Torvalds <torva...@transmeta.com>
Cc: Robert Macaulay <robert_macau...@dell.com>,
        Rik van Riel <r...@conectiva.com.br>,
        Craig Kulesa <ckul...@as.arizona.edu>, linux-ker...@vger.kernel.org,
        Bob Matthews <bmatth...@redhat.com>,
        Marcelo Tosatti <marc...@conectiva.com.br>
Subject: Re: highmem deadlock fix [was Re: VM in 2.4.10(+tweaks) vs. 2.4.9-ac14/15(+stuff)]
Original-Message-ID: <20010928021115.D14277@athlon.random>
Original-References: <20010928014720.Z14...@athlon.random> 
<Pine.LNX.4.33.0109271700001.32086-100...@penguin.transmeta.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <Pine.LNX.4.33.0109271700001.32086-100000@penguin.transmeta.com>; 
from torvalds@transmeta.com on Thu, Sep 27, 2001 at 05:03:49PM -0700
X-GnuPG-Key-URL: http://e-mind.com/~andrea/aa.gnupg.asc
X-PGP-Key-URL: http://e-mind.com/~andrea/aa.asc
Sender: linux-kernel-ow...@vger.kernel.org
Precedence: bulk
X-Mailing-List: 	linux-kernel@vger.kernel.org
Organization: Internet mailing list
Date: Fri, 28 Sep 2001 00:12:37 GMT
Message-ID: <fa.ea3826v.1k0kvro@ifi.uio.no>
References: <fa.na5j5dv.d74hg6@ifi.uio.no>
Lines: 38

On Thu, Sep 27, 2001 at 05:03:49PM -0700, Linus Torvalds wrote:
> 
> On Fri, 28 Sep 2001, Andrea Arcangeli wrote:
> >
> > Moving clear_bit just above submit_bh will fix it (Robert, please make
> > this change before testing it), because if we block in submit_bh in the
> > bounce, then we won't deadlock on ourselves because of the PageHighMem
> > check
> 
> We won't block on _ourselves_, but we can block on _two_ people doing it,

If other people wait for us it's ok (if they wait it means they're not
using GFP_NOIO and they're also not using GFP_NOHIGHIO).

We cannot wait on two other people doing it, since those would be highmem
pages and the PageHighMem check forbids that.

> and blocking on each others requests that are blocked waiting on a bounce
> buffer. Both will have one locked buffer, both will be waiting for the
> other person unlocking that buffer, and neither will ever make progress.
> 
> You could clear that bit _after_ the bounce buffer allocation, I suspect.

I don't think it's necessary.

> But I also suspect that it doesn't matter much, and as I can imagine
> similar problems with GFP_NOIO and loopback etc (do you see any reason why
> loopback couldn't deadlock on waiting for itself?), I think the GFP_XXX
> thing is the proper fix.

GFP_NOIO is a no-brainer, it cannot go wrong; see the other email.

Andrea

Path: archiver1.google.com!newsfeed.google.com!newsfeed.stanford.edu!
news.tele.dk!small.news.tele.dk!129.240.148.23!uio.no!news-feed.ifi.uio.no!
ifi.uio.no!internet-mailinglist
Newsgroups: fa.linux.kernel
Return-Path: <linux-kernel-ow...@vger.kernel.org>
Original-Date: 	Thu, 27 Sep 2001 20:51:42 -0300 (BRST)
From: Rik van Riel <r...@conectiva.com.br>
X-X-Sender:  <r...@imladris.rielhome.conectiva>
To: Andrea Arcangeli <and...@suse.de>
Cc: Linus Torvalds <torva...@transmeta.com>,
        Robert Macaulay <robert_macau...@dell.com>,
        Craig Kulesa <ckul...@as.arizona.edu>, <linux-ker...@vger.kernel.org>,
        Bob Matthews <bmatth...@redhat.com>,
        Marcelo Tosatti <marc...@conectiva.com.br>
Subject: Re: highmem deadlock fix [was Re: VM in 2.4.10(+tweaks) vs.
 2.4.9-ac14/15(+stuff)]
In-Reply-To: <20010928013730.Y14277@athlon.random>
Original-Message-ID: <Pine.LNX.4.33L.0109272050570.19147-100000@imladris.rielhome.conectiva>
X-spambait: aardv...@kernelnewbies.org
X-spammeplease: 	aardv...@nl.linux.org
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: linux-kernel-ow...@vger.kernel.org
Precedence: bulk
X-Mailing-List: 	linux-kernel@vger.kernel.org
Organization: Internet mailing list
Date: Fri, 28 Sep 2001 01:23:51 GMT
Message-ID: <fa.q3qkolv.1vlkm00@ifi.uio.no>
References: <fa.eb2s2nv.1l0ovb3@ifi.uio.no>
Lines: 24

On Fri, 28 Sep 2001, Andrea Arcangeli wrote:

> Well, this approach is much less fine-grained...

I'd consider that a feature. Undocumented subtle stuff
tends to break within 6 months, sometimes even due to
changes made by the same person who did the original
subtle trick.

cheers,

Rik
-- 
IA64: a worthy successor to i860.

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardv...@nl.linux.org (spam digging piggy)


Path: archiver1.google.com!newsfeed.google.com!newsfeed.stanford.edu!
news.tele.dk!small.news.tele.dk!129.240.148.23!uio.no!nntp.uio.no!
ifi.uio.no!internet-mailinglist
Newsgroups: fa.linux.kernel
Return-Path: <linux-kernel-ow...@vger.kernel.org>
Original-Date: 	Fri, 28 Sep 2001 03:26:55 +0200
From: Andrea Arcangeli <and...@suse.de>
To: Rik van Riel <r...@conectiva.com.br>
Cc: Linus Torvalds <torva...@transmeta.com>,
        Robert Macaulay <robert_macau...@dell.com>,
        Craig Kulesa <ckul...@as.arizona.edu>, linux-ker...@vger.kernel.org,
        Bob Matthews <bmatth...@redhat.com>,
        Marcelo Tosatti <marc...@conectiva.com.br>
Subject: Re: highmem deadlock fix [was Re: VM in 2.4.10(+tweaks) vs. 2.4.9-ac14/15(+stuff)]
Original-Message-ID: <20010928032655.H14277@athlon.random>
Original-References: <20010928013730.Y14...@athlon.random> 
<Pine.LNX.4.33L.0109272050570.19147-100...@imladris.rielhome.conectiva>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <Pine.LNX.4.33L.0109272050570.19147-100000@imladris.rielhome.conectiva>; 
from riel@conectiva.com.br on Thu, Sep 27, 2001 at 08:51:42PM -0300
X-GnuPG-Key-URL: http://e-mind.com/~andrea/aa.gnupg.asc
X-PGP-Key-URL: http://e-mind.com/~andrea/aa.asc
Sender: linux-kernel-ow...@vger.kernel.org
Precedence: bulk
X-Mailing-List: 	linux-kernel@vger.kernel.org
Organization: Internet mailing list
Date: Fri, 28 Sep 2001 01:28:34 GMT
Message-ID: <fa.ec3a2fv.1m0mv3j@ifi.uio.no>
References: <fa.q3qkolv.1vlkm00@ifi.uio.no>
Lines: 18

On Thu, Sep 27, 2001 at 08:51:42PM -0300, Rik van Riel wrote:
> On Fri, 28 Sep 2001, Andrea Arcangeli wrote:
> 
> > well this approach is much less fine-grained...
> 
> I'd consider that a feature. Undocumented subtle stuff
> tends to break within 6 months, sometimes even due to
> changes made by the same person who did the original
> subtle trick.

by the same argument we could drop the NOHIGHIO logic in the first place.

Andrea

Path: archiver1.google.com!newsfeed.google.com!newsfeed.stanford.edu!
news.tele.dk!small.news.tele.dk!129.240.148.23!uio.no!nntp.uio.no!
ifi.uio.no!internet-mailinglist
Newsgroups: fa.linux.kernel
Return-Path: <linux-kernel-ow...@vger.kernel.org>
X-Authentication-Warning: penguin.transmeta.com: torvalds owned process doing -bs
Original-Date: 	Thu, 27 Sep 2001 18:28:48 -0700 (PDT)
From: Linus Torvalds <torva...@transmeta.com>
To: Rik van Riel <r...@conectiva.com.br>
cc: Andrea Arcangeli <and...@suse.de>,
        Robert Macaulay <robert_macau...@dell.com>,
        Craig Kulesa <ckul...@as.arizona.edu>, <linux-ker...@vger.kernel.org>,
        Bob Matthews <bmatth...@redhat.com>,
        Marcelo Tosatti <marc...@conectiva.com.br>
Subject: Re: highmem deadlock fix [was Re: VM in 2.4.10(+tweaks) vs.
 2.4.9-ac14/15(+stuff)]
In-Reply-To: <Pine.LNX.4.33L.0109272050570.19147-100000@imladris.rielhome.conectiva>
Original-Message-ID: <Pine.LNX.4.33.0109271827001.3101-100000@penguin.transmeta.com>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: linux-kernel-ow...@vger.kernel.org
Precedence: bulk
X-Mailing-List: 	linux-kernel@vger.kernel.org
Organization: Internet mailing list
Date: Fri, 28 Sep 2001 01:30:42 GMT
Message-ID: <fa.o9q3d7v.g485rd@ifi.uio.no>
References: <fa.q3qkolv.1vlkm00@ifi.uio.no>
Lines: 15


Note that if you do end up applying my suggested patch for testing, you
also need to add __GFP_WAITBUF to SLAB_LEVEL_MASK in <linux/slab.h>,
otherwise the slab allocator will be really unhappy the first time it sees
any normal allocation.

(I.e. very early at boot.)

		Linus
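
[Editor's sketch, not part of the original message.] The note above can be
illustrated with a small user-space model, assuming the slab allocator
rejects any allocation whose flags contain a bit outside its accepted mask.
All DEMO_* names and bit values below are hypothetical and chosen only for
illustration; the real definitions live in the kernel headers and in the
patch under discussion.

```c
/* Hypothetical flag bits; NOT the kernel's actual values. */
#define DEMO_GFP_WAIT     0x01u
#define DEMO_GFP_HIGH     0x02u
#define DEMO_GFP_IO       0x04u
#define DEMO_GFP_WAITBUF  0x08u  /* stands in for the flag the patch adds */

/* Mask of flag bits the allocator accepts before the header is updated. */
#define DEMO_LEVEL_MASK_OLD (DEMO_GFP_WAIT | DEMO_GFP_HIGH | DEMO_GFP_IO)

/* Mask after the new flag has been added to it. */
#define DEMO_LEVEL_MASK_NEW (DEMO_LEVEL_MASK_OLD | DEMO_GFP_WAITBUF)

/*
 * Returns 1 if every bit set in 'flags' is covered by 'mask', 0 otherwise.
 * An allocator that treats an uncovered bit as a fatal error would stop
 * at the first allocation for which this returns 0.
 */
static int flags_accepted(unsigned int flags, unsigned int mask)
{
    return (flags & ~mask) == 0;
}
```

If normal allocations start carrying the new bit, a call such as
flags_accepted(DEMO_GFP_WAIT | DEMO_GFP_WAITBUF, DEMO_LEVEL_MASK_OLD)
returns 0, while the same flags checked against DEMO_LEVEL_MASK_NEW are
accepted. That would be consistent with a failure on the first ordinary
allocation at boot.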


Path: archiver1.google.com!newsfeed.google.com!newsfeed.stanford.edu!
news.tele.dk!small.news.tele.dk!129.240.148.23!uio.no!nntp.uio.no!
ifi.uio.no!internet-mailinglist
Newsgroups: fa.linux.kernel
Return-Path: <linux-kernel-ow...@vger.kernel.org>
X-Authentication-Warning: ping.us.dell.com: robert owned process doing -bs
Original-Date: 	Thu, 27 Sep 2001 21:12:25 -0500 (CDT)
From: Robert Macaulay <robert_macau...@dell.com>
X-X-Sender:  <rob...@ping.us.dell.com>
Reply-To: Robert Macaulay <robert_macau...@dell.com>
To: Andrea Arcangeli <and...@suse.de>
cc: Linus Torvalds <torva...@transmeta.com>,
        Rik van Riel <r...@conectiva.com.br>,
        Craig Kulesa <ckul...@as.arizona.edu>, <linux-ker...@vger.kernel.org>,
        Bob Matthews <bmatth...@redhat.com>,
        Marcelo Tosatti <marc...@conectiva.com.br>
Subject: Re: highmem deadlock fix [was Re: VM in 2.4.10(+tweaks) vs.
 2.4.9-ac14/15(+stuff)]
In-Reply-To: <20010928014720.Z14277@athlon.random>
Original-Message-ID: <Pine.LNX.4.33.0109272108400.29056-100000@ping.us.dell.com>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: linux-kernel-ow...@vger.kernel.org
Precedence: bulk
X-Mailing-List: 	linux-kernel@vger.kernel.org
Organization: Internet mailing list
Date: Fri, 28 Sep 2001 02:14:56 GMT
Message-ID: <fa.kpv6ktv.tn0q1k@ifi.uio.no>
References: <fa.eais2vv.1lgouj0@ifi.uio.no>
Lines: 24

On Thu, 27 Sep 2001, Andrea Arcangeli wrote:

> 
> Moving clear_bit just above submit_bh will fix it (please Robert make
> this change before testing it), because if we block in submit_bh in the
> bounce, then we won't deadlock on ourself because of the pagehighmem
> check, and all previous non-pending bh are ok too, (only the next are
> problematic, and they're still marked pending_IO so we can't deadlock on
> them).
> 
It worked. The box did not freeze.

We can try Linus' patch as well if needed. I had actually applied 
it and rebooted before the warning, and as predicted, it froze very 
early in the boot process.

Thanks Andrea. I'll see if we can repeat the 0-page alloc again.
Robert


Path: archiver1.google.com!newsfeed.google.com!newsfeed.stanford.edu!
news.tele.dk!small.news.tele.dk!129.240.148.23!uio.no!nntp.uio.no!
ifi.uio.no!internet-mailinglist
Newsgroups: fa.linux.kernel
Return-Path: <linux-kernel-ow...@vger.kernel.org>
Original-Date: 	Fri, 28 Sep 2001 04:24:17 +0200
From: Andrea Arcangeli <and...@suse.de>
To: Robert Macaulay <robert_macau...@dell.com>
Cc: Linus Torvalds <torva...@transmeta.com>,
        Rik van Riel <r...@conectiva.com.br>,
        Craig Kulesa <ckul...@as.arizona.edu>, linux-ker...@vger.kernel.org,
        Bob Matthews <bmatth...@redhat.com>,
        Marcelo Tosatti <marc...@conectiva.com.br>
Subject: Re: highmem deadlock fix [was Re: VM in 2.4.10(+tweaks) vs. 2.4.9-ac14/15(+stuff)]
Original-Message-ID: <20010928042417.J14277@athlon.random>
Original-References: <20010928014720.Z14...@athlon.random> 
<Pine.LNX.4.33.0109272108400.29056-100...@ping.us.dell.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <Pine.LNX.4.33.0109272108400.29056-100000@ping.us.dell.com>; 
from robert_macaulay@dell.com on Thu, Sep 27, 2001 at 09:12:25PM -0500
X-GnuPG-Key-URL: http://e-mind.com/~andrea/aa.gnupg.asc
X-PGP-Key-URL: http://e-mind.com/~andrea/aa.asc
Sender: linux-kernel-ow...@vger.kernel.org
Precedence: bulk
X-Mailing-List: 	linux-kernel@vger.kernel.org
Organization: Internet mailing list
Date: Fri, 28 Sep 2001 02:26:02 GMT
Message-ID: <fa.ea3g2fv.1k0sv3j@ifi.uio.no>
References: <fa.kpv6ktv.tn0q1k@ifi.uio.no>
Lines: 19

On Thu, Sep 27, 2001 at 09:12:25PM -0500, Robert Macaulay wrote:
> Thanks Andrea. I'll see if we can repeat the 0-page alloc again.

Ok, it is possible the 0-page alloc failed because NOHIGHIO was
disabled. Linus's fix, being less fine-grained than mine, could also lead
more easily to 0-page alloc failures.

However, a failing bounce allocation is not important, since we have the
reserved pool for those allocations. Not having to use the reserved
pool only allows a higher amount of I/O in parallel. This is why I said
we could have dropped the NOHIGHIO logic in the first place if we wanted to
go the non-fine-grained way.

Andrea
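
[Editor's sketch, not part of the original message.] The reserved-pool
argument above can be modelled in user space as follows. This is a minimal
sketch under stated assumptions, not the kernel's bounce-buffer code: the
demo_* names, the pool size, and the simulate_failure switch are all
hypothetical. It shows only the shape of the idea: try the normal
allocator first and fall back to a small fixed reserve when that fails.

```c
#include <stddef.h>
#include <stdlib.h>

#define DEMO_POOL_SLOTS 4      /* hypothetical size of the reserve */
#define DEMO_BUF_SIZE   4096   /* one page-sized bounce buffer */

/* Fixed reserve of buffers set aside for when normal allocation fails. */
struct demo_pool {
    unsigned char slots[DEMO_POOL_SLOTS][DEMO_BUF_SIZE];
    int in_use[DEMO_POOL_SLOTS];
};

/* A buffer plus a note of where it came from, so it can be returned. */
struct demo_buf {
    void *mem;
    int from_reserve;
};

/* Number of reserve slots currently free. */
static int demo_pool_free_slots(const struct demo_pool *p)
{
    int n = 0;
    for (int i = 0; i < DEMO_POOL_SLOTS; i++)
        if (!p->in_use[i])
            n++;
    return n;
}

/* Try the normal allocator first; fall back to the reserve on failure. */
static struct demo_buf demo_alloc_bounce(struct demo_pool *p,
                                         int simulate_failure)
{
    struct demo_buf b = { NULL, 0 };

    if (!simulate_failure)
        b.mem = malloc(DEMO_BUF_SIZE);
    if (b.mem != NULL)
        return b;

    for (int i = 0; i < DEMO_POOL_SLOTS; i++) {
        if (!p->in_use[i]) {
            p->in_use[i] = 1;
            b.mem = p->slots[i];
            b.from_reserve = 1;
            return b;
        }
    }
    return b;  /* reserve exhausted: mem stays NULL, caller must wait */
}

/* Give a buffer back to wherever it came from. */
static void demo_free_bounce(struct demo_pool *p, struct demo_buf b)
{
    if (!b.from_reserve) {
        free(b.mem);
        return;
    }
    for (int i = 0; i < DEMO_POOL_SLOTS; i++)
        if (b.mem == (void *)p->slots[i])
            p->in_use[i] = 0;
}
```

In this model a failed normal allocation is not fatal as long as a reserve
slot is free. The cost is that at most DEMO_POOL_SLOTS buffers can be
outstanding from the reserve at once, which matches the point that avoiding
the reserve only buys more I/O in parallel.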

Path: archiver1.google.com!news1.google.com!newsfeed.stanford.edu!
newsfeeds.belnet.be!news.belnet.be!news.tele.dk!small.news.tele.dk!
194.213.69.151!news.algonet.se!algonet!newsfeed1.uni2.dk!news.net.uni-c.dk!
uninett.no!uio.no!nntp.uio.no!ifi.uio.no!internet-mailinglist
Newsgroups: fa.linux.kernel
Return-Path: <linux-kernel-ow...@vger.kernel.org>
X-Authentication-Warning: ping.us.dell.com: robert owned process doing -bs
Original-Date: 	Fri, 28 Sep 2001 09:02:18 -0500 (CDT)
From: Robert Macaulay <robert_macau...@dell.com>
X-X-Sender:  <rob...@ping.us.dell.com>
Reply-To: Robert Macaulay <robert_macau...@dell.com>
To: Andrea Arcangeli <and...@suse.de>
cc: Linus Torvalds <torva...@transmeta.com>,
        Rik van Riel <r...@conectiva.com.br>,
        Craig Kulesa <ckul...@as.arizona.edu>, <linux-ker...@vger.kernel.org>,
        Bob Matthews <bmatth...@redhat.com>,
        Marcelo Tosatti <marc...@conectiva.com.br>
Subject: LILO causes segmentation fault and panic [was Re: highmem deadlock
 fix]
In-Reply-To: <20010928042417.J14277@athlon.random>
Original-Message-ID: <Pine.LNX.4.33.0109280859280.30080-100000@ping.us.dell.com>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: linux-kernel-ow...@vger.kernel.org
Precedence: bulk
X-Mailing-List: 	linux-kernel@vger.kernel.org
Organization: Internet mailing list
Date: Fri, 28 Sep 2001 14:05:03 GMT
Message-ID: <fa.klfim5v.s7gr9n@ifi.uio.no>
References: <fa.ea3g2fv.1k0sv3j@ifi.uio.no>
Lines: 73

Not sure if this is 100% related to the latest patch, but after we had our
0-order allocation failures, I ran lilo to switch to a new kernel, and it
panicked. It's never done this before, so it might be related.

Robert

ksymoops 2.4.3 on i686 2.4.10-aaStuff.  Options used
     -V (default)
     -k /proc/ksyms (default)
     -l /proc/modules (default)
     -o /lib/modules/2.4.10-aaStuff/ (default)
     -m linux-2.4.10/System.map (specified)

Warning (compare_maps): mismatch on symbol partition_name  , ksyms_base says c01cf820, System.map says c015a2b0.  Ignoring ksyms_base entry
invalid operand: 0000
CPU:    3
EIP:    0010:[<c012fb27>]
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010206
eax: cc403648   ebx: cc403638   ecx: 000003f0   edx: 00000000
esi: cc403638   edi: 000003f0   ebp: 00000246   esp: e3659ea0
ds: 0018   es: 0018   ss: 0018
Process lilo (pid: 6666, stackpage=e3659000)
Stack: 00000000 e3658000 e3658000 00001a0b c024d6bb c011bec6 00001a0b cc403638
       cc403640 cc403638 00000246 c0130191 cc403638 000003f0 e3658000 00001a0b
       c024d6bb 00001a0b c0340018 ea03f400 fffffff4 c03415a0 00000000 f89166b6
Call Trace: [<c011bec6>] [<c0130191>] [<f89166b6>] [<c0193332>] [<c0141936>]
   [<c0138656>] [<c014ce9c>] [<c013855d>] [<c014481e>] [<c0138894>] [<c010710b>
Code: 0f 0b f7 c7 00 10 00 00 0f 85 10 02 00 00 b8 00 e0 ff ff 21

>>EIP; c012fb26 <kmem_cache_grow+16/240>   <=====
Trace; c011bec6 <sys_waitpid+16/20>
Trace; c0130190 <kmalloc+150/180>
Trace; f89166b6 <[ide-cd]ide_cdrom_open+36/80>
Trace; c0193332 <ide_open+d2/100>
Trace; c0141936 <blkdev_open+76/d0>
Trace; c0138656 <dentry_open+e6/190>
Trace; c014ce9c <dput+1c/160>
Trace; c013855c <filp_open+4c/60>
Trace; c014481e <getname+5e/a0>
Trace; c0138894 <sys_open+34/c0>
Trace; c010710a <system_call+32/38>
Code;  c012fb26 <kmem_cache_grow+16/240>
00000000 <_EIP>:
Code;  c012fb26 <kmem_cache_grow+16/240>   <=====
   0:   0f 0b                     ud2a      <=====
Code;  c012fb28 <kmem_cache_grow+18/240>
   2:   f7 c7 00 10 00 00         test   $0x1000,%edi
Code;  c012fb2e <kmem_cache_grow+1e/240>
   8:   0f 85 10 02 00 00         jne    21e <_EIP+0x21e> c012fd44 <kmem_cache_grow+234/240>
Code;  c012fb34 <kmem_cache_grow+24/240>
   e:   b8 00 e0 ff ff            mov    $0xffffe000,%eax
Code;  c012fb38 <kmem_cache_grow+28/240>
  13:   21 00                     and    %eax,(%eax)


1 warning issued.  Results may not be reliable.

