[PATCH] Recent VM fiasco - fixed

Hi to all!

After I _finally_ got tired of the constantly worsening VM
behaviour in the recent kernels, I thought I could spare a few hours
this weekend just to see what's going on. I was quite surprised to see
that the VM subsystem, while at its worst (at least in 2.3.x),
is quite easily repairable even by unskilled hands... I compiled and
checked a few kernels back to 2.3.51, and found that new code was
constantly being added that only made things worse. Short history:

2.3.51 - mostly OK, but reading from disk takes too much CPU (kswapd)
2.3.99-pre1, 2 - as .51 + aggressive swap out during writing
2.3.99-pre3, 4, 5 - reading better
2.3.99-pre5, 6 - both reading and writing take 100% CPU!!!

I also tried some pre7-x (forgot which one) but that one was f****d up
beyond recognition (read: it was killing my processes, including X11,
like mad, every time I started writing to disk). Thus the patch that
follows, which fixes all the above-mentioned problems, was made against
pre6, sorry. I'll make another patch when pre7 gets out, if things are
still not properly fixed.

BTW, this patch mostly *removes* cruft recently added, and returns to
the known state of operation. Once that is achieved, it is easy
to selectively add back the good things I might have removed, and change
behaviour as wanted, but I would like to urge people to test things
thoroughly before releasing patches this close to 2.4.

Then again, I might have introduced bugs in this patch, too. :)
But, I *tried* to break it (spent some time doing that), and testing
didn't reveal any bad behaviour.

Enjoy!


Patch
-- 
Zlatko

Re: [PATCH] Recent VM fiasco - fixed

On 8 May 2000, Zlatko Calusic wrote:

> BTW, this patch mostly *removes* cruft recently added, and
> returns to the known state of operation.

Which doesn't work.

Think of a 1GB machine which has a 16MB DMA zone,
a 950MB normal zone and a very small HIGHMEM zone.

With the old VM code the HIGHMEM zone would be
swapping like mad while the other two zones are
idle.
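
Rik's failure mode can be sketched in a few lines of C. This is purely
illustrative (the struct, function name, and numbers are invented for the
example, not kernel code): when each zone is checked against its own
watermark, a tiny, nearly-full HIGHMEM zone keeps triggering reclaim even
though the machine as a whole has hundreds of megabytes free.

```c
/* Illustrative sketch, not kernel code: per-zone watermark checks
 * on a hypothetical 1GB box. All names and numbers are made up. */
struct zone_sketch {
	const char *name;
	long free_kb;	/* free memory in this zone */
	long low_kb;	/* per-zone reclaim watermark */
};

/* Count zones that would trigger reclaim on their own, and report
 * how much memory is free machine-wide. A tiny, nearly-full
 * HIGHMEM zone trips its watermark even while the DMA and normal
 * zones sit idle. */
static int zones_triggering_reclaim(const struct zone_sketch *z, int n,
				    long *total_free_kb)
{
	int triggered = 0;

	*total_free_kb = 0;
	for (int i = 0; i < n; i++) {
		*total_free_kb += z[i].free_kb;
		if (z[i].free_kb < z[i].low_kb)
			triggered++;
	}
	return triggered;
}
```

With, say, 12MB free in DMA, 400MB free in the normal zone and a few
hundred kB free in HIGHMEM, only the HIGHMEM zone is below its watermark,
yet that alone is enough to keep kswapd "swapping like mad".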

It's Not That Kind Of Party(tm)

cheers,

Rik
--
The Internet is not a network of computers. It is a network
of people. That is its real strength.

Wanna talk about the kernel?  irc.openprojects.net / #kernelnewbies
http://www.conectiva.com/		http://www.surriel.com/

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

Re: [PATCH] Recent VM fiasco - fixed

Rik van Riel <riel@conectiva.com.br> writes:

> On 8 May 2000, Zlatko Calusic wrote:
> 
> > BTW, this patch mostly *removes* cruft recently added, and
> > returns to the known state of operation.
> 
> Which doesn't work.
> 
> Think of a 1GB machine which has a 16MB DMA zone,
> a 950MB normal zone and a very small HIGHMEM zone.
> 
> With the old VM code the HIGHMEM zone would be
> swapping like mad while the other two zones are
> idle.
> 
> It's Not That Kind Of Party(tm)
> 

OK, I see now what you have in mind, and I'll try to test it when I
get home (yes, late worker... my only connection to the Net :))
If only I could buy 1GB to test in the real setup. ;)

But still, optimizing for 1GB while at the same time completely
killing performance, even *usability*, for 99% of users doesn't
look like a good solution, does it?

There were a lot of VM changes recently (>100K of patches) in which we went
further and further away from the mostly stable code base (IMHO)
trying to fix zone balancing. Maybe it's time we try again, fresh from
the "start"?

I'll admit I didn't understand most of the conversation about zone
balancing recently on linux-mm. And I know it's because I didn't have
much time lately to hack the kernel, unfortunately.

But after a few hours spent dealing with the horrible VM that is in
pre6, I'm not scared anymore. And I think that the solution to all our
problems with zone balancing must be very simple. But it's probably
hard to find, so it will need lots of modeling and testing. I don't
think adding a few lines here and there all the time will take us
anywhere.

Regards,
-- 
Zlatko
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

Re: [PATCH] Recent VM fiasco - fixed

On 8 May 2000, Zlatko Calusic wrote:
> 
> But still, optimizing for 1GB while at the same time completely
> killing performance, even *usability*, for 99% of users doesn't
> look like a good solution, does it?

Oh, definitely. I'll make a new pre7 that has a lot of the simplifications
discussed here over the weekend, and seems to work for me (tested both on
a 512MB setup and a 64MB setup for some sanity).

This pre7 almost certainly won't be all that perfect either, but gives a
better starting point.

> But after a few hours spent dealing with the horrible VM that is in
> pre6, I'm not scared anymore.

Good. This is really not scary stuff. Much of it is quite straightforward,
and is mainly just getting the right "feel". It's really easy to make
mistakes here, but they tend to be mistakes that just make the system act
badly, not the kind of _really_ scary mistakes (the ones that make it
corrupt disks randomly ;)

		Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

Re: [PATCH] Recent VM fiasco - fixed

On 8 May 2000, Zlatko Calusic wrote:
> Rik van Riel <riel@conectiva.com.br> writes:
> > On 8 May 2000, Zlatko Calusic wrote:
> > 
> > > BTW, this patch mostly *removes* cruft recently added, and
> > > returns to the known state of operation.
> > 
> > Which doesn't work.
> > 
> > Think of a 1GB machine which has a 16MB DMA zone,
> > a 950MB normal zone and a very small HIGHMEM zone.
> > 
> > With the old VM code the HIGHMEM zone would be
> > swapping like mad while the other two zones are
> > idle.
> > 
> > It's Not That Kind Of Party(tm)
> 
> OK, I see now what you have in mind, and I'll try to test it when I
> get home (yes, late worker... my only connection to the Net :))
> If only I could buy 1GB to test in the real setup. ;)
> 
> But still, optimizing for 1GB while at the same time completely
> killing performance, even *usability*, for 99% of users doesn't
> look like a good solution, does it?

20MB and 24MB machines will be in the same situation, if
that's of any help to you ;)

> But after a few hours spent dealing with the horrible VM that is
> in pre6, I'm not scared anymore. And I think that the solution
> to all our problems with zone balancing must be very simple.

It is. Linus is working on a conservative & simple solution
while I'm trying a bit more "far-out" code (active and inactive
list a'la BSD, etc...). We should have at least one good VM
subsystem within the next few weeks ;)

regards,

Rik


Re: [PATCH] Recent VM fiasco - fixed

Rik van Riel <riel@conectiva.com.br> writes:

> 20MB and 24MB machines will be in the same situation, if
> that's of any help to you ;)
> 

Yes, you are right. And thanks for that tip (booting with mem=24m)
because that will be my first test case later tonight.

> > But after a few hours spent dealing with the horrible VM that is
> > in pre6, I'm not scared anymore. And I think that the solution
> > to all our problems with zone balancing must be very simple.
> 
> It is. Linus is working on a conservative & simple solution
> while I'm trying a bit more "far-out" code (active and inactive
> list a'la BSD, etc...). We should have at least one good VM
> subsystem within the next few weeks ;)
> 

Nice. I'm also in favour of some kind of active/inactive list
solution (looks promising), but that is probably 2.5.x stuff.

I would be happy to see 2.4 out ASAP. Later, when it stabilizes, we
will have lots of fun in 2.5, that's for sure.

Regards,
-- 
Zlatko

Re: [PATCH] Recent VM fiasco - fixed

On 8 May 2000, Zlatko Calusic wrote:
> Rik van Riel <riel@conectiva.com.br> writes:
> 
> > > But after a few hours spent dealing with the horrible VM that is
> > > in pre6, I'm not scared anymore. And I think that the solution
> > > to all our problems with zone balancing must be very simple.
> > 
> > It is. Linus is working on a conservative & simple solution
> > while I'm trying a bit more "far-out" code (active and inactive
> > list a'la BSD, etc...). We should have at least one good VM
> > subsystem within the next few weeks ;)
> 
> Nice. I'm also in favour of some kind of active/inactive list
> solution (looks promising), but that is probably 2.5.x stuff.

I have it booting (against pre7-4) and it seems almost
stable ;)  (with _low_ overhead)

> I would be happy to see 2.4 out ASAP. Later, when it stabilizes,
> we will have lots of fun in 2.5, that's for sure.

Of course, this has the highest priority.

regards,

Rik


Re: [PATCH] Recent VM fiasco - fixed

Rik,
That's astonishing, I'm sure, but think of us poor bastards who DON'T have
an SMP machine with >1gig of RAM.

This is a P120, 32meg. Lately, fine has degenerated into bad into worse
into absolutely obscene. It even kills my PGSQL compiles.
And I killed *EVERYTHING* there was to kill.
The only processes were init, bash and gcc/cc1. VM still wiped it out.

d

On Mon, 8 May 2000, Rik van Riel wrote:

> On 8 May 2000, Zlatko Calusic wrote:
> 
> > BTW, this patch mostly *removes* cruft recently added, and
> > returns to the known state of operation.
> 
> Which doesn't work.
> 
> Think of a 1GB machine which has a 16MB DMA zone,
> a 950MB normal zone and a very small HIGHMEM zone.
> 
> With the old VM code the HIGHMEM zone would be
> swapping like mad while the other two zones are
> idle.
> 
> It's Not That Kind Of Party(tm)
> 
> cheers,
> 
> Rik


Re: [PATCH] Recent VM fiasco - fixed

On Tue, 9 May 2000, Daniel Stone wrote:

> That's astonishing, I'm sure, but think of us poor bastards who
> DON'T have an SMP machine with >1gig of RAM.
> 
> This is a P120, 32meg.

The old zoned VM code will run that machine as efficiently
as if it had 16MB of ram. See my point now?

Rik


Re: [PATCH] Recent VM fiasco - fixed

Daniel Stone <tamriel@ductape.net> writes:

> That's astonishing, I'm sure, but think of us poor bastards who
> DON'T have an SMP machine with >1gig of RAM.

He has to care about us fortunate guys with e.g. 8GB of memory too. The
recent kernels are broken for that as well.

Greetings
		Christoph

Re: [PATCH] Recent VM fiasco - fixed

On 9 May 2000, Christoph Rohland wrote:

> Daniel Stone <tamriel@ductape.net> writes:
> 
> > That's astonishing, I'm sure, but think of us poor bastards who
> > DON'T have an SMP machine with >1gig of RAM.
> 
> He has to care about us fortunate guys with e.g. 8GB of memory too. The
> recent kernels are broken for that as well.

Try out the really recent one - pre7-8. So far it has some good reviews,
and I've tested it both on a 20MB machine and a 512MB one..

		Linus


Re: [PATCH] Recent VM fiasco - fixed

Linus Torvalds <torvalds@transmeta.com> writes:

> Try out the really recent one - pre7-8. So far it has some good reviews,
> and I've tested it both on a 20MB machine and a 512MB one..

Nope, it more or less locks up after the first attempt to swap
something out. I can still run ls and free, but as soon as something
touches /proc it locks up. Also, my test programs don't do anything
any more.

I append the mem and task info from sysrq. Mem info seems to not
change after lockup.

Greetings
		Christoph


Show Memory


Re: [PATCH] Recent VM fiasco - fixed

On 9 May 2000, Christoph Rohland wrote:

> Linus Torvalds <torvalds@transmeta.com> writes:
> 
> > Try out the really recent one - pre7-8. So far it has some good reviews,
> > and I've tested it both on a 20MB machine and a 512MB one..
> 
> Nope, it more or less locks up after the first attempt to swap
> something out. I can still run ls and free, but as soon as something
> touches /proc it locks up. Also, my test programs don't do anything
> any more.

This may be due to an unrelated bug with the task_lock() fixing (see
separate patch from Manfred for that one).

> I append the mem and task info from sysrq. Mem info seems to not
> change after lockup.

I suspect that if you do right-alt + scrolllock, you'll see it looping on
a spinlock. Which is why the memory info isn't changing ;)

But I'll double-check the shm code (I didn't test anything that did any
shared memory, for example).

			Linus


Re: [PATCH] Recent VM fiasco - fixed

Linus Torvalds <torvalds@transmeta.com> writes:

> On 9 May 2000, Christoph Rohland wrote:
> 
> > Linus Torvalds <torvalds@transmeta.com> writes:
> > 
> > > Try out the really recent one - pre7-8. So far it has some good reviews,
> > > and I've tested it both on a 20MB machine and a 512MB one..

> > I append the mem and task info from sysrq. Mem info seems to not
> > change after lockup.
> 
> I suspect that if you do right-alt + scrolllock, you'll see it looping on
> a spinlock. Which is why the memory info isn't changing ;)
> 
> But I'll double-check the shm code (I didn't test anything that did any
> shared memory, for example).

Juan Quintela's patch fixes the lockup. shm paging locked up on the
page lock.

Now I can give more data about pre7-8. After a short run I can say the
following:

The machine seems to be stable, but the VM is badly unbalanced:

[root@ls3016 /root]# vmstat 5
   procs                      memory    swap          io     system         cpu
 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id

[...]

 9  3  0      0 1460016   1588  11284   0   0     0     0  109 23524   4  96   0
 9  3  1   7552 557432   1004  19320   0 1607     0   402  186 42582   2  89   9
11  1  1  41972 111368    424  53740   0 6884     2  1721  277 25904   0  89  10
11  1  0  48084  11896    276  59404   0 1133     1   284  181  4439   0  95   5
13  2  2  48352 466952    180  52960   5 158     4    39  230  6381   2  98   0
10  3  1  53400 934204    248  59940 498 1442   128   363  272  3953   1  99   0
11  3  1  52624 878696    300  59820 248  50    81    13  148   971   0 100   0
11  1  0   4556 883852    316  16164 855   0   214     1  127 25188   3  97   0
12  0  0   3936 525620    316  15544   0   0     0     0  109 33969   4  96   0
12  0  0   3936 2029556    316  15544   0   0     0     0  123 19659   4  96   0
11  1  0   3936 686856    316  15544   0   0     0     0  117 14370   3  97   0
12  0  0   3936 388176    320  15544   0   0     0     0  121  7477   3  97   0
10  3  1  47660   5216     88  19992   0 9353     0  2341  757  1267   0  97   3
 VM: killing process ipctst
 6  6  1  36792 484880    152  26892  65 12307    21  3078 1619  2184   0  94   6
   procs                      memory    swap          io     system         cpu
 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
10  1  1  39620  66736    148  29364   8 494     2   125  327  1980   0 100   0
VM: killing process ipctst
 9  2  1  46536 627356    116  31072  87 8675    23  2169 1784  1412   0  96   4
10  0  1  46664 617368    116  31200   0  26     0     6  258   112   0 100   0
10  0  1  47300 607184    116  31832   0 126     0    32  291   110   0 100   0

So we are swapping out with lots of free memory and killing random
processes. The machine also becomes quite unresponsive compared to
pre4 on the same tests.

Greetings
		Christoph

-- 
Christoph Rohland               Tel:   +49 6227 748201
SAP AG                          Fax:   +49 6227 758201
LinuxLab                        Email: cr@sap.com

Re: [PATCH] Recent VM fiasco - fixed

Christoph Rohland <cr@sap.com> writes:

> Linus Torvalds <torvalds@transmeta.com> writes:
> 
> > On 9 May 2000, Christoph Rohland wrote:
> > 
> > > Linus Torvalds <torvalds@transmeta.com> writes:
> > > 
> > > > Try out the really recent one - pre7-8. So far it has some good reviews,
> > > > and I've tested it both on a 20MB machine and a 512MB one..
> 
> > > I append the mem and task info from sysrq. Mem info seems to not
> > > change after lockup.
> > 
> > I suspect that if you do right-alt + scrolllock, you'll see it looping on
> > a spinlock. Which is why the memory info isn't changing ;)
> > 
> > But I'll double-check the shm code (I didn't test anything that did any
> > shared memory, for example).
> 
> Juan Quintela's patch fixes the lockup. shm paging locked up on the
> page lock.
> 
> Now I can give more data about pre7-8. After a short run I can say the
> following:
> 
> The machine seems to be stable, but the VM is badly unbalanced:
> 
> [root@ls3016 /root]# vmstat 5
>    procs                      memory    swap          io     system         cpu
>  r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
> 
> [...]
> 
>  9  3  0      0 1460016   1588  11284   0   0     0     0  109 23524   4  96   0
>  9  3  1   7552 557432   1004  19320   0 1607     0   402  186 42582   2  89   9
> 11  1  1  41972 111368    424  53740   0 6884     2  1721  277 25904   0  89  10
> [...]
>  9  2  1  46536 627356    116  31072  87 8675    23  2169 1784  1412   0  96   4
> 10  0  1  46664 617368    116  31200   0  26     0     6  258   112   0 100   0
> 10  0  1  47300 607184    116  31832   0 126     0    32  291   110   0 100   0
> 
> So we are swapping out with lots of free memory and killing random
> processes. The machine also becomes quite unresponsive compared to
> pre4 on the same tests.
> 

I'll second this!

I checked pre7-8 briefly, but I/O & MM interaction is bad. Lots of
swapping, lots of wasted CPU cycles and lots of dead writer processes
(write(2): out of memory, while there is 100MB in the page cache).

Back to my patch and working on the solution for the 20-24 MB & 1GB
machines. Anybody with spare 1GB RAM to help development? :)

-- 
Zlatko

Re: [PATCH] Recent VM fiasco - fixed

>>>>> "Linus" == Linus Torvalds <torvalds@transmeta.com> writes:

Linus> Try out the really recent one - pre7-8. So far it has some good
Linus> reviews, and I've tested it both on a 20MB machine and a 512MB
Linus> one..

pre7-8 still isn't completely fixed, but it is better than pre6.

Try doing something like 'cp -a linux-2.3.99-pre7-8 foobar' and
watching kswapd in top (or qps, et al.).  On my dual-proc box, kswapd
still maxes out one of the cpus.  Tar doesn't seem to show it, but
bzcat can get an occasional segfault on large files.

The filesystem, though, has 1k rather than 4k blocks.  Yeah, just
tested again on a fs w/ 4k blocks.  kswapd only used 50% to 65% of a
cpu, but that was an ide drive and the former was on a scsi drive.[1]

OTOH, in pre6 X would hit (or at least report) 2^32-1 major faults
after only a few hours of usage.  That bug is gone in pre7-8.

[1] asus p2b-ds mb using onboard adaptec scsi and piix ide; drives are
    all IBM ultrastars and deskstars.

-JimC
-- 
James H. Cloos, Jr.  <URL:http://jhcloos.com/public_key> 1024D/ED7DAEA6 
<cloos@jhcloos.com>  E9E9 F828 61A4 6EA9 0F2B  63E7 997A 9F17 ED7D AEA6
        Save Trees:  Get E-Gold! <URL:http://jhcloos.com/go?e-gold>

Re: [PATCH] Recent VM fiasco - fixed

Ok, there's a pre7-9 out there, and the biggest change versus pre7-8 is
actually how block fs dirty data is flushed out. Instead of just waking up
kflushd and hoping for the best, we actually just write it out (and even
wait on it, if absolutely required).

Which makes the whole process much more streamlined, and makes the numbers
more repeatable. It also fixes the problem with dirty buffer cache data
much more efficiently than the kflushd approach, and mmap002 is not a
problem any more. At least for me.

[ I noticed that mmap002 finishes a whole lot faster if I never actually
  wait for the writes to complete, but that had some nasty behaviour under
  low memory circumstances, so it's not what pre7-9 actually does. I
  _suspect_ that I should start actually waiting for pages only when
  priority reaches 0 - comments welcomed, see fs/buffer.c and the
  sync_page_buffers() function ]

kswapd is still quite aggressive, and will show higher CPU time than
before. This is a tweaking issue - I suspect it is too aggressive right
now, but it needs more testing and feedback. 

Just the dirty buffer handling made quite an enormous difference, so
please do test this if you hated earlier pre7 kernels.

		Linus


Re: [PATCH] Recent VM fiasco - fixed

Some more explanations on the differences between pre7-8 and pre7-9..

Basically pre7-9 survives mmap002 quite gracefully, and I think it does so
for all the right reasons. It's not tuned for that load at all, it's just
that mmap002 was really good at showing two weak points of the mm layer:

 - try_to_free_pages() could actually return success without freeing a
   single page (just moving pages around to the swap cache). This was bad,
   because it could cause us to get into a situation where we
   "successfully" free'd pages without ever adding any to the free list. Which
   would, for all the obvious reasons, cause problems later when we
   couldn't allocate a page after all..

 - The "sync_page_buffers()" thing to sync pages directly to disk rather
   than wait for bdflush to do it for us (and have people run out of
   memory before bdflush got around to the right pages).

   Sadly, as it was set up, try_to_free_buffers() doesn't even get the
   "urgency" flag, so right now it doesn't know whether it should wait for
   previous write-outs or not. So it always does, even though for
   non-critical allocations it should just ignore locked buffers.
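
The two fixes above can be sketched as follows. All names are invented
and this is not the actual 2.3.99 code, just an illustration of the
contract being described: the free path reports only pages it really
freed, and the sync path is told the caller's priority so it can skip
locked buffers unless the caller is desperate.

```c
#include <stdbool.h>

/* Illustrative priority scale: 0 means the caller is desperate. */
enum { PRIO_CRITICAL = 0, PRIO_NONCRITICAL = 6 };

struct page_sketch {
	bool dirty;	/* has data that must reach disk first */
	bool locked;	/* a write-out is already in flight */
};

/* Write a page's dirty buffers out. Only wait on an in-flight
 * write when priority has dropped to 0; non-critical callers just
 * skip locked buffers. Returns true if the page is now freeable. */
static bool sync_page_buffers_sketch(struct page_sketch *p, int priority)
{
	if (p->locked)
		return priority == PRIO_CRITICAL;
	p->dirty = false;
	return true;
}

/* Return the number of pages *actually* freed. Merely moving a
 * page to the swap cache does not count, so a "success" return
 * can no longer mean that nothing was added to the free list. */
static int try_to_free_pages_sketch(struct page_sketch *pages, int n,
				    int priority)
{
	int freed = 0;

	for (int i = 0; i < n; i++)
		if (sync_page_buffers_sketch(&pages[i], priority))
			freed++;
	return freed;
}
```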

Fixing these things suddenly made mmap002 behave quite well. I'll make the
change to pass in the priority to sync_page_buffers() so that I'll get the
increased performance from not waiting when I don't have to, but it starts
to look like pre7 is getting in shape.

		Linus

On Wed, 10 May 2000, Linus Torvalds wrote:
> 
> Ok, there's a pre7-9 out there, and the biggest change versus pre7-8 is
> actually how block fs dirty data is flushed out. Instead of just waking up
> kflushd and hoping for the best, we actually just write it out (and even
> wait on it, if absolutely required).
> 
> Which makes the whole process much more streamlined, and makes the numbers
> more repeatable. It also fixes the problem with dirty buffer cache data
> much more efficiently than the kflushd approach, and mmap002 is not a
> problem any more. At least for me.
> 
> [ I noticed that mmap002 finishes a whole lot faster if I never actually
>   wait for the writes to complete, but that had some nasty behaviour under
>   low memory circumstances, so it's not what pre7-9 actually does. I
>   _suspect_ that I should start actually waiting for pages only when
>   priority reaches 0 - comments welcomed, see fs/buffer.c and the
>   sync_page_buffers() function ]
> 
> kswapd is still quite aggressive, and will show higher CPU time than
> before. This is a tweaking issue - I suspect it is too aggressive right
> now, but it needs more testing and feedback. 
> 
> Just the dirty buffer handling made quite an enormous difference, so
> please do test this if you hated earlier pre7 kernels.
> 
> 		Linus
> 
> 


[patch] balanced highmem subsystem under pre7-9

IMO high memory should not be balanced. Stock pre7-9 tried to balance high
memory once it got below the threshold (causing very bad VM behavior and
high kswapd usage) - this is incorrect because there is nothing special
about the highmem zone; it's more like an 'extension' of the normal zone,
to which specific caches can turn. (patch attached)

another problem is that even during a mild test the DMA zone gets emptied
easily - but on a big-RAM box kswapd has to work _a lot_ to fill it up. In
fact on an 8GB box it's completely futile to fill up the DMA zone. What
worked for me is this zone-chainlist trick in the zone setup code:

                        case ZONE_NORMAL:
                                zone = pgdat->node_zones + ZONE_NORMAL;
                                if (zone->size)
                                        zonelist->zones[j++] = zone;
++                              break;
                        case ZONE_DMA:
                                zone = pgdat->node_zones + ZONE_DMA;
                                if (zone->size)
                                        zonelist->zones[j++] = zone;

no 'normal' allocation chain leads to the ZONE_DMA zone, except GFP_DMA
and GFP_ATOMIC - both of them rightfully access the DMA zone.

this is a real-life problem: without the above, an 8GB box under load crashes
pretty quickly due to failed SCSI-layer DMA allocations. (i think those
allocations are silly in the first place.)

the above is suboptimal on boxes whose total RAM is within one order of
magnitude of 16MB (the DMA zone stays empty most of the time and is
inaccessible to various caches) - so maybe the following (not yet
implemented) solution would be generic and acceptable:

allocate 5% of total RAM or 16MB to the DMA zone (via fixing up zone sizes
on bootup), whichever is smaller, in 2MB increments. Disadvantage of this
method: e.g. it wastes 2MB RAM on an 8MB box. We could probably live with
64kB increments (there are 64kB ISA DMA constraints the sound drivers and
some SCSI drivers are hitting) - is this really true? If nobody objects
i'll implement this later on (together with the asymmetric allocation
chain trick) - there will be a 64kB DMA pool allocated on the smallest
boxes, which should be acceptable even on a 4MB box. We could turn off the
DMA zone altogether on most boxes, if it wasn't for the SCSI layer
allocating DMA pages even for PCI drivers ...
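
The sizing rule above can be written down directly. A minimal sketch,
with an invented function name and my own choice of rounding (round down
to the increment, floor at one increment) - not code from any kernel tree:

```c
/* Sketch of the proposed sizing rule: give the DMA zone
 * min(5% of total RAM, 16MB), rounded down to the chosen
 * increment (2MB, or 64kB under ISA-style constraints), but
 * never less than one increment. */
static unsigned long dma_pool_kb(unsigned long total_ram_kb,
				 unsigned long increment_kb)
{
	unsigned long want = total_ram_kb / 20;		/* 5% of RAM */

	if (want > 16 * 1024)
		want = 16 * 1024;			/* cap at 16MB */
	want -= want % increment_kb;			/* round to increment */
	if (want < increment_kb)
		want = increment_kb;			/* floor: one increment */
	return want;
}
```

With 2MB increments this reproduces the stated disadvantage: an 8MB box
(5% is only ~410kB) still ends up donating a full 2MB to the DMA pool,
while an 8GB box gets the 16MB cap.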

Comments?

	Ingo
--- linux/mm/page_alloc.c.orig	Thu May 11 02:10:34 2000
+++ linux/mm/page_alloc.c	Thu May 11 16:03:48 2000
@@ -553,9 +566,14 @@
 			mask = zone_balance_min[j];
 		else if (mask > zone_balance_max[j])
 			mask = zone_balance_max[j];
-		zone->pages_min = mask;
-		zone->pages_low = mask*2;
-		zone->pages_high = mask*3;
+		if (j == ZONE_HIGHMEM) {
+			zone->pages_low = zone->pages_high =
+						zone->pages_min = 0;
+		} else {
+			zone->pages_min = mask;
+			zone->pages_low = mask*2;
+			zone->pages_high = mask*3;
+		}
 		zone->low_on_memory = 0;
 		zone->zone_wake_kswapd = 0;
 		zone->zone_mem_map = mem_map + offset;

Re: [patch] balanced highmem subsystem under pre7-9

On Fri, 12 May 2000, Ingo Molnar wrote:

>IMO high memory should not be balanced. Stock pre7-9 tried to balance high
>memory once it got below the threshold (causing very bad VM behavior and
>high kswapd usage) - this is incorrect because there is nothing special
>about the highmem zone; it's more like an 'extension' of the normal zone,
>to which specific caches can turn. (patch attached)

IMHO that is a hack to work around the currently broken design of the MM.
And it will also produce bad effects since you won't age and recycle the
cache in the highmem zone correctly.

Without the classzone design you will always have kswapd and the page
allocator shrinking memory even when it's not necessary. Please check as
reference the very detailed explanation I posted around two weeks ago on
linux-mm in reply to Linus.

What you're trying to work around on the highmem part is exactly the same
problem you also have between the normal zone and the DMA zone. Why don't
you also just keep 3MB always free in the DMA zone and never shrink the
normal zone?

Andrea


Re: [patch] balanced highmem subsystem under pre7-9

On Fri, 12 May 2000, Andrea Arcangeli wrote:

> >IMO high memory should not be balanced. Stock pre7-9 tried to balance high
> >memory once it got below the threshold (causing very bad VM behavior and
> >high kswapd usage) - this is incorrect because there is nothing special
> >about the highmem zone; it's more like an 'extension' of the normal zone,
> >to which specific caches can turn. (patch attached)
> 
> IMHO that is a hack to work around the currently broken design of the MM.
> And it will also produce bad effects since you won't age and recycle the
> cache in the highmem zone correctly.

what bad effects? the LRU list of the pagecache is a completely
independent mechanism. Highmem pages are LRU-freed just as effectively as
normal pages. The pagecache LRU list is not per-zone but (IMHO correctly)
global, so the particular zone of highmem pages is completely transparent
and irrelevant to the LRU mechanism. I cannot see any bad effects wrt. LRU
recycling and the highmem zone here. (let me know if you meant some
different recycling mechanism)

> What you're trying to work around on the highmem side is exactly the
> same problem you also have between the normal zone and the DMA zone.
> Why don't you also just keep 3MB always free in the DMA zone and never
> shrink the normal zone?

i'm not working around anything. Highmem _should not be balanced_, period.
It's a superset of normal memory, and by just balancing normal memory (and
adding highmem free count to the total) we are completely fine. Highmem is
also a temporary phenomenon, it will probably disappear in a few years
once 64-bit systems and proper 64-bit DMA become commonplace. (and small
devices will do 32-bit + 32-bit DMA.)

'balanced' means: 'keep X amount of highmem free'. What is your point in
keeping free highmem around?

the DMA zone resizing suggestion from yesterday is i believe conceptually
correct as well, we _want to_ isolate normal allocators from these 'emergency
pools'. IRQ handlers cannot wait for more free RAM.


about classzone. This was the initial idea how to do balancing when the
zoned allocator was implemented (along with per-zone kswapd threads or
per-zone queues), but it just gets too complex IMHO. Why don't you give the
simpler suggestion from yesterday a thought? We have only one zone
essentially which has to be balanced, ZONE_NORMAL. ZONE_DMA is and should
become special because it also serves as an atomic pool for IRQ
allocations. (ZONE_HIGHMEM is special and uninteresting as far as memory
balance goes, as explained above.) So we only have ZONE_NORMAL to worry
about. Zonechains are perfect ways of defining fallback routes.
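The zone-chain fallback idea above can be sketched as a toy model; the
`struct zone` fields, `alloc_from_chain`, and the chain layout here are
hypothetical stand-ins for illustration, not the real kernel definitions:

```c
/* Toy model of a zone fallback chain (hypothetical types; not the
 * real kernel structures). */
#include <assert.h>

enum zone_id { ZONE_DMA, ZONE_NORMAL, ZONE_HIGHMEM };

struct zone {
    enum zone_id id;
    long free_pages;
    long pages_low;   /* don't dip below this without waking kswapd */
};

/* A fallback chain is a 0-terminated array of zones, ordered from
 * most preferred to least preferred (e.g. HIGHMEM -> NORMAL -> DMA). */
struct zone *alloc_from_chain(struct zone **chain)
{
    for (int i = 0; chain[i] != 0; i++) {
        struct zone *z = chain[i];
        /* Take a page only while the zone stays above its low mark,
         * so scarcer fallback zones are not eaten prematurely. */
        if (z->free_pages > z->pages_low) {
            z->free_pages--;
            return z;
        }
    }
    return 0; /* all zones low: caller would wake kswapd and retry */
}
```

A GFP_HIGHMEM request would simply get a longer chain than a GFP_DMA
request; the walk itself is identical, which is what makes zone chains a
simple way of defining fallback routes.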

i've had a nicely balanced (heavily loaded) 8GB box for the past couple of
weeks, just by doing (yesterday's) slight trivial changes to the
zone-chains and watermarks. The default settings in the stock kernel were
not tuned, but all the mechanism is there. LRU is working, there was
always DMA RAM around, no classzones necessary here. So what exactly is
the case you are trying to balance?

	Ingo


Re: [patch] balanced highmem subsystem under pre7-9

On Fri, 12 May 2000, Ingo Molnar wrote:

>what bad effects? the LRU list of the pagecache is a completely
>independent mechanism. Highmem pages are LRU-freed just as effectively as
>normal pages. The pagecache LRU list is not per-zone but (IMHO correctly)
>global, so the particular zone of highmem pages is completely transparent

It shouldn't be global but per-NUMA-node as I have in the classzone patch.

>and irrelevant to the LRU mechanism. I cannot see any bad effects wrt. LRU
>recycling and the highmem zone here. (let me know if you meant some
>different recycling mechanism)

See line 320 of filemap.c in 2.3.99-pre7-pre9. (ignore the fact it will
recycle 1 page, it's just because they didn't expect pages_high to be
zero)

>'balanced' means: 'keep X amount of highmem free'. What is your point in
>keeping free highmem around?

Assuming there is no point, you still want to free also from the highmem
zone while doing LRU aging of the cache.

And if you don't keep X amount of highmem free you'll break if an IRQ
does a GFP_HIGHMEM allocation.

Note also that by highmem I don't mean the memory between 1GB and
64GB, but the memory between 0 and 64GB. When you allocate with
GFP_HIGHUSER you ask the MM for a page between 0 and 64GB.

And in turn, what is the point of keeping X amount of normal/regular
memory free? You could just keep such an amount free in the DMA zone, so
why do you also try to keep it free in the normal zone? The problem is
the same.

Please read my emails on linux-mm of a few weeks ago about the classzone
approach. I can forward them to linux-kernel if there is interest (I don't
know if there's a web archive, but I guess there is).

If the current strict zone approach weren't broken, we could just as well
choose to split ZONE_HIGHMEM into 10 or 20 zones to scale 10 or 20 times
better during allocations, no? Is this argument enough to at least ring a
bell that the current design is flawed? The flaw is that we pay for it
with drawbacks, and with a VM that does the wrong thing because it doesn't
have enough information (it only sees a little part of the picture). You
can't fix it without looking at the whole picture (the classzone).

Andrea


Re: [patch] balanced highmem subsystem under pre7-9

On Fri, 12 May 2000, Andrea Arcangeli wrote:
> On Fri, 12 May 2000, Ingo Molnar wrote:
> 
> >what bad effects? the LRU list of the pagecache is a completely
> >independent mechanism. Highmem pages are LRU-freed just as effectively as
> >normal pages. The pagecache LRU list is not per-zone but (IMHO correctly)
> >global, so the particular zone of highmem pages is completely transparent
> 
> It shouldn't be global but per-NUMA-node as I have in the classzone patch.

*nod*

This change is in my source tree too (but the active/inactive
page list thing doesn't work yet).

> >and irrelevant to the LRU mechanism. I cannot see any bad effects wrt. LRU
> >recycling and the highmem zone here. (let me know if you meant some
> >different recycling mechanism)
> 
> See line 320 of filemap.c in 2.3.99-pre7-pre9. (ignore the fact
> it will recycle 1 page, it's just because they didn't expect
> pages_high to be zero)

Indeed, pages_high for the highmem zone probably shouldn't be zero.

pages_min and pages_low:  0
pages_high:               128???  (free up to 512kB of high memory)

> >'balanced' means: 'keep X amount of highmem free'. What is your point in
> >keeping free highmem around?
> 
> Assuming there is no point, you still want to free also from the
> highmem zone while doing LRU aging of the cache.

True, but this just involves setting the watermarks right. The
current code supports the balancing just fine.

> And if you don't keep X amount of highmem free you'll break if
> an IRQ does a GFP_HIGHMEM allocation.

GFP_HIGHMEM will automatically fall back to the NORMAL zone.
There's no problem here.

> Note also that by highmem I don't mean the memory between
> 1GB and 64GB, but the memory between 0 and 64GB.

Why do you keep insisting on meaning other things with words than
what everybody else means with them? ;)

> Please read my emails on linux-mm of a few weeks ago about
> the classzone approach.

I've read them and it's overly complex and doesn't make much
sense for what we need.

> I can forward them to linux-kernel if there is interest (I don't
> know if there's a web archive but I guess there is).

http://mail.nl.linux.org/linux-mm/
http://www.linux.eu.org/Linux-MM/

> If the current strict zone approach weren't broken, we could
> just as well choose to split ZONE_HIGHMEM into 10 or 20 zones
> to scale 10 or 20 times better during allocations, no?

This would work just fine, except for the fact that we have
only one pagecache_lock ... maybe we want to have multiple
pagecache_locks based on a hash of the inode number? ;)

> Is this argument enough to at least ring a bell that the
> current design is flawed?

But we *can* split the HIGHMEM zone into a bunch of smaller
ones without affecting performance. Just set zone->pages_min
and zone->pages_low to 0 and zone->pages_high to some smallish
value. Then we can teach the allocator to skip the zone if:
1) no obscenely large amount of free pages
2) zone is locked by somebody else (TryLock(zone->lock))

This will work just fine with the current code (plus these
two minor tweaks). No big changes are needed to support this
idea.
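The two tweaks can be sketched as a toy model; the field names,
`zone_trylock`, and `zone_usable` here are hypothetical stand-ins (the
real zone->lock is a spinlock, modelled below as a plain flag):

```c
/* Sketch of the two allocator tweaks above (hypothetical fields;
 * not the real kernel structures or locking primitives). */
#include <assert.h>
#include <stdbool.h>

struct zone {
    long free_pages;
    long pages_high;
    bool locked;               /* stand-in for zone->lock */
};

static bool zone_trylock(struct zone *z)
{
    if (z->locked)
        return false;          /* contended: somebody else holds it */
    z->locked = true;
    return true;
}

/* Skip the zone unless (1) it has a large surplus of free pages and
 * (2) its lock is not held by somebody else. */
bool zone_usable(struct zone *z)
{
    if (z->free_pages <= z->pages_high)
        return false;          /* tweak 1: no big pile of free pages */
    if (!zone_trylock(z))
        return false;          /* tweak 2: lock contended, try next zone */
    z->locked = false;         /* real code would allocate, then unlock */
    return true;
}
```

On an SMP box this lets two CPUs allocating at once naturally spread
over different zones instead of spinning on the same lock, which is
presumably the point of the splitup.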

regards,

Rik
--
The Internet is not a network of computers. It is a network
of people. That is its real strength.

Wanna talk about the kernel?  irc.openprojects.net / #kernelnewbies
http://www.conectiva.com/		http://www.surriel.com/


Re: [patch] balanced highmem subsystem under pre7-9

On Fri, 12 May 2000, Rik van Riel wrote:

> But we *can* split the HIGHMEM zone into a bunch of smaller
> ones without affecting performance. Just set zone->pages_min
> and zone->pages_low to 0 and zone->pages_high to some smallish
> value. Then we can teach the allocator to skip the zone if:
> 1) no obscenely large amount of free pages
> 2) zone is locked by somebody else (TryLock(zone->lock))

what's the point of this splitup? (i suspect there is a point, i just
cannot see it now. thanks.)

	Ingo


Re: [patch] balanced highmem subsystem under pre7-9

[ sorry for the late reply ]

On Fri, 12 May 2000, Ingo Molnar wrote:

>On Fri, 12 May 2000, Rik van Riel wrote:
>
>> But we *can* split the HIGHMEM zone into a bunch of smaller
>> ones without affecting performance. Just set zone->pages_min
>> and zone->pages_low to 0 and zone->pages_high to some smallish
>> value. Then we can teach the allocator to skip the zone if:
>> 1) no obscenely large amount of free pages
>> 2) zone is locked by somebody else (TryLock(zone->lock))
>
>what's the point of this splitup? (i suspect there is a point, i just
>cannot see it now. thanks.)

I quote an email from Rik of 25 Apr 2000 23:10:56 on linux-mm:

-- Message-ID: <Pine.LNX.4.21.0004252240280.14340-100000@duckman.conectiva> --
We can do this just fine. Splitting a box into a dozen more
zones than what we have currently should work just fine,
except for (as you say) higher cpu use by kswapd.

If I get my balancing patch right, most of that disadvantage
should be gone as well. Maybe we *do* want to do this on
bigger SMP boxes so each processor can start out with a
separate zone and check the other zone later to avoid lock
contention?
--------------------------------------------------------------

I still strongly think that the current strict zone mem balancing design
is very broken (and I also think I'm right, since I believe I see the
whole picture), but I don't think I can explain my arguments better
and/or more extensively than I just did in linux-mm a few weeks ago.

If you see anything wrong in my reasoning, please let me know. The
interesting thread was "Re: 2.3.x mem balancing" (the start was off-list)
in linux-mm.

Andrea


Re: [patch] balanced highmem subsystem under pre7-9

On Thu, 18 May 2000, Andrea Arcangeli wrote:

> I still strongly think that the current strict zone mem
> balancing design is very broken (and I also think I'm right,
> since I believe I see the whole picture), but I don't think I
> can explain my arguments better and/or more extensively than I
> just did in linux-mm a few weeks ago.

The balancing as of pre9-2 works like this:
- LRU list per pgdat
- kswapd runs and makes sure every zone has > zone->pages_low
  free pages; after that it stops
- kswapd frees up to zone->pages_high pages, depending on which
  pages we encounter in the LRU queue; this makes sure that the
  zone with the most least-recently-used pages ends up with more
  free pages
- __alloc_pages() allocates all pages down to zone->pages_low on
  every zone before waking up kswapd; this makes sure more pages
  are used from the least loaded zone than from more loaded
  zones, so balancing between zones happens
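The rules above can be modelled with a short toy sketch (all structures
and names below are simplified stand-ins, not the real pre9-2 code;
reclaim is faked by just bumping the free count):

```c
/* Toy model of the pre9-2 balancing rules (hypothetical structures;
 * reclaim is faked by incrementing free_pages). */
#include <assert.h>

struct zone {
    long free_pages;
    long pages_low;    /* kswapd's floor */
    long pages_high;   /* kswapd may free up to this many */
};

/* kswapd pass: reclaim until every zone has more than pages_low free
 * pages, then stop (the real code may continue toward pages_high
 * depending on which pages it meets on the LRU queue). */
void kswapd_balance(struct zone *zones, int n)
{
    for (int i = 0; i < n; i++)
        while (zones[i].free_pages <= zones[i].pages_low)
            zones[i].free_pages++;     /* stand-in for LRU reclaim */
}

/* __alloc_pages() side: kswapd is only woken once no zone is above
 * pages_low any more, so lightly loaded zones get drained first. */
int alloc_needs_kswapd(const struct zone *zones, int n)
{
    for (int i = 0; i < n; i++)
        if (zones[i].free_pages > zones[i].pages_low)
            return 0;
    return 1;
}
```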

I'm curious what would be so "very broken" about this?

AFAICS it does most of what the classzone patch would achieve,
at lower complexity and better readability.

regards,

Rik


Re: [patch] balanced highmem subsystem under pre7-9

On Fri, 19 May 2000, Rik van Riel wrote:

>I'm curious what would be so "very broken" about this?

You start eating from ZONE_DMA before you have emptied ZONE_NORMAL.

>AFAICS it does most of what the classzone patch would achieve,
>at lower complexity and better readability.

I disagree.

Andrea



Re: [patch] balanced highmem subsystem under pre7-9

On Fri, 19 May 2000, Andrea Arcangeli wrote:

> On Fri, 19 May 2000, Rik van Riel wrote:
> 
> >I'm curious what would be so "very broken" about this?
> 
> You start eating from ZONE_DMA before you have emptied ZONE_NORMAL.

THIS IS NOT A BUG!

It's a feature. I don't see why you insist on calling this a problem.

We do NOT keep free memory around just for DMA allocations. We
fundamentally keep free memory around because the buddy allocator (_any_
allocator, in fact) needs some slop in order to do a reasonable job at
allocating contiguous page regions, for example. We keep free memory
around because that way we have a "buffer" to allocate from atomically, so
that when network traffic occurs or there is other behaviour that requires
memory without being able to free it on the spot, we have memory to give.

Keeping only DMA memory around would be =bad=. It would mean, for example,
that when a new packet comes in on the network, it would always be
allocated from the DMA region, because the normal zone hasn't even been
balanced ("why balance it when we still have DMA memory?"). And that would
be a huge mistake, because that would mean, for example, that by selecting
the right allocation patterns and by opening sockets without reading the
data they receive the right way, somebody could force all of DMA memory to
be used up by network allocations that wouldn't be free'd.

In short, your very fundamental premise is BROKEN, Andrea. We want to keep
normal memory around, even if there is low memory available. The same is
true of high memory, for similar reasons. 

Face it. The original zone-only code had problems. One of the worst
problems was that it would try to free up a lot of "normal" memory if it
got low on DMA memory. Those problems have pretty much been fixed, and
they had _nothing_ to do with your "class" patches. They were bugs, plain
and simple, not design mistakes.

If you think you should have zero free normal pages, YOU have a design
mistake. We should not be that black-and-white. The whole point in having
the min/low/high stuff is to make memory allocation less susceptible to
border conditions, and turn a black-and-white situation into more of a
"levels of gray" situation.

		Linus
