Message-ID: <3D3500AA.131CE2EB@zip.com.au>
Date: Wed, 17 Jul 2002 07:30:06 +0200
From: Andrew Morton <a...@zip.com.au>
X-Mailer: Mozilla 4.79 [en] (X11; U; Linux 2.4.19-pre9 i686)
X-Accept-Language: en
MIME-Version: 1.0
Subject: [patch 1/13] minimal rmap
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: robo...@news.nic.it
X-Mailing-List: linux-kernel@vger.kernel.org
Approved: robo...@news.nic.it (1.20)
NNTP-Posting-Host: a.27.anti-phl.bofh.it
Newsgroups: linux.kernel
Organization: linux.*_mail_to_news_unidirectional_gateway
Path: archiver1.google.com!news1.google.com!newsfeed.stanford.edu!
cyclone.bc.net!news.mailgate.org!bofh.it!robomod
X-Original-Cc: lkml <linux-ker...@vger.kernel.org>
X-Original-Date: Tue, 16 Jul 2002 22:29:14 -0700
X-Original-Sender: linux-kernel-ow...@vger.kernel.org
X-Original-To: Linus Torvalds <torva...@transmeta.com>
Lines: 1759


This is the "minimal rmap" patch, written by Rik, ported to 2.5 by Craig
Kulesa.

Basically,

before: When the page reclaim code decides that it has scanned too many
unreclaimable pages on the LRU it does a scan of process virtual
address spaces for pages to add to swapcache.  ptes pointing at the
page are unmapped as the scan proceeds.  When all ptes referring to a
page have been unmapped and it has been written to swap the page is
reclaimable.

after: When an anonymous page is encountered on the tail of the LRU we
use the rmap to check whether it has been referenced lately.  If it has
not, add it to swapcache.  When the page is again encountered on the LRU,
if it is still unreferenced then try to unmap all ptes which refer to it
in one hit, and if it is clean (ie: on swap) then free it.
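
As a rough illustration of the "after" behaviour, here is a userspace
sketch in C.  It is not code from the patch: every toy_* name, the
struct layouts and the TOY_WRITE_OUT case for a dirty page are
assumptions made up for the example.  It only shows how a per-page
chain of ptes lets reclaim decide about one page at a time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins; none of these are the kernel's real definitions. */
struct toy_pte {
	bool present;		/* pte currently maps the page */
	bool referenced;	/* stand-in for the hardware accessed bit */
};

struct toy_pte_chain {
	struct toy_pte *pte;
	struct toy_pte_chain *next;
};

struct toy_page {
	struct toy_pte_chain *chain;	/* reverse map: all ptes mapping the page */
	bool in_swapcache;
	bool dirty;			/* true while not yet written to swap */
};

enum toy_action { TOY_KEEP, TOY_ADD_TO_SWAPCACHE, TOY_WRITE_OUT, TOY_FREE };

/* Test and clear the referenced bit in every pte on the page's chain. */
static bool toy_page_referenced(struct toy_page *page)
{
	bool referenced = false;

	for (struct toy_pte_chain *pc = page->chain; pc; pc = pc->next) {
		if (pc->pte->present && pc->pte->referenced) {
			pc->pte->referenced = false;
			referenced = true;
		}
	}
	return referenced;
}

/* Unmap every pte which refers to the page in one pass. */
static void toy_unmap_all(struct toy_page *page)
{
	for (struct toy_pte_chain *pc = page->chain; pc; pc = pc->next)
		pc->pte->present = false;
	page->chain = NULL;
}

/* One encounter of an anonymous page at the tail of the LRU. */
static enum toy_action toy_reclaim_step(struct toy_page *page)
{
	assert(page != NULL);

	if (toy_page_referenced(page))
		return TOY_KEEP;		/* recently used, leave it alone */

	if (!page->in_swapcache) {
		page->in_swapcache = true;	/* first unreferenced encounter */
		return TOY_ADD_TO_SWAPCACHE;
	}

	toy_unmap_all(page);			/* second unreferenced encounter */
	return page->dirty ? TOY_WRITE_OUT : TOY_FREE;
}
```

Calling toy_reclaim_step() repeatedly on the same page walks it through
the stages: kept while referenced, then added to swapcache, then
unmapped and freed once it is clean.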

The rest of the VM - list management, the classzone concept, etc
remains unchanged.

There are a number of things which the per-page pte chain could be
used for.  Bill Irwin has identified the following.


(1)  page replacement no longer goes around randomly unmapping things

(2)  referenced bits are more accurate because there aren't several ms
        or even seconds between finding the multiple ptes mapping a page

(3)  reduces page replacement from O(total virtually mapped) to O(physical)

(4)  enables defragmentation of physical memory

(5)  enables cooperative offlining of memory for friendly guest instance
        behavior in UML and/or LPAR settings

(6)  demonstrable benefit in performance of swapping which is common in
        end-user interactive workstation workloads (I don't like the word
        "desktop"). c.f. Craig Kulesa's post wrt. swapping performance

(7)  evidence from 2.4-based rmap trees indicates approximate parity
        with mainline in kernel compiles with appropriate locking bits

(8)  partitioning of physical memory can reduce the complexity of page
        replacement searches by scanning only the "interesting" zones
        implemented and merged in 2.4-based rmap

(9)  partitioning of physical memory can increase the parallelism of page
        replacement searches by independently processing different zones
        implemented, but not merged in 2.4-based rmap

(10) the reverse mappings may be used for efficiently keeping pte cache
        attributes coherent

(11) they may be used for virtual cache invalidation (with changes)

(12) the reverse mappings enable proper RSS limit enforcement
        implemented and merged in 2.4-based rmap



The code adds a pointer to struct page, consumes additional storage for
the pte chains and adds computational expense to the page reclaim code
(I measured it at 3% additional load during streaming I/O).  The
benefits which we get back for all this are, I must say, theoretical
and unproven.  If it has real advantages (or, indeed, disadvantages)
then why has nobody demonstrated them?
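
To give a feel for the storage cost, here is a small C sketch.  It
assumes a chain entry is just two pointers (a link and a pte pointer)
and that each mapping gets its own entry; the sketch_* names are
invented for illustration and the real layout in the patch may differ.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed shape of one chain entry: a link plus a pointer to the pte. */
struct sketch_pte_chain {
	struct sketch_pte_chain *next;
	void *ptep;
};

/* Push a caller-provided entry onto the head of a page's chain. */
static void sketch_chain_add(struct sketch_pte_chain **head,
			     struct sketch_pte_chain *entry, void *ptep)
{
	assert(head != NULL && entry != NULL);
	entry->ptep = ptep;
	entry->next = *head;
	*head = entry;
}

/*
 * Bytes of extra storage: one chain-head pointer per physical page
 * plus one chain entry per pte mapping.
 */
static size_t sketch_rmap_overhead(size_t nr_pages, size_t nr_mappings)
{
	return nr_pages * sizeof(struct sketch_pte_chain *) +
	       nr_mappings * sizeof(struct sketch_pte_chain);
}
```

Under these assumptions the per-page cost is fixed, while the chain
cost grows with the number of mappings, so heavily shared pages are
where the storage goes.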



There are a number of things remaining to be done:

1: Demonstrate the above advantages.

2: Make it work with pte-highmem  (Bill Irwin is signed up for this)

3: Optimisation: don't add pte_chains to non-shared pages (Dave
   McCracken's patch does this)

4: Move the pte_chains into highmem too (Bill, I guess)

5: per-cpu pte_chain freelists (Rik?)

6: maybe GC the pte_chain backing pages. (Seems unavoidable.  Rik?)

7: multithread the page reclaim code.  (I have patches).

8: clustered add-to-swap.  Not sure if I buy this.  anon pages are
   often well-ordered-by-virtual-address on the LRU, so it "just
   works" for benchmarky loads.  But there may be some other loads...

9: Fix bad IO latency in page reclaim (I have lame patches)

10: Develop tuning tools, use them.

11: The nightly updatedb run is still evicting everything.


Patch

.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Date: Wed, 17 Jul 2002 10:30:12 +0200
From: Russell King <r...@arm.linux.org.uk>
Subject: Re: [patch 1/13] minimal rmap
Message-ID: <20020717092446.A4329@flint.arm.linux.org.uk>
References: <3D3500AA.131CE2EB@zip.com.au>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.2.5.1i
In-Reply-To: <3D3500AA.131CE2EB@zip.com.au>; 
from akpm@zip.com.au on Tue, Jul 16, 2002 at 10:29:14PM -0700
Sender: robo...@news.nic.it
X-Mailing-List: linux-kernel@vger.kernel.org
Approved: robo...@news.nic.it (1.20)
NNTP-Posting-Host: a.645.anti-phl.bofh.it
Newsgroups: linux.kernel
Organization: linux.*_mail_to_news_unidirectional_gateway
Path: archiver1.google.com!news1.google.com!newsfeed.stanford.edu!
news-spur1.maxwell.syr.edu!news.maxwell.syr.edu!nntp.infostrada.it!
bofh.it!robomod
X-Original-Cc: Linus Torvalds <torva...@transmeta.com>,
	lkml <linux-ker...@vger.kernel.org>
X-Original-Date: Wed, 17 Jul 2002 09:24:46 +0100
X-Original-Sender: linux-kernel-ow...@vger.kernel.org
X-Original-To: Andrew Morton <a...@zip.com.au>
Lines: 53

On Tue, Jul 16, 2002 at 10:29:14PM -0700, Andrew Morton wrote:

I'm puzzling over this difference:

> --- /dev/null	Thu Aug 30 13:30:55 2001
> +++ 2.5.26-akpm/include/asm-arm/proc-armv/rmap.h	Tue Jul 16 21:59:40 2002
>...
> +static inline void pgtable_add_rmap(pte_t * ptep, struct mm_struct * mm, unsigned long address)
> +{
> +	struct page * page = virt_to_page(ptep);
> +
> +	page->mm = mm;
> +	page->index = address & ~((PTRS_PER_PTE * PAGE_SIZE) - 1);
> +}

and

> --- /dev/null	Thu Aug 30 13:30:55 2001
> +++ 2.5.26-akpm/include/asm-generic/rmap.h	Tue Jul 16 21:59:40 2002
> +static inline void pgtable_add_rmap(struct page * page, struct mm_struct * mm, unsigned long address)
> +{
> +#ifdef BROKEN_PPC_PTE_ALLOC_ONE
> +	/* OK, so PPC calls pte_alloc() before mem_map[] is setup ... ;( */
> +	extern int mem_init_done;
> +
> +	if (!mem_init_done)
> +		return;
> +#endif
> +	page->mapping = (void *)mm;
> +	page->index = address & ~((PTRS_PER_PTE * PAGE_SIZE) - 1);
> +}

Note that the ARM one seems to be using page->mm but everything else
uses page->mapping.
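
For reference, both versions compute the same page->index value: the
address rounded down to the start of the region covered by one page
table.  The sketch below works that through in C.  The constants are
assumptions (4 kB pages and 1024 ptes per table, as on non-PAE i386,
not ARM), and the sketch_* names and the second helper are invented
for illustration.

```c
#include <assert.h>

/* Assumed values for illustration: 4 kB pages, 1024 ptes per table. */
#define SKETCH_PAGE_SIZE	4096UL
#define SKETCH_PTRS_PER_PTE	1024UL

/* Same masking as in the quoted pgtable_add_rmap(). */
static unsigned long sketch_pgtable_base(unsigned long address)
{
	return address & ~((SKETCH_PTRS_PER_PTE * SKETCH_PAGE_SIZE) - 1);
}

/*
 * Going the other way: the virtual address mapped by slot pte_index
 * of a page table whose covered region starts at base.
 */
static unsigned long sketch_pte_address(unsigned long base,
					unsigned long pte_index)
{
	assert(pte_index < SKETCH_PTRS_PER_PTE);
	return base + pte_index * SKETCH_PAGE_SIZE;
}
```

With these constants one page table covers 4 MB, so address 0x08049123
gives a base of 0x08000000, and slot 0x49 of that table maps
0x08049000.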

Also, this comment:

> + * ARM is different since hardware page tables are smaller than
> + * the page size and Linux uses a "duplicate" one with extra info.
> + * For rmap this means that the first 2 kB of a page are the hardware
> + * page tables and the last 2 kB are the software page tables.

is no longer true for 2.5 (although it is still true for 2.4.)

-- 
Russell King (r...@arm.linux.org.uk)                The developer of ARM Linux
             http://www.arm.linux.org.uk/personal/aboutme.html


Date: Wed, 17 Jul 2002 14:20:08 +0200
From: Rik van Riel <r...@conectiva.com.br>
X-X-Sender: r...@imladris.surriel.com
Subject: Re: [patch 1/13] minimal rmap
In-Reply-To: <20020717092446.A4329@flint.arm.linux.org.uk>
Message-ID: <Pine.LNX.4.44L.0207170908130.12241-100000@imladris.surriel.com>
X-Spambait: aardv...@kernelnewbies.org
X-Spammeplease: aardv...@nl.linux.org
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: robo...@news.nic.it
X-Mailing-List: linux-kernel@vger.kernel.org
Approved: robo...@news.nic.it (1.20)
NNTP-Posting-Host: a.866.anti-phl.bofh.it
Newsgroups: linux.kernel
Organization: linux.*_mail_to_news_unidirectional_gateway
Path: archiver1.google.com!news1.google.com!newsfeed.stanford.edu!
newsmi-us.news.garr.it!newsmi-eu.news.garr.it!newsrm.news.garr.it!
NewsITBone-GARR!newsfeeder.edisontel.com!bofh.it!robomod
References: <20020717092446.A4329@flint.arm.linux.org.uk>
X-Original-Cc: Andrew Morton <a...@zip.com.au>,
	Linus Torvalds <torva...@transmeta.com>,
	lkml <linux-ker...@vger.kernel.org>
X-Original-Date: Wed, 17 Jul 2002 09:10:08 -0300 (BRT)
X-Original-Sender: linux-kernel-ow...@vger.kernel.org
X-Original-To: Russell King <r...@arm.linux.org.uk>
Lines: 31

On Wed, 17 Jul 2002, Russell King wrote:

> I'm puzzling over this difference:
>
> > --- /dev/null	Thu Aug 30 13:30:55 2001
> > +++ 2.5.26-akpm/include/asm-arm/proc-armv/rmap.h	Tue Jul 16 21:59:40 2002

Then I guess I messed up the ARM rmap.h for 2.5.

I knew it had to be different than the 2.4 one somehow and
was under the impression that you changed the pagetable
layout in 2.5 to have "4 kB page tables" with 2 kB hardware
and 2 kB software page tables in the same page.

The page->mm thing is a stupid, stupid typo.

I guess akpm didn't have an ARM machine for testing, either ;)

regards,

Rik
-- 
Bravely reimplemented by the knights who say "NIH".

http://www.surriel.com/		http://distro.conectiva.com/


Date: Wed, 17 Jul 2002 14:30:08 +0200
From: Rik van Riel <r...@conectiva.com.br>
X-X-Sender: r...@imladris.surriel.com
Subject: Re: [patch 1/13] minimal rmap
In-Reply-To: <3D3500AA.131CE2EB@zip.com.au>
Message-ID: <Pine.LNX.4.44L.0207170914060.12241-100000@imladris.surriel.com>
X-Spambait: aardv...@kernelnewbies.org
X-Spammeplease: aardv...@nl.linux.org
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: robo...@news.nic.it
X-Mailing-List: linux-kernel@vger.kernel.org
Approved: robo...@news.nic.it (1.20)
NNTP-Posting-Host: a.897.anti-phl.bofh.it
Newsgroups: linux.kernel
Organization: linux.*_mail_to_news_unidirectional_gateway
Path: archiver1.google.com!news1.google.com!newsfeed.stanford.edu!
news-spur1.maxwell.syr.edu!news.maxwell.syr.edu!newsfeed.icl.net!
newsfeed.fjserv.net!news.mailgate.org!bofh.it!robomod
References: <3D3500AA.131CE2EB@zip.com.au>
X-Original-Cc: Linus Torvalds <torva...@transmeta.com>,
	lkml <linux-ker...@vger.kernel.org>
X-Original-Date: Wed, 17 Jul 2002 09:21:50 -0300 (BRT)
X-Original-Sender: linux-kernel-ow...@vger.kernel.org
X-Original-To: Andrew Morton <a...@zip.com.au>
Lines: 59

On Tue, 16 Jul 2002, Andrew Morton wrote:

> The rest of the VM - list management, the classzone concept, etc
> remains unchanged.

> 5: per-cpu pte_chain freelists (Rik?)

Will look into this soon.

> 6: maybe GC the pte_chain backing pages. (Seems unavoidable.  Rik?)

And probably into this, if it turns out that we're wasting
too much memory in no longer used pte_chains in real workloads,
which will probably happen ;)

> 7: multithread the page reclaim code.  (I have patches).

Rmap for 2.4 also has some code which could be used for this.

> 8: clustered add-to-swap.  Not sure if I buy this.  anon pages are
>    often well-ordered-by-virtual-address on the LRU, so it "just
>    works" for benchmarky loads.  But there may be some other loads...

Benchmarky loads without a working set probably aren't all that
suitable for evaluating page replacement. VM (and general caching)
works _because_ of the working set property.

Does anybody know of a working set simulator we could use to test
things like this?

> 9: Fix bad IO latency in page reclaim (I have lame patches)
>
> 10: Develop tuning tools, use them.
>
> 11: The nightly updatedb run is still evicting everything.

That's the "minimal" part of "minimal rmap" ;))

Ed Tomlinson has some code for 11), which should be mergeable
soon.  In combination with changed page replacement priorities
we'll be able to make sure updatedb won't evict everything.

The importance of rmap here is making sure we're _able_ to do
this kind of tuning instead of tweaking the same magic knobs
we've (unsuccessfully) tweaked in the last 8 years.

regards,

Rik
-- 
Bravely reimplemented by the knights who say "NIH".

http://www.surriel.com/		http://distro.conectiva.com/
