shm bug introduced with pagecache in 2.3.11

[Patch] shm bug introduced with pagecache in 2.3.11
Christoph Rohland (hans-christoph.rohland@sap.com)
11 Nov 1999 16:43:08 +0100 

--Multipart_Thu_Nov_11_16:43:08_1999-1
Content-Type: text/plain; charset=US-ASCII

Hi Linus,


Finally shm swapping seems to work for me again. The following patch
fixes a refcounting bug which got introduced with 2.3.11. (Thanks to
Manfred who finally found the right point). It survived a lot of swap
stress testing on UP/32MB up to 8xSMP/8GB.


The patch also fixes some int/size_t issues.


Could you please apply this.


Greetings
Christoph



--Multipart_Thu_Nov_11_16:43:08_1999-1
Content-Type: text/plain; charset=US-ASCII
Content-Disposition: attachment; filename="patch-27.6-shm4"
Content-Transfer-Encoding: 7bit


--- 2.3.27-pre6/ipc/shm.c Thu Nov 11 10:33:16 1999
+++ make27/ipc/shm.c Thu Nov 11 14:47:57 1999
@@ -206,7 +206,7 @@
struct shmid_kernel *shp;
int numpages = (size + PAGE_SIZE -1) >> PAGE_SHIFT;
int id, err;
- unsigned int shmall, shmmni;
+ size_t shmall, shmmni;

shmall = shm_prm[1];
shmmni = shm_prm[2];
@@ -378,13 +378,16 @@
case IPC_INFO:
{
struct shminfo shminfo;
+ size_t shmmax;
+
spin_unlock(&shm_lock);
err = -EFAULT;
if (!buf)
goto out;

+ shmmax=shm_prm[0];
+ shminfo.shmmax = shmmax > UINT_MAX ? UINT_MAX : shmmax;
shminfo.shmmni = shminfo.shmseg = shm_prm[2];
- shminfo.shmmax = shm_prm[0];
shminfo.shmall = shm_prm[1];

shminfo.shmmin = SHMMIN;
@@ -791,11 +794,14 @@
if (!page) {
lock_kernel();
swapin_readahead(entry);
+ if (pte_val(pte) != pte_val(SHM_ENTRY(shp, idx))) goto again;
page = read_swap_cache(entry);
unlock_kernel();
if (!page)
goto oom;
}
+ if (pte_val(pte) != pte_val(SHM_ENTRY(shp, idx)))
+ goto changed;
delete_from_swap_cache(page);
page = replace_with_highmem(page);
lock_kernel();
@@ -803,9 +809,6 @@
unlock_kernel();
spin_lock(&shm_lock);
shm_swp--;
- pte = SHM_ENTRY(shp, idx);
- if (pte_present(pte))
- goto present;
}
shm_rss++;
pte = pte_mkdirty(mk_pte(page, PAGE_SHARED));
@@ -813,8 +816,6 @@
} else
--current->maj_flt; /* was incremented in do_no_page */

-done:
- /* pte_val(pte) == SHM_ENTRY (shp, idx) */
get_page(pte_page(pte));
spin_unlock(&shm_lock);
current->min_flt++;
@@ -823,10 +824,6 @@
changed:
__free_page(page);
goto again;
-present:
- if (page)
- free_page_and_swap_cache(page);
- goto done;
oom:
return NOPAGE_OOM;
}



--Multipart_Thu_Nov_11_16:43:08_1999-1--
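
The int/size_t part of the patch comes down to saturating a wide limit before it is stored in a narrower field, as the IPC_INFO hunk does for shmmax (presumably because that field is narrower than size_t). A minimal user-space sketch of the clamp follows; `clamp_to_uint` is a hypothetical helper name used only for illustration, not something from ipc/shm.c:

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical helper: report a size_t limit through an unsigned int
 * field, saturating at UINT_MAX instead of silently truncating. */
static unsigned int clamp_to_uint(size_t value)
{
    return value > UINT_MAX ? UINT_MAX : (unsigned int)value;
}
```

On a platform where size_t is no wider than unsigned int the comparison is never true and the helper is a plain conversion.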


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/

Re: [Patch] shm bug introduced with pagecache in 2.3.11
Andrea Arcangeli (andrea@suse.de)
Thu, 11 Nov 1999 22:12:20 +0100 (CET) 

On 11 Nov 1999, Christoph Rohland wrote:

>The patch also fixes some int/size_t issues.


The patch is buggy. In this path:


swapin_readahead(entry);
+ if (pte_val(pte) != pte_val(SHM_ENTRY(shp, idx))) goto again;
page = read_swap_cache(entry);


you `goto again` without first releasing the big kernel lock and without
reacquiring the shm lock.


But this is a minor implementation issue. There is a worse problem. The
real issue is that in SMP even removing the readahead is still racy. All
the checks for the pte you added are racy.


This patch of mine should fix all races (both UP and SMP). Please try it out.
It's against 2.3.27pre4. Just as the anonymous swapin is protected by the
per-mm semaphore, in shm we must protect ourselves with a per-shm-segment
semaphore to handle the swapin case safely. The design is then almost the
same as in the anonymous swapin.


diff -urN 2.3.27pre4/include/linux/swap.h shm/include/linux/swap.h
--- 2.3.27pre4/include/linux/swap.h Thu Nov 11 18:23:09 1999
+++ shm/include/linux/swap.h Thu Nov 11 21:20:36 1999
@@ -112,8 +112,10 @@
extern struct swap_info_struct swap_info[];
extern int is_swap_partition(kdev_t);
extern void si_swapinfo(struct sysinfo *);
-extern swp_entry_t get_swap_page(void);
-extern void swap_free(swp_entry_t);
+extern swp_entry_t __get_swap_page(unsigned short);
+#define get_swap_page() __get_swap_page(1)
+extern void __swap_free(swp_entry_t, unsigned short);
+#define swap_free(entry) __swap_free((entry), 1)
struct swap_list_t {
int head; /* head of priority-ordered swapfile list */
int next; /* swapfile to be used next */
diff -urN 2.3.27pre4/ipc/shm.c shm/ipc/shm.c
--- 2.3.27pre4/ipc/shm.c Wed Nov 10 16:59:27 1999
+++ shm/ipc/shm.c Thu Nov 11 21:58:28 1999
@@ -36,6 +36,7 @@
pte_t **shm_dir; /* ptr to array of ptrs to frames -> SHMMAX */ 
struct vm_area_struct *attaches; /* descriptors for attaches */
int id; /* backreference to id for shm_close */
+ struct semaphore sem;
};

static int findkey (key_t key);
@@ -61,6 +62,9 @@
static unsigned int num_segs = 0;
static unsigned short shm_seq = 0; /* incremented, for recognizing stale ids */

+/* locks order:
+ shm_lock -> pagecache_lock (end of shm_swap)
+ shp->sem -> other spinlocks (shm_nopage) */
spinlock_t shm_lock = SPIN_LOCK_UNLOCKED;

/* some statistics */
@@ -260,6 +264,7 @@
shp->u.shm_ctime = CURRENT_TIME;
shp->shm_npages = numpages;
shp->id = id;
+ init_MUTEX(&shp->sem);

spin_lock(&shm_lock);

@@ -770,10 +775,13 @@
idx = (address - shmd->vm_start) >> PAGE_SHIFT;
idx += shmd->vm_pgoff;

+ down(&shp->sem);
spin_lock(&shm_lock);
-again:
pte = SHM_ENTRY(shp,idx);
if (!pte_present(pte)) {
+ /* page not present so shm_swap can't race with us
+ and the semaphore protects us by other tasks that
+ could potentially fault on our pte under us */
if (pte_none(pte)) {
spin_unlock(&shm_lock);
page = get_free_highpage(GFP_HIGHUSER);
@@ -781,8 +789,6 @@
goto oom;
clear_highpage(page);
spin_lock(&shm_lock);
- if (pte_val(pte) != pte_val(SHM_ENTRY(shp, idx)))
- goto changed;
} else {
swp_entry_t entry = pte_to_swp_entry(pte);

@@ -803,9 +809,6 @@
unlock_kernel();
spin_lock(&shm_lock);
shm_swp--;
- pte = SHM_ENTRY(shp, idx);
- if (pte_present(pte))
- goto present;
}
shm_rss++;
pte = pte_mkdirty(mk_pte(page, PAGE_SHARED));
@@ -813,21 +816,15 @@
} else
--current->maj_flt; /* was incremented in do_no_page */

-done:
/* pte_val(pte) == SHM_ENTRY (shp, idx) */
get_page(pte_page(pte));
spin_unlock(&shm_lock);
+ up(&shp->sem);
current->min_flt++;
return pte_page(pte);

-changed:
- __free_page(page);
- goto again;
-present:
- if (page)
- free_page_and_swap_cache(page);
- goto done;
oom:
+ up(&shp->sem);
return NOPAGE_OOM;
}

@@ -851,7 +848,11 @@
if (!counter)
return 0;
lock_kernel();
- swap_entry = get_swap_page();
+ /* subtle: preload the swap count for the swap cache. We can't
+ increase the count inside the critical section as we can't release
+ the shm_lock there. And we can't acquire the big lock with the
+ shm_lock held (otherwise we would deadlock too easily). */
+ swap_entry = __get_swap_page(2);
if (!swap_entry.val) {
unlock_kernel();
return 0;
@@ -893,7 +894,7 @@
failed:
spin_unlock(&shm_lock);
lock_kernel();
- swap_free(swap_entry);
+ __swap_free(swap_entry, 2);
unlock_kernel();
return 0;
}
@@ -905,11 +906,16 @@
swap_successes++;
shm_swp++;
shm_rss--;
+
+ /* add the locked page to the swap cache before allowing
+ the swapin path to run lookup_swap_cache(). This avoids
+ reading a not yet uptodate block from disk.
+ NOTE: we just accounted the swap space reference for this
+ swap cache page at __get_swap_page() time. */
+ add_to_swap_cache(page_map, swap_entry);
spin_unlock(&shm_lock);

lock_kernel();
- swap_duplicate(swap_entry);
- add_to_swap_cache(page_map, swap_entry);
rw_swap_page(WRITE, page_map, 0);
unlock_kernel();

diff -urN 2.3.27pre4/mm/swapfile.c shm/mm/swapfile.c
--- 2.3.27pre4/mm/swapfile.c Sun Nov 7 17:33:38 1999
+++ shm/mm/swapfile.c Thu Nov 11 21:38:01 1999
@@ -25,7 +25,7 @@

#define SWAPFILE_CLUSTER 256

-static inline int scan_swap_map(struct swap_info_struct *si)
+static inline int scan_swap_map(struct swap_info_struct *si, unsigned short count)
{
unsigned long offset;
/* 
@@ -73,7 +73,7 @@
si->lowest_bit++;
if (offset == si->highest_bit)
si->highest_bit--;
- si->swap_map[offset] = 1;
+ si->swap_map[offset] = count;
nr_swap_pages--;
si->cluster_next = offset+1;
return offset;
@@ -81,7 +81,7 @@
return 0;
}

-swp_entry_t get_swap_page(void)
+swp_entry_t __get_swap_page(unsigned short count)
{
struct swap_info_struct * p;
unsigned long offset;
@@ -94,11 +94,13 @@
goto out;
if (nr_swap_pages == 0)
goto out;
+ if (count >= SWAP_MAP_MAX)
+ goto bad_count;

while (1) {
p = &swap_info[type];
if ((p->flags & SWP_WRITEOK) == SWP_WRITEOK) {
- offset = scan_swap_map(p);
+ offset = scan_swap_map(p, count);
if (offset) {
entry = SWP_ENTRY(type,offset);
type = swap_info[type].next;
@@ -123,10 +125,15 @@
}
out:
return entry;
+
+bad_count:
+ printk(KERN_ERR "get_swap_page: bad count %hd from %p\n",
+ count, __builtin_return_address(0));
+ goto out;
}


-void swap_free(swp_entry_t entry)
+void __swap_free(swp_entry_t entry, unsigned short count)
{
struct swap_info_struct * p;
unsigned long offset, type;
@@ -148,7 +155,9 @@
if (!p->swap_map[offset])
goto bad_free;
if (p->swap_map[offset] < SWAP_MAP_MAX) {
- if (!--p->swap_map[offset]) {
+ if (p->swap_map[offset] < count)
+ goto bad_count;
+ if (!(p->swap_map[offset] -= count)) {
if (offset < p->lowest_bit)
p->lowest_bit = offset;
if (offset > p->highest_bit)
@@ -170,6 +179,9 @@
goto out;
bad_free:
printk("VM: Bad swap entry %08lx\n", entry.val);
+ goto out;
+bad_count:
+ printk(KERN_ERR "VM: Bad count %hd current count %hd\n", count, p->swap_map[offset]);
goto out;
}



The only ordering rule I added is that shm_lock must be acquired _before_
pagecache_lock.


I am stressing the code with your shmtst on SMP and it works fine here.


I suggest applying my race fixes to the stock kernel as the design looks
like the right one to me now.


Andrea
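
The swap-count change in the patch above can be pictured with a toy reference-count table: a slot is handed out with its count preloaded (2 in shm_swap, one for the entry itself and one for the swap cache page), and the failure path drops the same number of references in a single call. The sketch below is a user-space model under those assumptions; `toy_map`, `toy_get_slot`, `toy_put_slot` and the table size are hypothetical and only mirror the shape of `__get_swap_page()` and `__swap_free()`:

```c
#include <stddef.h>

#define TOY_SLOTS   8       /* hypothetical table size */
#define TOY_MAP_MAX 0x7fff  /* stand-in for SWAP_MAP_MAX */

static unsigned short toy_map[TOY_SLOTS]; /* 0 means the slot is free */

/* Allocate a slot with `count` references already taken.
 * Returns the slot index, or 0 on failure. Slot 0 is never handed out,
 * mirroring the way the patch treats a zero entry as "no swap page". */
static size_t toy_get_slot(unsigned short count)
{
    size_t i;

    if (count == 0 || count >= TOY_MAP_MAX)
        return 0;
    for (i = 1; i < TOY_SLOTS; i++) {
        if (toy_map[i] == 0) {
            toy_map[i] = count;
            return i;
        }
    }
    return 0;
}

/* Drop `count` references at once.
 * Returns 1 if the slot became free, 0 if references remain,
 * and -1 if the request would underflow or names a bad slot. */
static int toy_put_slot(size_t slot, unsigned short count)
{
    if (slot == 0 || slot >= TOY_SLOTS || count == 0)
        return -1;
    if (toy_map[slot] < count)
        return -1;
    toy_map[slot] -= count;
    return toy_map[slot] == 0;
}
```

Preloading the count at allocation time means no second reference has to be taken later, which in the patch is the point where shm_lock is held and the big kernel lock cannot be acquired.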




Re: [Patch] shm bug introduced with pagecache in 2.3.11
Manfred (manfreds@colorfullife.com)
Fri, 12 Nov 1999 05:05:34 -0500 (EST) 

> 
> On 11 Nov 1999, Christoph Rohland wrote:
> 
> >The patch also fixes some int/size_t issues.
> But this is a minor implementation issue. There is a worse problem. The
> real issue is that in SMP even removing the readahead is still racy. All
> the checks for the pte you added are racy.
The current code is UP only. There are new ipc helper functions in
ipc/util.h and I'll convert the code RSN.

> 
> This patch of mine should fix all races (both UP and SMP). Please try it out.
> It's against 2.3.27pre4. Just as the anonymous swapin is protected by the
> per-mm semaphore, in shm we must protect ourselves with a per-shm-segment
> semaphore to handle the swapin case safely. The design is then almost the
> same as in the anonymous swapin.
Interesting idea. I thought about acquiring the kernel lock a bit earlier,
but perhaps I can avoid that with a semaphore.

> 
> The only ordering rule I added is that shm_lock must be acquired _before_
> pagecache_lock.
> 
Yes.


> I am stressing the code with your shmtst on SMP and it works fine here.
> 
> I suggest applying my race fixes to the stock kernel as the design looks
> like the right one to me now.
> 
I don't like the semaphore, because (AFAICS, I'm only looking at the diff)
you single-thread the swapin code (per-segment, but still single thread)


I'll think about it,


Manfred



Re: [Patch] shm bug introduced with pagecache in 2.3.11
Linus Torvalds (torvalds@transmeta.com)
Fri, 12 Nov 1999 04:09:24 -0800 (PST) 

On Fri, 12 Nov 1999, Manfred wrote:
> 
> I don't like the semaphore, because (AFAICS, I'm only looking at the diff)
> you single-thread the swapin code (per-segment, but still single thread)

I think the semaphore is a good idea, if only because it makes things much
more obviously correct - exactly because of the clear serialization. And I
don't think the serialization is a performance problem, because by the
time you start paging we're not talking about high performance shared
memory anyway, and because it's per-segment it is not going to make
"system" performance any worse.

In fact, my reaction to the semaphore is "do we actually need the
spinlock any more"?

Linus


Re: [Patch] shm bug introduced with pagecache in 2.3.11
Manfred Spraul (manfreds@colorfullife.com)
Fri, 12 Nov 1999 16:05:22 +0100 

Linus Torvalds wrote:
> 
> On Fri, 12 Nov 1999, Manfred wrote:
> >
> > I don't like the semaphore, because (AFAICS, I'm only looking at the diff)
> > you single-thread the swapin code (per-segment, but still single thread)
> 
> I think the semaphore is a good idea, if only because it makes things much
> more obviously correct - exactly because of the clear serialization.

I agree that the current code is a total mess (I have converted it to
the ipc/util.h helper functions, and I found further SMP and UP races).
_If_ I find a simple serialization, then I'll kill the semaphore.


> And I don't think the serialization is a performance problem, because
> by the time you start paging we're not talking about high performance
> shared memory anyway, and because it's per-segment it is not going to make
> "system" performance any worse.
>
What about a 100-gigabyte shm segment (on a 64-bit platform) with a fast
scsi disk system? The semaphore will prevent any tagged commands, and it
will downgrade (performance wise) the scsi system to a slow ide disk.


Btw, I'm sure that for multi-threaded applications, the mmap performance
of Linux will be poor because everything is single-threaded. I'll
write a benchmark and compare it with WinNT/Win95.


>
> In fact, my reaction to the semaphore is "do we actually need the
> spinlock any more"?
> 
shm_swap() must not acquire a semaphore, or we could lock-up during
low-memory.



--
Manfred



Re: [Patch] shm bug introduced with pagecache in 2.3.11
Linus Torvalds (torvalds@transmeta.com)
Fri, 12 Nov 1999 10:32:16 -0800 (PST) 

On Fri, 12 Nov 1999, Manfred Spraul wrote:
> 
> > And I don't think the serialization is a performance problem, because
> > by the time you start paging we're not talking about high performance
> > shared memory anyway, and because it's per-segment it is not going to make
> > "system" performance any worse.
>
> What about a 100-gigabyte shm segment (on a 64-bit platform) with a fast
> scsi disk system? The semaphore will prevent any tagged commands, and it
> will downgrade (performance wise) the scsi system to a slow ide disk.

Nope. The swap-in read-ahead still works - the _only_ thing the semaphore
does is serialize different processes accessing the same area, and that's
as likely to improve performance as to degrade it (potentially less
seeking).


> Btw, I'm sure that for multi-threaded applications, the mmap performance
> of Linux will be poor because everything is single-threaded. I'll
> write a benchmark and compare it with WinNT/Win95.


I will bet you 5 bucks we'll kick ass.


Linus




Re: [Patch] shm bug introduced with pagecache in 2.3.11
Manfred Spraul (manfreds@colorfullife.com)
Sat, 13 Nov 1999 02:09:25 +0100 

Linus Torvalds wrote:
> 
> > Btw, I'm sure that for multi-threaded applications, the mmap performance
> > of Linux will be poor because everything is single-threaded. I'll
> > write a benchmark and compare it with WinNT/Win95.
> 
> I will bet you 5 bucks we'll kick ass.
> 
You've lost:

Computer: K6-200, 128 MB Ram, Symbios 810 scsi controller, Fujitsu
Magneto-Optical drive, 620 MB [I have no empty scsi disc left :(],
620,000,000 bytes test file, fat filesystem, the same disk is used for
NT and Linux.

command: "./pagein fill 150000 #" where fill is the filename, 150000
means 150000 pages are trashed, and # is the number of threads.

Linux:
  #    pages/sec
  1    13
  4    14
 64    14
256    ?  [computer unresponsive]

NT:
  #    pages/sec
  1    18
  4    20
 64    28
256    31
512    33

Linux is slower, and it cannot use multiple threads to reorder the
sector reads;
NT gets faster if I add further threads.

source code is at
http://colorfullife.com/~manfreds/pagein/pagein.cpp

--
Manfred


Re: [Patch] shm bug introduced with pagecache in 2.3.11
Alan Cox (alan@lxorguk.ukuu.org.uk)
Sat, 13 Nov 1999 01:33:20 +0000 (GMT) 

> You've lost:

> Computer: K6-200, 128 MB Ram, Symbios 810 scsi controller, Fujitsu
> Magneto-Optical drive, 620 MB [I have no empty scsi disc left :(],


So you benchmarked with a very slow I/O device. OK, that should mean it's
silly numbers for both, tied entirely to the seek rate of the media.


> 620,000,000 bytes test file, fat filesystem, the same disk is used for
> NT and Linux.


Linux FAT performance is slow. Try NTFS (or FAT) versus ext2. That would
be interesting.




Re: [Patch] shm bug introduced with pagecache in 2.3.11
Manfred (manfreds@colorfullife.com)
Sat, 13 Nov 1999 09:48:58 +0100 

From: Alan Cox <alan@lxorguk.ukuu.org.uk>
> > You've lost:
>
> > Computer: K6-200, 128 MB Ram, Symbios 810 scsi controller, Fujitsu
> > Magneto-Optical drive, 620 MB [I have no empty scsi disc left :(],
>
> So you benchmarked with a very slow I/O device. OK, that should mean it's
> silly numbers for both, tied entirely to the seek rate of the media.
>
Yes, intentionally, that was the slowest disk I found:
Linux single-threads the paging I/O, i.e. it cannot reorder the read
operations.
I wrote that this is a huge disadvantage, and the numbers show that.

> > 620,000,000 bytes test file, fat filesystem, the same disk is used for
> > NT and Linux.
>
> Linux FAT performance is slow. Try NTFS (or FAT) versus ext2. That would
> be interesting.
I'll try it with a faster disk, but initial tests show that:
- NT gets faster if I add further threads
- Linux cannot reorder the disk io, and it remains at the same performance
for 1 thread and for 64 threads.
- the benchmark is io bound, ie the internal efficiency of the os doesn't
matter.


Jeff Garzik wrote:
> Is this test done on kernel 2.3.28?
2.3.27


--
Manfred





Re: [Patch] shm bug introduced with pagecache in 2.3.11
Alan Cox (alan@lxorguk.ukuu.org.uk)
Sat, 13 Nov 1999 14:21:11 +0000 (GMT) 

> Yes, intentionally, that was the slowest disk I found:
> Linux single-threads the paging I/O, i.e. it cannot reorder the read
> operations.
> I wrote that this is a huge disadvantage, and the numbers show that.

Ok now I understand what you are trying to show. That would make sense.


Alan




Re: [Patch] shm bug introduced with pagecache in 2.3.11
Manfred Spraul (manfreds@colorfullife.com)
Sat, 13 Nov 1999 16:15:47 +0100 

Alan Cox wrote:
> So you benchmarked with a very slow I/O device.
>
> Linux FAT performance is slow. Try NTFS (or FAT) versus ext2. That would
> be interesting.

Ok, I switched to a Seagate ST34520N (7200 rpm, scsi2 narrow, 4.5 GB),
and I added a new test: Linux-multi-thread vs Linux-multi-process. The
results are as I expected:


-Linux-multi-process is more or less on par with NT. The 20% difference
could be the thread/process overhead.
-Linux-multi-thread is sloww.


450000 pages test file, ext2 and NTFS, 128 MB ram, Sym810 controller,
AMD K6/200


# is the number of threads/processes which are running.


#     Linux-threads   Linux-processes   NT (threads)
1     51              51                60
16    51              67                96
64    50              73                105
128   48              75                107


The modified source code is at
http://colorfullife.com/~manfreds/pagein/pagein.cpp



--
Manfred



Re: [Patch] shm bug introduced with pagecache in 2.3.11
Gerard Roudier (groudier@club-internet.fr)
Sat, 13 Nov 1999 18:49:10 +0100 (MET) 

Hi Manfred,

Would it be possible for you to run benchmarks against O/Ses whose source
code we have access to, instead of binary-only ones? This would allow us
to learn a lot more from the differences. For example, FreeBSD is as
simple to install as Redhat, and a base system will consume far less disk
space than NT.

Basically, I am not interested at all in your benchmarks, because my
personal box has only free O/Ses installed.

Maybe you will reply that Linux is mostly competing against NT
nowadays. Anyway, ignoring other free O/Ses seems scornful to me,
given the synergy that existed and still exists in some places.

Gérard.

On Sat, 13 Nov 1999, Manfred Spraul wrote:

> Alan Cox wrote:
> > So you benchmarked with a very slow I/O device.
> >
> > Linux FAT performance is slow. Try NTFS (or FAT) versus ext2. That would
> > be interesting.
> 
> Ok, I switched to a Seagate ST34520N (7200 rpm, scsi2 narrow, 4.5 GB),
> and I added a new test: Linux-multi-thread vs Linux-multi-process. The
> results are as I expected:
> 
> -Linux-multi-process is more or less on par with NT. The 20% difference
> could be the thread/process overhead.
> -Linux-multi-thread is sloww.
> 
> 450000 pages test file, ext2 and NTFS, 128 MB ram, Sym810 controller,
> AMD K6/200
> 
> # is the number of threads/processes which are running.
> 
> #     Linux-threads   Linux-processes   NT (threads)
> 1     51              51                60
> 16    51              67                96
> 64    50              73                105
> 128   48              75                107
> 
> The modified source code is at
> http://colorfullife.com/~manfreds/pagein/pagein.cpp
> 
> --
> Manfred


Re: [Patch] shm bug introduced with pagecache in 2.3.11
Manfred Spraul (manfreds@colorfullife.com)
Sat, 13 Nov 1999 18:55:53 +0100 

Gerard Roudier wrote:
> 
> Hi Manfred,
> 
> Would it be possible for you to run benchmarks against O/Ses whose source
> code we have access to, instead of binary-only ones? This would allow us
> to learn a lot more from the differences. For example, FreeBSD is as
> simple to install as Redhat, and a base system will consume far less disk
> space than NT.
> 
Source code is at http://colorfullife.com/~manfreds/pagein/pagein.cpp;
I don't have FreeBSD.


--
Manfred


Re: [Patch] shm bug introduced with pagecache in 2.3.11
Dominik Kubla (dominik.kubla@uni-mainz.de)
Sat, 13 Nov 1999 19:11:49 +0100 

On Sat, Nov 13, 1999 at 06:55:53PM +0100, Manfred Spraul wrote:
> Gerard Roudier wrote:
> > 
> > Hi Manfred,
> > 
> > Would it be possible for you to run benchmarks against O/Ses whose source
> > code we have access to, instead of binary-only ones? This would allow us
> > to learn a lot more from the differences. For example, FreeBSD is as
> > simple to install as Redhat, and a base system will consume far less disk
> > space than NT.
> > 
> Source code is at http://colorfullife.com/~manfreds/pagein/pagein.cpp;
> I don't have FreeBSD.

Gerard was referring to the source code of the _OS_, not your benchmark!
And I have to agree with him: there is no way to understand what an OS
is really doing without looking at the source. (Reminds me of our X11
benches back in the "old times": only by running them on really slow
hardware could we see that some commercial servers were "optimized for
benchmarks" - they simply skipped some drawing operations. DOH!)

As for not having FreeBSD: simply look at www.freebsd.org...

Yours,
Dominik Kubla


Re: [Patch] shm bug introduced with pagecache in 2.3.11
Manfred Spraul (manfreds@colorfullife.com)
Sat, 13 Nov 1999 21:00:52 +0100 

Dominik Kubla wrote:
> Gerard was referring to the source code of the _OS_, not your benchmark!
> And I have to agree with him: there is no way to understand what an OS
> is really doing without looking at the source.

In this case you don't need the source code:
Do you have a really noisy drive with a slow seek time?
Then you would hear the difference:
- WinNT and Linux-fork sound 'round' with lots of threads/processes, and
the performance increases.


- Linux-multithread always sounds identical (1 thread or 64); the
performance doesn't change.


You don't need to be a rocket scientist to figure out that the cause is
the mmap semaphore, ie that Linux single threads the io for
multi-threaded applications.
Linux with multiple processes or WinNT reorder the disk io, and thus
they get faster with more processes/threads.



--
Manfred
P.S.: if you prefer to look at the source, then compare Linux-fork and
Linux-multithread.



Re: [Patch] shm bug introduced with pagecache in 2.3.11
Linus Torvalds (torvalds@transmeta.com)
Thu, 18 Nov 1999 19:29:15 -0800 (PST) 

On Sat, 13 Nov 1999, Manfred Spraul wrote:
>
> Computer: K6-200, 128 MB Ram, Symbios 810 scsi controller, Fujitsu
> Magneto-Optical drive, 620 MB [I have no empty scsi disc left :(],
> 620,000,000 bytes test file, fat filesystem, the same disk is used for
> NT and Linux.

Re-do this without the ridiculous filesystem, and I'll bother to even
check the numbers.


That said, I don't think this can/will be fixed for a 2.4 timeframe,
especially as I haven't heard of any real-life usage where it would be an
issue..


Linus




Re: [Patch] shm bug introduced with pagecache in 2.3.11
Alan Cox (alan@lxorguk.ukuu.org.uk)
Fri, 19 Nov 1999 12:33:55 +0000 (GMT) 

> That said, I don't think this can/will be fixed for a 2.4 timeframe,
> especially as I haven't heard of any real-life usage where it would be an
> issue..

News servers like Typhoon , high performance threaded web servers (eg Zeus)


Fortunately these guys tend to be using pretty serious I/O subsystems not
M/O disks and they are fine with 2.2.


Alan





Re: [Patch] shm bug introduced with pagecache in 2.3.11
Manfred Spraul (manfreds@colorfullife.com)
Fri, 19 Nov 1999 15:36:54 +0100 

Alan Cox wrote:
> 
> > That said, I don't think this can/will be fixed for a 2.4 timeframe,
> > especially as I haven't heard of any real-life usage where it would be an
> > issue..
> 
> News servers like Typhoon , high performance threaded web servers (eg Zeus)
> 

Do you know if they are using mmap?

>
> Fortunately these guys tend to be using pretty serious I/O subsystems not
> M/O disks and they are fine with 2.2.
> 

I did a second test with a faster disk (SCSI-2-narrow 4.5 GB seagate),
and the results were nearly identical: the mmap semaphore kills around
33% performance if I compare 64 threads with 64 processes. (33% slower
or 50% faster, depending on your point of view)
Please note that the test is extremely I/O bound, ie I defeat read-ahead
with a RNG, and I only read one byte in every page, and the file is far
larger than available memory.

I'll try to find a faster drive (I had somewhere an old 10kRPM wide
SCSI drive), but I would be surprised if the performance drop would be <
30%.

--
Manfred
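
The access pattern described above (random page order to defeat read-ahead, one byte read per page) can be sketched in a few lines of C. This is not a reconstruction of pagein.cpp: it is single-threaded, does no timing, and `make_scratch_file`, `touch_random_pages`, the /tmp path and the LCG constants are all assumptions made for illustration:

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Create an already-unlinked scratch file of `pages` pages.
 * Returns its file descriptor, or -1 on failure. */
static int make_scratch_file(size_t pages)
{
    char path[] = "/tmp/pagein-sketch-XXXXXX";
    long page_size = sysconf(_SC_PAGESIZE);
    int fd = mkstemp(path);

    if (fd < 0)
        return -1;
    unlink(path);
    if (page_size <= 0 ||
        ftruncate(fd, (off_t)(pages * (size_t)page_size)) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Map the file and read one byte from `touches` pseudo-randomly chosen
 * pages. Returns the number of pages touched, or -1 on failure. */
static long touch_random_pages(int fd, size_t file_pages, long touches,
                               uint32_t seed)
{
    long page_size = sysconf(_SC_PAGESIZE);
    volatile const unsigned char *bytes;
    size_t len;
    void *map;
    long i;

    if (page_size <= 0 || file_pages == 0 || touches < 0)
        return -1;
    len = file_pages * (size_t)page_size;
    map = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED)
        return -1;
    bytes = map;
    for (i = 0; i < touches; i++) {
        size_t page;

        seed = seed * 1664525u + 1013904223u;   /* one LCG step */
        page = (seed >> 8) % file_pages;
        (void)bytes[page * (size_t)page_size];  /* volatile read: one byte */
    }
    munmap(map, len);
    return touches;
}
```

A real measurement would use a file far larger than available memory, run the loop from several threads or processes at once, and divide the number of touches by the elapsed time.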


Re: [Patch] shm bug introduced with pagecache in 2.3.11
Alan Cox (alan@lxorguk.ukuu.org.uk)
Fri, 19 Nov 1999 14:40:15 +0000 (GMT) 

> > News servers like Typhoon , high performance threaded web servers (eg Zeus)
> 
> Do you know if they are using mmap?

Yes. Typhoon uses threaded mmap so aggressively it became an unintentional
test suite for the Linux mm layer, and in 2.0/2.1 it found a lot of bugs.


> Please note that the test is extremely I/O bound, ie I defeat read-ahead
> with a RNG, and I only read one byte in every page, and the file is far
> larger than available memory.


I would expect Typhoon to show some reasonably sane locality




Re: [Patch] shm bug introduced with pagecache in 2.3.11
Linus Torvalds (torvalds@transmeta.com)
Fri, 19 Nov 1999 11:25:45 -0800 (PST) 

On Fri, 19 Nov 1999, Alan Cox wrote:
> 
> News servers like Typhoon , high performance threaded web servers (eg Zeus)
> 
> Fortunately these guys tend to be using pretty serious I/O subsystems not
> M/O disks and they are fine with 2.2.

Well, the more I look at a read-write semaphore, the more I like it: it
looks like something that once the semaphore implementation itself was
done, the MM side would be absolutely trivial. It does introduce a new
issue (multiple threads updating the page tables at the same time), but
that one doesn't look that horrible..

We don't ever export the page table handling to the low-level filesystems
any more (we used to a long time ago: the nopage() function got to touch
the page tables itself rather than just return the right page), so fixing
up the new issue is actually a very local fix in mm/memory.c.

Is anybody willing to take a stab at creating a read-write semaphore?

Linus
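
The property that matters for a read-write semaphore is that any number of readers (presumably page faults here) may hold it together, while a writer excludes everyone. The sketch below shows only that counting rule with C11 atomics. It is a try-lock that never sleeps or wakes anybody, so it is not a semaphore and not the kernel implementation; the type name, the function names and the use of the top bit as a writer flag are assumptions for illustration:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RW_WRITER 0x80000000u   /* top bit set: a writer holds the lock */

typedef struct {
    _Atomic uint32_t state;     /* low bits: number of active readers */
} rw_sketch;

/* Succeeds unless a writer currently holds the lock. */
static bool rw_read_trylock(rw_sketch *l)
{
    uint32_t old = atomic_load(&l->state);

    while (!(old & RW_WRITER)) {
        /* On failure `old` is refreshed and the writer bit is rechecked. */
        if (atomic_compare_exchange_weak(&l->state, &old, old + 1))
            return true;
    }
    return false;
}

static void rw_read_unlock(rw_sketch *l)
{
    atomic_fetch_sub(&l->state, 1);
}

/* Succeeds only when there are no readers and no writer. */
static bool rw_write_trylock(rw_sketch *l)
{
    uint32_t expected = 0;

    return atomic_compare_exchange_strong(&l->state, &expected, RW_WRITER);
}

static void rw_write_unlock(rw_sketch *l)
{
    atomic_store(&l->state, 0);
}
```

A sleeping version would additionally need wait queues and a wake-up path for contended callers, which is where most of the real work lies.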


Re: [Patch] shm bug introduced with pagecache in 2.3.11
Marcelo Tosatti (marc...@conectiva.com.br)
Sat, 20 Nov 1999 09:40:07 -0200 (BRDT)

> > 
> > News servers like Typhoon , high performance threaded web servers (eg Zeus)
> > 
> > Fortunately these guys tend to be using pretty serious I/O subsystems not
> > M/O disks and they are fine with 2.2.
> 
> Well, the more I look at a read-write semaphore, the more I like it: it
> looks like something that once the semaphore implementation itself was
> done, the MM side would be absolutely trivial. It does introduce a new
> issue (multiple threads updating the page tables at the same time), but
> that one doesn't look that horrible..
> 
> We don't ever export the page table handling to the low-level filesystems
> any more (we used to a long time ago: the nopage() function got to touch
> the page tables itself rather than just return the right page), so fixing
> up the new issue is actually a very local fix in mm/memory.c.
> 
> Is anybody willing to take a stab at creating a read-write semaphore?
> 
> 		Linus
http://bazar.conectiva.com.br/~marcelo/rwsem-2.3.18ac7.patch
This code is a Linux "port" of the pseudo-code implementation found in the
"Unix Kernel Internals" book I wrote some time ago. The patch also
modifies the "uts_sem" semaphore in kernel/sys.c to a rw semaphore.
I've not tested it extensively, so there might be ugly bugs/races.
Any constructive comments/bug reports are welcome.


  - Marcelo




Re: [Patch] shm bug introduced with pagecache in 2.3.11
Linus Torvalds (torvalds@transmeta.com)
Sat, 20 Nov 1999 16:33:49 -0800 (PST) 

On Sat, 20 Nov 1999, Marcelo Tosatti wrote:
>
> http://bazar.conectiva.com.br/~marcelo/rwsem-2.3.18ac7.patch
> This code is a Linux "port" of the pseudo-code implementation found in the
> "Unix Kernel Internals" book I wrote some time ago.

Well, if it's a port of that, then it won't have the 2-instruction
fast-path that is pretty much required, imho.

I'll see if I can get a free afternoon some day and try to port the
current x86 semaphore code over to a rw version too. The plan was
something like this:

- read_down():

lock ; incl mem
js contention_rw

- read_up():

lock ; decl mem
js wake_up_writer

- write_down():

lock ; btsl $31,mem
jc contention_ww
testl $0x7fffffff,mem
jne contention_wr

- write_up():

lock ; andl $0x7fffffff,mem
jne wake_up_reader_or_writer

where all the three contention cases grab a "contention spinlock" before
they then start sorting things out. The only interesting part is making
sure that the contention case gets the wakeups, and the above counts on:

- if a writer is waiting for readers (contention_wr), then the writer
will have already set the high bit, and a reader will know to wake it
up because the rw-semaphore value will be negative when it does
read_up().

- if a reader is waiting for a writer, then the reader will have
incremented the semaphore, and the writer will know to wake it up
because the semaphore value won't be zero after the "write_up()".

- if a writer is waiting for another writer (contention_ww case), it will
have to increment the "reader" part of the semaphore value, in order to
get the other writer to wake it up on "write_up()".

All other races should be trivially handled by just having the spinlock,
so the only really hard cases are the fast-path stuff where we cannot get
the spinlock because it is too expensive.

Does anybody see any holes in the above pseudo-implementation? Please take
a look at the way the current x86 semaphores are implemented: they use
exactly the above kinds of single-atomic-instruction-plus-condition-codes
trickery to get the non-contention case without _any_ extra instructions.

Linus
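
The four fast paths sketched in the message above can be modelled in
portable C for experimentation. This is only a minimal sketch under stated
assumptions, not the kernel implementation: the names rwsem_sketch,
rw_path and the RW_* constants are hypothetical, C11 atomics stand in for
the lock-prefixed x86 instructions, and the contention/wakeup slow paths
are reduced to a return code saying which one would be entered.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define RW_WRITER_BIT 0x80000000u   /* bit 31: a writer holds or wants the lock */
#define RW_COUNT_MASK 0x7fffffffu   /* low 31 bits: readers (plus waiters)      */

/* Which path the caller would take; all names here are hypothetical. */
enum rw_path {
    RW_FAST,            /* uncontended, nothing more to do          */
    RW_CONTENTION_RW,   /* reader found a writer                    */
    RW_CONTENTION_WW,   /* writer found another writer              */
    RW_CONTENTION_WR,   /* writer found active readers              */
    RW_WAKE_WRITER,     /* read_up() must wake a waiting writer     */
    RW_WAKE_ANY         /* write_up() must wake a reader or writer  */
};

struct rwsem_sketch {
    _Atomic uint32_t count;
};

/* Models "lock ; incl mem / js contention_rw". */
static enum rw_path read_down(struct rwsem_sketch *s)
{
    uint32_t v = atomic_fetch_add(&s->count, 1u) + 1u;
    return (v & RW_WRITER_BIT) ? RW_CONTENTION_RW : RW_FAST;
}

/* Models "lock ; decl mem / js wake_up_writer". */
static enum rw_path read_up(struct rwsem_sketch *s)
{
    uint32_t before = atomic_fetch_sub(&s->count, 1u);
    assert((before & RW_COUNT_MASK) != 0);  /* must pair with a read_down() */
    return ((before - 1u) & RW_WRITER_BIT) ? RW_WAKE_WRITER : RW_FAST;
}

/*
 * Models "lock ; btsl $31,mem / jc ... / testl $0x7fffffff,mem / jne ...".
 * The sketch tests the value returned by the atomic OR instead of
 * re-reading memory as the separate testl would.
 */
static enum rw_path write_down(struct rwsem_sketch *s)
{
    uint32_t old = atomic_fetch_or(&s->count, RW_WRITER_BIT);
    if (old & RW_WRITER_BIT)
        return RW_CONTENTION_WW;
    if (old & RW_COUNT_MASK)
        return RW_CONTENTION_WR;
    return RW_FAST;
}

/* Models "lock ; andl $0x7fffffff,mem / jne wake_up_reader_or_writer". */
static enum rw_path write_up(struct rwsem_sketch *s)
{
    uint32_t old = atomic_fetch_and(&s->count, RW_COUNT_MASK);
    return (old & RW_COUNT_MASK) ? RW_WAKE_ANY : RW_FAST;
}
```

Calling read_down(), then write_down(), then read_up() on a zeroed
rwsem_sketch walks through the contention_wr case: the writer sets the
high bit while a reader is active, and the reader's release then sees a
negative value and reports that a writer needs waking.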


From: Andrea Arcangeli <and...@suse.de>
Subject: Re: [Patch] shm bug introduced with pagecache in 2.3.11
Date: 1999/11/25
Message-ID: <fa.jrengav.80uk26@ifi.uio.no>#1/1
X-Deja-AN: 552934330
Original-Date: Thu, 25 Nov 1999 14:33:51 +0100 (CET)
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <Pine.LNX.4.10.9911250353370.21876-100000@alpha.random>
References: <fa.oa9df7v.ika7b9@ifi.uio.no>
X-PGP-Key-URL: http://e-mind.com/~andrea/aa.asc
To: Linus Torvalds <torva...@transmeta.com>
X-Sender: and...@alpha.random
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Orcpt: rfc822;linux-kernel-outgoing-dig
Organization: Internet mailing list
MIME-Version: 1.0
X-GnuPG-Key-URL: http://e-mind.com/~andrea/aa.gnupg.asc
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

On Fri, 19 Nov 1999, Linus Torvalds wrote:

>Well, the more I look at a read-write semaphore, the more I like it: it
>looks like something that once the semaphore implementation itself was
>done, the MM side would be absolutely trivial. It does introduce a new
>issue (multiple threads updating the page tables at the same time), but
>that one doesn't look that horrible..

If you allow more than one task to fault, for example in the swapin path,
you'll get into trouble, as you can't solve this race cleanly with a
spinlock. That's why I added the semaphore to the shm segments in the
first place.

Only replacing the down() with a read_down() in do_page_fault is _not_
enough. The semaphore is not there only to protect against mmap and vma
changes under us; right now it's there mainly to keep other threads from
faulting under us.

IMHO the semaphore makes a performance difference only with threads doing
paging of mmapped files while fooling readahead. The swapin case is not
interesting IMHO (and we do readahead also for the swapins). Maybe we can
find a way to drop the semaphore in the nopage path. The read semaphore in
do_page_fault does not make much sense to me, as we would need really tricky
code to solve the races by hand without a performance advantage in real life.

Andrea


From: Linus Torvalds <torva...@transmeta.com>
Subject: Re: [Patch] shm bug introduced with pagecache in 2.3.11
Date: 1999/11/25
Message-ID: <fa.obabh7v.gkc5b0@ifi.uio.no>#1/1
X-Deja-AN: 552991565
Original-Date: Thu, 25 Nov 1999 09:20:57 -0800 (PST)
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <Pine.LNX.4.10.9911250913590.4390-100000@penguin.transmeta.com>
References: <fa.jrengav.80uk26@ifi.uio.no>
To: Andrea Arcangeli <and...@suse.de>
X-Authentication-Warning: penguin.transmeta.com: torvalds owned process doing -bs
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Orcpt: rfc822;linux-kernel-outgoing-dig
Organization: Internet mailing list
MIME-Version: 1.0
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

On Thu, 25 Nov 1999, Andrea Arcangeli wrote:
> 
> If you allow more than one task to fault, for example in the swapin path,
> you'll get into trouble, as you can't solve this race cleanly with a
> spinlock. That's why I added the semaphore to the shm segments in the
> first place.

No, you can solve it cleanly by just changing the code: you only really
need to guarantee that the mapping doesn't change under you (that would be
disastrous and very hard to recover from). Somebody else filling in the
page before you is simple to check for.

> Only replacing the down() with a read_down() in do_page_fault is _not_
> enough. The semaphore is not there only to protect against mmap and vma
> changes under us; right now it's there mainly to keep other threads from
> faulting under us.

"mainly" is incorrect. The main protection is to maintain the vma list
sanely, that was always the case (it used to be easy to crash the kernel
by using threads that pagefaulted and mmap'ed at the same time).

Protecting against others paging in is trivial, and in fact we used to do
that as long ago as 1.2.x if I remember correctly (the mm code was very
different back then). The way we used to do that was to remember the
original pte value, and before updating it with the new page that was just
paged in we just check that the pte value hasn't changed.

In 1.2.x that protected us against threads that paged in simultaneously,
and the races introduced by the IO waiting. But it was not enough to
protect against mmap's changing the vma, so we introduced the semaphore in
1.3.x, and because we had the semaphore we could also remove the
optimistic checking.

In 2.3.x, we can use the same trivial approach to protect against threads.
It adds basically no overhead at all - we have to get the spinlock anyway,
and the final check before changing the page tables is basically a single
load and compare.

		Linus
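
The "remember the original pte value, then re-check it under the spinlock"
approach described above can be sketched in userspace C. This is an
illustration only, under stated assumptions: mm_sketch, pte_sketch_t and
install_if_unchanged() are hypothetical names, a C11 atomic_flag stands in
for the kernel spinlock, and a single integer stands in for one page table
entry.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t pte_sketch_t;      /* hypothetical stand-in for pte_t */

struct mm_sketch {
    atomic_flag page_table_lock;    /* hypothetical stand-in for the spinlock */
    pte_sketch_t pte;               /* a single page table entry              */
};

static void mm_sketch_init(struct mm_sketch *mm, pte_sketch_t initial)
{
    atomic_flag_clear(&mm->page_table_lock);
    mm->pte = initial;
}

static void sketch_lock(struct mm_sketch *mm)
{
    /* Spin until the flag was previously clear. */
    while (atomic_flag_test_and_set_explicit(&mm->page_table_lock,
                                             memory_order_acquire))
        ;
}

static void sketch_unlock(struct mm_sketch *mm)
{
    atomic_flag_clear_explicit(&mm->page_table_lock, memory_order_release);
}

/*
 * Install new_pte only if the entry still holds the value that was sampled
 * before the (possibly sleeping) page-in. The final check is one load and
 * one compare under the lock. Returns false if somebody else changed the
 * entry first, in which case the caller drops its page and backs off.
 */
static bool install_if_unchanged(struct mm_sketch *mm,
                                 pte_sketch_t orig, pte_sketch_t new_pte)
{
    bool installed = false;

    assert(mm);
    sketch_lock(mm);
    if (mm->pte == orig) {
        mm->pte = new_pte;
        installed = true;
    }
    sketch_unlock(mm);
    return installed;
}
```

A second caller that sampled the same original value loses the race
harmlessly: its install_if_unchanged() call returns false and the entry
written by the first caller stays in place.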


From: Andrea Arcangeli <and...@suse.de>
Subject: Re: [Patch] shm bug introduced with pagecache in 2.3.11
Date: 1999/11/25
Message-ID: <fa.jqutfav.9g8l2b@ifi.uio.no>#1/1
X-Deja-AN: 553001796
Original-Date: Thu, 25 Nov 1999 18:18:44 +0100 (CET)
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <Pine.LNX.4.10.9911251808140.22916-100000@alpha.random>
References: <fa.obabh7v.gkc5b0@ifi.uio.no>
X-PGP-Key-URL: http://e-mind.com/~andrea/aa.asc
To: Linus Torvalds <torva...@transmeta.com>
X-Sender: and...@alpha.random
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Orcpt: rfc822;linux-kernel-outgoing-dig
Organization: Internet mailing list
MIME-Version: 1.0
X-GnuPG-Key-URL: http://e-mind.com/~andrea/aa.gnupg.asc
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

On Thu, 25 Nov 1999, Linus Torvalds wrote:

>In 2.3.x, we can use the same trivial approach to protect against threads.

For the allocation it is trivial of course (I was just doing that in shm.c).

But I have not been trivially successful in fixing the shm swapin races
with "read pte with spinlock acquired, release the spinlock, reacquire the
spinlock and then check if the pte is changed". That's why I added the
spinlock. The _main_ problem I had is that to swapout we have to grab the
kernel lock and we'll sleep, and so I would need to acquire the spinlocks
in inverse order (deadlock prone). So I gave up and I took the _trivial_
mainstream way of using the semaphore to protect multiple thread accesses
(also for shm.c using a semaphore is less interesting as shm.c can't do
I/O in the nopage operation unless it's a swapin).

I hope I was missing something and that's simpler...

Andrea


From: Andrea Arcangeli <and...@suse.de>
Subject: Re: [Patch] shm bug introduced with pagecache in 2.3.11
Date: 1999/11/25
Message-ID: <fa.jpv1fiv.8g4lq6@ifi.uio.no>#1/1
X-Deja-AN: 553001798
Original-Date: Thu, 25 Nov 1999 18:23:56 +0100 (CET)
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <Pine.LNX.4.10.9911251823040.24875-100000@alpha.random>
References: <fa.jqutfav.9g8l2b@ifi.uio.no>
X-PGP-Key-URL: http://e-mind.com/~andrea/aa.asc
To: Linus Torvalds <torva...@transmeta.com>
X-Sender: and...@alpha.random
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Orcpt: rfc822;linux-kernel-outgoing-dig
Organization: Internet mailing list
MIME-Version: 1.0
X-GnuPG-Key-URL: http://e-mind.com/~andrea/aa.gnupg.asc
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

On Thu, 25 Nov 1999, Andrea Arcangeli wrote:

>spinlock and then check if the pte is changed". That's why I added the
>spinlock. The _main_ problem I had is that to swapout we have to grab the
 ^^^^^^^^ of course I meant "semaphore" ;)

Andrea



From: Linus Torvalds <torva...@transmeta.com>
Subject: Re: [Patch] shm bug introduced with pagecache in 2.3.11
Date: 1999/11/25
Message-ID: <fa.o99ri6v.iks4b3@ifi.uio.no>#1/1
X-Deja-AN: 553903012
Original-Date: Thu, 25 Nov 1999 09:57:10 -0800 (PST)
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <Pine.LNX.4.10.9911250950110.4390-100000@penguin.transmeta.com>
References: <fa.jqutfav.9g8l2b@ifi.uio.no>
To: Andrea Arcangeli <and...@suse.de>
X-Authentication-Warning: penguin.transmeta.com: torvalds owned process doing -bs
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Orcpt: rfc822;linux-kernel-outgoing-dig
Organization: Internet mailing list
MIME-Version: 1.0
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu



On Thu, 25 Nov 1999, Andrea Arcangeli wrote:
> 
> But I have not been trivially successful in fixing the shm swapin races
> with "read pte with spinlock acquired, release the spinlock, reacquire the
> spinlock and then check if the pte is changed". That's why I added the
> spinlock.

I was planning on just depending on the sanity of the page cache on this
one. Basically we have two cases:
 - paging in something new ("no_page"), for which the final test is just
   to test that the page table is still zero (ie we don't even need to
   save any "original" value).
 - paging in something old ("swap_page"), in which case the final test is
   to check that the pte is still the same as swp_entry_to_pte(entry).

(we have the rw_page case too, but that is already protected by the
spinlock appropriately as far as I can tell, exactly because it already
has the same race wrt page_out rather than page_in).

No, I haven't checked the exact details. Maybe it's worse than I envision,
but it _looks_ like adding a simple spinlock and the test. If the test
fails, we just return and expect the fault to happen again..

		Linus
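
The two final tests listed above differ only in the value the page table
entry is expected to still hold. A minimal sketch of that distinction,
under stated assumptions: fault_kind, finish_fault() and the bit layout in
sketch_swp_entry_to_pte() are hypothetical and chosen for illustration
(the real swp_entry_to_pte() encoding is architecture-specific), and the
caller is assumed to already hold the spinlock protecting the entry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t pte_sketch_t;      /* hypothetical stand-in for pte_t */

enum fault_kind {
    FAULT_NO_PAGE,      /* paging in something new */
    FAULT_SWAP_PAGE     /* paging in something old */
};

/* Hypothetical encoding: entry shifted left, bit 0 ("present") left clear. */
static pte_sketch_t sketch_swp_entry_to_pte(uint64_t entry)
{
    return entry << 1;
}

/* Value the entry must still hold for the fault result to be valid. */
static pte_sketch_t expected_pte(enum fault_kind kind, uint64_t entry)
{
    if (kind == FAULT_NO_PAGE)
        return 0;                           /* still empty: nothing to save */
    return sketch_swp_entry_to_pte(entry);  /* still the same swap entry    */
}

/*
 * Caller is assumed to hold the page table spinlock. Returns true if
 * new_pte was installed. A false return means the entry changed while we
 * were paging in, so the caller just returns and lets the fault happen
 * again.
 */
static bool finish_fault(pte_sketch_t *ptep, enum fault_kind kind,
                         uint64_t entry, pte_sketch_t new_pte)
{
    assert(ptep);
    if (*ptep != expected_pte(kind, entry))
        return false;
    *ptep = new_pte;
    return true;
}
```

In the no_page case the entry argument is unused, which reflects the point
made above that no "original" value needs to be saved there.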

