Fixed the race that was oopsing Linux-2.2.0

From: Andrea Arcangeli <and...@e-mind.com>
Subject: [patch] Fixed the race that was oopsing Linux-2.2.0
Date: 1999/01/27
Message-ID: <fa.ipuq36v.16gg6jj@ifi.uio.no>
X-Deja-AN: 437378208
Original-Date: Wed, 27 Jan 1999 04:02:09 +0100 (CET)
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <Pine.LNX.3.96.990127035109.4014A-100000@laser.bogus>
To: Linus Torvalds <torva...@transmeta.com>
X-Sender: and...@laser.bogus
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Orcpt: rfc822;linux-kernel-outgoing-dig
X-PgP-Public-Key-URL: http://e-mind.com/~andrea/aa.asc
Organization: Internet mailing list
MIME-Version: 1.0
Reply-To: Andrea Arcangeli <and...@e-mind.com>
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

As first the memcpy in dst (removed from pre9 to 2.2.0 final) is needed
because while get_stat and get_stats are running the process could gone
away. If we don't want to memcpy we need the tasklist_lock locked (but
the memcpy looks more obviously right to me, at least at this clock
time...).

The other problem that was harming all people is that mmget() was
incrasing the counter unconditionally. But it could increase it even if it
was just 0. So we could have an mm with count 1 (increased by /proc), but
just dealloced by do_exit() because do_exit run between the check for mm
!= &init_mm and mmget() (in get_mm_and_lock() if I remeber well).

Here the complete fix against 2.2.0 (the fix will be in 2.2.0_arca-3
soon):

Patch

Now it's really time to sleep for me....

Andrea Arcangeli

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/

From: Andrea Arcangeli <and...@e-mind.com>
Subject: Re: [patch] Fixed the race that was oopsing Linux-2.2.0
Date: 1999/01/27
Message-ID: <fa.isg3e6v.1802oh4@ifi.uio.no>#1/1
X-Deja-AN: 437482814
Original-Date: Wed, 27 Jan 1999 11:58:34 +0100 (CET)
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <Pine.LNX.3.96.990127115458.181A-100000@laser.bogus>
References: <fa.ipuq36v.16gg6jj@ifi.uio.no>
To: Linus Torvalds <torva...@transmeta.com>
X-Sender: and...@laser.bogus
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Orcpt: rfc822;linux-kernel-outgoing-dig
X-PgP-Public-Key-URL: http://e-mind.com/~andrea/aa.asc
Organization: Internet mailing list
MIME-Version: 1.0
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

On Wed, 27 Jan 1999, Andrea Arcangeli wrote:

> The other problem that was harming all people is that mmget() was
> incrasing the counter unconditionally. But it could increase it even if it
> was just 0. So we could have an mm with count 1 (increased by /proc), but

Before somebody will notice this: my fix is not the right one (probably
due lack to sleep ;). Incidentally it worked _fine_ because the race is so
fast that the slab cache had not the time to reuse the mm_struct (the one
where I hold the mm_lock even if it's just deallocated) for another task. 

I am fixing it right this time. The fix is to make atomic the tsk->mm =
&init_mm and the mmput.

A real fix is coming soon...

Andrea Arcangeli


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/

From: Andrea Arcangeli <and...@e-mind.com>
Subject: [patch] fixed both processes in D state and the /proc/ oopses 
[Re: [patch] Fixed the race that was oopsing Linux-2.2.0]
Date: 1999/01/27
Message-ID: <fa.ikfbtov.f166gf@ifi.uio.no>
X-Deja-AN: 437503944
Original-Date: Wed, 27 Jan 1999 12:41:28 +0100 (CET)
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <Pine.LNX.3.96.990127123207.15486A-100000@laser.bogus>
References: <fa.isg3e6v.1802oh4@ifi.uio.no>
To: Linus Torvalds <torva...@transmeta.com>
X-Sender: and...@laser.bogus
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Orcpt: rfc822;linux-kernel-outgoing-dig
X-PgP-Public-Key-URL: http://e-mind.com/~andrea/aa.asc
Organization: Internet mailing list
MIME-Version: 1.0
Reply-To: Andrea Arcangeli <and...@e-mind.com>
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

On Wed, 27 Jan 1999, Andrea Arcangeli wrote:

> On Wed, 27 Jan 1999, Andrea Arcangeli wrote:
> 
> > The other problem that was harming all people is that mmget() was
> > incrasing the counter unconditionally. But it could increase it even if it
> > was just 0. So we could have an mm with count 1 (increased by /proc), but
> 
> Before somebody will notice this: my fix is not the right one (probably
> due lack to sleep ;). Incidentally it worked _fine_ because the race is so
> fast that the slab cache had not the time to reuse the mm_struct (the one
> where I hold the mm_lock even if it's just deallocated) for another task. 
> 
> I am fixing it right this time. The fix is to make atomic the tsk->mm =
> &init_mm and the mmput.
> 
> A real fix is coming soon...

The good news is that with the help of Tom Holroyd (that applyed my
sempahore-deadlock-detector and given me back the report) I now know that
the many process in D state that everybody noticed are caused by the
mmget/mmput race too (because the down(&mm->mmap_semp) was done on a
mm_struct that just gone away from some time I think). So this my patch
will fix both the mmap/mmput races and the process stuck in D state
problem. 

Here the trace of the deadlocked process generated by my semaphore
deadlock detector hack (trace produced by Tom): 

[__down_failed+100/184]
[do_bottom_half+124/192]
[generate_oops+0/64]
[do_exit+952/992]
[sprintf+4560/10652]
[get_process_array+96/160]
[get_stat+56/928]
[d_alloc+64/448]
[proc_root_lookup+140/480]
[cached_lookup+28 /160]
[do_follow_link+184/224]
[lookup_dentry+524/704]
[get_process_array+104/160]
[array_read+268/608]
[sys_read+256/320]
[entSys+168/192]

Here the program I writeen last night to reproduce the race. In clean
2.2.0 produce an Oops after max 2 seconds here:

----------------------------------------
#!/bin/sh
# Copyright (C) 1999  Andrea Arcangeli
# crash-2.2.0.sh

while :; do
	free &>/dev/null &
	ps &>/dev/null &
done
----------------------------------------

And here the _real_ fix against clean 2.2.0:

Patch


Tom, could you confirm me that this will fix also the popular process in D
state?

Andrea Arcangeli


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/

From: Linus Torvalds <torva...@transmeta.com>
Subject: Re: [patch] Fixed the race that was oopsing Linux-2.2.0
Date: 1999/01/27
Message-ID: <fa.nikt8ev.1nnulgh@ifi.uio.no>#1/1
X-Deja-AN: 437616789
Original-Date: Wed, 27 Jan 1999 10:09:07 -0800 (PST)
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <Pine.LNX.3.95.990127100753.30467C-100000@penguin.transmeta.com>
References: <fa.ipuq36v.16gg6jj@ifi.uio.no>
To: Andrea Arcangeli <and...@e-mind.com>
X-Authentication-Warning: penguin.transmeta.com: torvalds owned process doing -bs
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Orcpt: rfc822;linux-kernel-outgoing-dig
Organization: Internet mailing list
MIME-Version: 1.0
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu



On Wed, 27 Jan 1999, Andrea Arcangeli wrote:
>
> As first the memcpy in dst (removed from pre9 to 2.2.0 final) is needed
> because while get_stat and get_stats are running the process could gone
> away. If we don't want to memcpy we need the tasklist_lock locked (but
> the memcpy looks more obviously right to me, at least at this clock
> time...).

Don't bother sending me patches that do the old task "grab" operation.
They won't be applied. By applying that patch, you just open yourself up
to overrunning your stack, which is much worse than any normal oops.

Nobody bothered to ask me why that crap was removed from 2.2.0, and
apparently nobody even thought about it.

		Linus


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/

From: Andrea Arcangeli <and...@e-mind.com>
Subject: Re: [patch] Fixed the race that was oopsing Linux-2.2.0
Date: 1999/01/27
Message-ID: <fa.isvrd6v.19gqphb@ifi.uio.no>
X-Deja-AN: 437675799
Original-Date: Wed, 27 Jan 1999 21:14:35 +0100 (CET)
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <Pine.LNX.3.96.990127201355.981A-100000@laser.bogus>
References: <fa.nikt8ev.1nnulgh@ifi.uio.no>
To: Linus Torvalds <torva...@transmeta.com>
X-Sender: and...@laser.bogus
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Orcpt: rfc822;linux-kernel-outgoing-dig
X-PgP-Public-Key-URL: http://e-mind.com/~andrea/aa.asc
Organization: Internet mailing list
MIME-Version: 1.0
Reply-To: Andrea Arcangeli <and...@e-mind.com>
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

On Wed, 27 Jan 1999, Linus Torvalds wrote:

> Don't bother sending me patches that do the old task "grab" operation.
> They won't be applied. By applying that patch, you just open yourself up
> to overrunning your stack, which is much worse than any normal oops.

Are you sure you are not been hurted by the munmap bug that was removing
the pgd if no mmap was present and munmap was run? The /proc code looks
like not too much recursive on the stack (and can't be less recursive for
me than for you)... I had never problems here (but maybe I don't have
enough irq handler running at the same time... ;). 

But yes, obviously I agree to not use the stack. So here is the fixed
patch against 2.2.0. I checked and works fine too.

Patch


Andrea Arcangeli



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/

From: "Stephen C. Tweedie" <s...@redhat.com>
Subject: Re: [patch] fixed both processes in D state and the /proc/ oopses 
[Re: [patch] Fixed the race that was oopsing Linux-2.2.0]
Date: 1999/01/27
Message-ID: <fa.j5h4jtv.gkkd9p@ifi.uio.no>#1/1
X-Deja-AN: 437700908
Original-Date: Wed, 27 Jan 1999 21:38:58 GMT
Sender: owner-linux-ker...@vger.rutgers.edu
Content-Transfer-Encoding: 7bit
Original-Message-Id: <199901272138.VAA12114@dax.scot.redhat.com>
References: <fa.imv3s8v.9ha50f@ifi.uio.no>
To: Andrea Arcangeli <and...@e-mind.com>
Original-References: <Pine.LNX.3.96.990127123207.15486A-100...@laser.bogus> 
<Pine.LNX.3.96.990127131315.19147A-100...@laser.bogus>
Content-Type: text/plain; charset=us-ascii
X-Orcpt: rfc822;linux-kernel-outgoing-dig
Organization: Internet mailing list
MIME-Version: 1.0
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

Hi,

On Wed, 27 Jan 1999 13:15:25 +0100 (CET), Andrea Arcangeli
<and...@e-mind.com> said:

> - * Fixed a race between mmget() and mmput(): added the mm_lock spinlock in
> - * the task_struct to serialize accesses to the tsk->mm field.
> + * Fixed a race between mmget() and mmput(): added the mm_lock spinlock
> + * to serialize accesses to the tsk->mm field.

I don't buy it, because we've seen these on UP machines too.  Besides,
in all of the fork/exit/procfs code paths which look to be relevant to
the reported oopses, we already hold the global kernel lock by the time
we start fiddling with the mm references.  Adding yet another spinlock
should make no difference at all to the locking.

I think the real problem is in the new procfs code in fs/proc/array.c
calling mmget()/mmput() the way it does.  The get_mm_and_lock() function
tries to be safe, but in a couple of places we do not use it, instead
doing something like:

	get_status() {
		tsk = grab_task(pid);
		task_mem() {
			down(&mm->mmap_sem);
			/* Walk the vma tree */
			up(&mm->mmap_sem);
		}
		release_task(tsk);
	}

grab_task() includes

	if (tsk && tsk->mm && tsk->mm != &init_mm)
		mmget(tsk->mm);

and release_task does

	if (tsk != current && tsk->mm && tsk->mm != &init_mm)
		mmput(tsk->mm);

In other words, we are grabbing a task, pinning the mm_struct, blocking
for the mm semaphore and releasing the task's *current* mm_struct, which
is not necessarily the same as it was before we blocked.  In particular,
if we catch the process during a fork+exec, then it is perfectly
possible for tsk->mm to change here.

Note that this was not a problem in the original version of this patch
which just copied the task_struct in grab_task(), because that way we
would always be releasing the same mm that we grabbed.

Can anybody give a good reason why we need to block in array.c at all?
I don't see why we need the down/up(&mm->mmap_sem) code, since we should
already hold the global kernel lock and the vma tree should remain
intact while we inspect it as long as we do not drop that lock.  If we
do keep that lock intact from beginning to end then we cannot run the
risk of tsk->mm changing out from under our feet.

--Stephen

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/

From: Linus Torvalds <torva...@transmeta.com>
Subject: Re: [patch] fixed both processes in D state and the /proc/ oopses 
[Re: [patch] Fixed the race that was oopsing Linux-2.2.0]
Date: 1999/01/27
Message-ID: <fa.ntld9ev.1rnekgi@ifi.uio.no>#1/1
X-Deja-AN: 437700910
Original-Date: Wed, 27 Jan 1999 13:45:22 -0800 (PST)
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <Pine.LNX.3.95.990127134448.30467Z-100000@penguin.transmeta.com>
References: <fa.j5h4jtv.gkkd9p@ifi.uio.no>
To: "Stephen C. Tweedie" <s...@redhat.com>
X-Authentication-Warning: penguin.transmeta.com: torvalds owned process doing -bs
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Orcpt: rfc822;linux-kernel-outgoing-dig
Organization: Internet mailing list
MIME-Version: 1.0
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

On Wed, 27 Jan 1999, Stephen C. Tweedie wrote:
> 
> In other words, we are grabbing a task, pinning the mm_struct, blocking
> for the mm semaphore and releasing the task's *current* mm_struct, which
> is not necessarily the same as it was before we blocked.  In particular,
> if we catch the process during a fork+exec, then it is perfectly
> possible for tsk->mm to change here.

Good point. I reverted part of the array.c patches in the released 2.2.0,
it seems I should really have reverted them all.

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/

From: Andrea Arcangeli <and...@e-mind.com>
Subject: Re: [patch] fixed both processes in D state and the /proc/ oopses 
[Re: [patch] Fixed the race that was oopsing Linux-2.2.0]
Date: 1999/01/28
Message-ID: <fa.iq01d6v.1ag0ph5@ifi.uio.no>#1/1
X-Deja-AN: 437769423
Original-Date: Thu, 28 Jan 1999 02:02:38 +0100 (CET)
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <Pine.LNX.3.96.990128001800.399A-100000@laser.bogus>
References: <fa.j5h4jtv.gkkd9p@ifi.uio.no>
To: "Stephen C. Tweedie" <s...@redhat.com>
X-Sender: and...@laser.bogus
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Orcpt: rfc822;linux-kernel-outgoing-dig
X-PgP-Public-Key-URL: http://e-mind.com/~andrea/aa.asc
Organization: Internet mailing list
MIME-Version: 1.0
Reply-To: Andrea Arcangeli <and...@e-mind.com>
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

On Wed, 27 Jan 1999, Stephen C. Tweedie wrote:

> > + * Fixed a race between mmget() and mmput(): added the mm_lock spinlock
> > + * to serialize accesses to the tsk->mm field.
> 
> I don't buy it, because we've seen these on UP machines too.  Besides,

Yes the comment/credits are completly bogus. Excuse me, I was too tired to
understand this last night (and btw I thought it was not sure if it was
happening also on UP).

> in all of the fork/exit/procfs code paths which look to be relevant to
> the reported oopses, we already hold the global kernel lock by the time
> we start fiddling with the mm references.  Adding yet another spinlock
> should make no difference at all to the locking.

I think this too. My new code made tons of sense to me and since when I
finished all my work everything become rock solid, I posted the thing to
the list. I don't like lock_kernel(), but yes I noticed too that
lock_kernel() should be enough (I noticed it today, last night I had too
much lack of sleep to think more about it).

> 	get_status() {
> 		tsk = grab_task(pid);
> 		task_mem() {
> 			down(&mm->mmap_sem);

My reason to reinsert the memcpy() was different than your one. It's
because sys_wait4 don't hold the kernel lock and does _only_ a
spin_lock_irq(tasklist_lock) and then remove the process from the
tasklist, so if we don't want to read_lock(tasklist_lock)  all the time in
array.c we must do the copy of the tsk, to be sure that the task_struct we
are playing with, still exists. But holding the tasklist_lock all the time
looked not safe to me due the down() in the array.c code...

Maybe I've thought stupid/wrong things but with my whole patch applyed the
kernel become rock solid and race-free. I'm sure of this. Otherwise I
would have not posted so sure of myself ;).

Can somebody tell me _exactly_ what the mmap_sem stays for? The kernel is
~always doing a down on the mmap_sem of the process itself.  It's
_useless_ that way. The only place the kernel is doing a down on another
task seems ptrace.c and fs/proc/array.c, so does we have the mmap_sem only
for handling correctly such two cases? 

And is true as I think that doing a down(current->mm->mmap_sem) is
_useless_ if every other place on the kernel only does only the same?

And finally I reask your question: can we at any time play with the ->mm,
mm->vma, pgd, pmd, pte, of a process without helding any semaphore, only
having the big kernel lock held?

Andrea Arcangeli

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/

From: Andrea Arcangeli <and...@e-mind.com>
Subject: Re: [patch] fixed both processes in D state and the /proc/ oopses 
[Re: [patch] Fixed the race that was oopsing Linux-2.2.0]
Date: 1999/01/28
Message-ID: <fa.itea3mv.150443l@ifi.uio.no>
X-Deja-AN: 437780451
Original-Date: Thu, 28 Jan 1999 03:50:39 +0100 (CET)
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <Pine.LNX.3.96.990128023440.8338A-100000@laser.bogus>
References: <fa.iq01d6v.1ag0ph5@ifi.uio.no>
To: "Stephen C. Tweedie" <s...@redhat.com>
X-Sender: and...@laser.bogus
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Orcpt: rfc822;linux-kernel-outgoing-dig
X-PgP-Public-Key-URL: http://e-mind.com/~andrea/aa.asc
Organization: Internet mailing list
MIME-Version: 1.0
Reply-To: Andrea Arcangeli <and...@e-mind.com>
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

Do you want to know why last night I added a spinlock around mmget/mmput
without thinking twice?  Simply because mm->count was an atomic_t while it
doesn't need to be an atomic_t in first place.

Obviously if it mm->count need to be an atomic_t, we strictly _need_ also
my new mm_lock spinlock in 2.2.1.

So you don't buy my code, but now, I don't buy both all /proc mmget/mmput
sutff and the mm->count atomic_t.

I also avoided all the mmget/mmput stuff in /proc since it _has_ to run
just serialized by the mmget/mmput semaphore (otherwise we need
my mm_lock).

I also removed the semaphore stuff because it seems to me (and to you btw
;) that _all_ places in the mm that does a down(current->mm->mmap_sem),
does always then a lock_kernel().

I also removed all the memcpy, we only need the read_lock(tasklist_lock)
held in SMP because otherwise wait4() could remove the stack of the
process under our eyes as just pointed out in the last email.

Not doing in 2.2.1 the mm->count s/atomic_t/int/ due worry of races will
mean that array.c in 2.2.1 will be not safe enough without my mm_lock
spinlock. Do you understand my point?

I rediffed everything against clean 2.2.0, and I am now running it without
any problem, seems rock solid from a /proc race point of view.

Patch

If you see a race in this my new patch, please let me know and probably
you'll give me a good reason to reinsert mm_lock ;) 

Andrea Arcangeli

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/

From: "Stephen C. Tweedie" <s...@redhat.com>
Subject: Re: [patch] fixed both processes in D state and the /proc/ oopses 
[Re: [patch] Fixed the race that was oopsing Linux-2.2.0]
Date: 1999/01/28
Message-ID: <fa.j310kuv.j4gfps@ifi.uio.no>#1/1
X-Deja-AN: 437955781
Original-Date: Thu, 28 Jan 1999 15:05:41 GMT
Sender: owner-linux-ker...@vger.rutgers.edu
Content-Transfer-Encoding: 7bit
Original-Message-Id: <199901281505.PAA02861@dax.scot.redhat.com>
References: <fa.iq01d6v.1ag0ph5@ifi.uio.no>
To: Andrea Arcangeli <and...@e-mind.com>
Original-References: <199901272138.VAA12...@dax.scot.redhat.com> 
<Pine.LNX.3.96.990128001800.399A-100...@laser.bogus>
Content-Type: text/plain; charset=us-ascii
X-Orcpt: rfc822;linux-kernel-outgoing-dig
Organization: Internet mailing list
MIME-Version: 1.0
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

Hi,

On Thu, 28 Jan 1999 02:02:38 +0100 (CET), Andrea Arcangeli
<and...@e-mind.com> said:

> I think this too. My new code made tons of sense to me and since when I
> finished all my work everything become rock solid, 

Luck!  Fwiw, I was unable to reproduce the problem at all even on 2.2.0
with your test script.

>> get_status() {
>> tsk = grab_task(pid);
>> task_mem() {
>> down(&mm->mmap_sem);

No spinlock change will fix this race (although the memcpy will).

> My reason to reinsert the memcpy() was different than your one. It's
> because sys_wait4 don't hold the kernel lock and does _only_ a
> spin_lock_irq(tasklist_lock) and then remove the process from the
> tasklist, 

OK.

> Maybe I've thought stupid/wrong things but with my whole patch applyed the
> kernel become rock solid and race-free. I'm sure of this. Otherwise I
> would have not posted so sure of myself ;).

No, because the fork/exec race is still obviously present in get_stat,
and because SMP-only synchronisation fixes cannot fix a problem seen on
UP machines.

> Can somebody tell me _exactly_ what the mmap_sem stays for? 

Any modifications or blocking lookups to the mmap structures, and all
places where we add a new page into the process page tables.  That
includes mmap operations and page faults.

> The kernel is ~always doing a down on the mmap_sem of the process
> itself.  It's _useless_ that way.  The only place the kernel is doing
> a down on another task seems ptrace.c and fs/proc/array.c, so does we
> have the mmap_sem only for handling correctly such two cases?

Not at all.  The whole point of the semaphore is to protect the shared
mm when we have multiple threads all mmaping and page faulting
independently within the same mm_struct.  That's why the semaphore is in
the mm_struct, not in the task_struct.

> And finally I reask your question: can we at any time play with the ->mm,
> mm-> vma, pgd, pmd, pte, of a process without helding any semaphore, only
> having the big kernel lock held?

Yes.  Basically, changing an existing pte needs the kernel lock.  Adding
or modifying (but not removing) a new pte or modifying the vma tree
needs the mm semaphore.  We can unmap ptes in swap_out() without the
semaphore but with only the kernel lock.  We can map new anonymous pages
without the kernel lock but with only the semaphore.  Everything else
needs both.

--Stephen

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/

From: "Stephen C. Tweedie" <s...@redhat.com>
Subject: Re: [patch] fixed both processes in D state and the /proc/ oopses 
[Re: [patch] Fixed the race that was oopsing Linux-2.2.0]
Date: 1999/01/28
Message-ID: <fa.j410kuv.i4gfpu@ifi.uio.no>#1/1
X-Deja-AN: 437959449
Original-Date: Thu, 28 Jan 1999 15:09:19 GMT
Sender: owner-linux-ker...@vger.rutgers.edu
Content-Transfer-Encoding: 7bit
Original-Message-Id: <199901281509.PAA02883@dax.scot.redhat.com>
References: <fa.itea3mv.150443l@ifi.uio.no>
To: Andrea Arcangeli <and...@e-mind.com>
Original-References: <Pine.LNX.3.96.990128001800.399A-100...@laser.bogus> 
<Pine.LNX.3.96.990128023440.8338A-100...@laser.bogus>
Content-Type: text/plain; charset=us-ascii
X-Orcpt: rfc822;linux-kernel-outgoing-dig
Organization: Internet mailing list
MIME-Version: 1.0
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

Hi,

On Thu, 28 Jan 1999 03:50:39 +0100 (CET), Andrea Arcangeli
<and...@e-mind.com> said:

> Do you want to know why last night I added a spinlock around mmget/mmput
> without thinking twice?  Simply because mm->count was an atomic_t while it
> doesn't need to be an atomic_t in first place.

Agreed.

> So you don't buy my code, but now, I don't buy both all /proc mmget/mmput
> sutff and the mm->count atomic_t.

Agreed.  mm->count might as well remain atomic because that will help
when we come to apply finer grained locking to the mm, but as far as I
am concerned we may as well drop pretty much all of the mmget/mmput
stuff.  The only race I can still see is the possibility of sys_wait*
removing the task struct while we run on SMP.

> I also removed all the memcpy, we only need the read_lock(tasklist_lock)
> held in SMP because otherwise wait4() could remove the stack of the
> process under our eyes as just pointed out in the last email.

Yep, fine, as long as we keep the tasklist_lock right until the end of
our use of the task struct.

> Not doing in 2.2.1 the mm->count s/atomic_t/int/ due worry of races will
> mean that array.c in 2.2.1 will be not safe enough without my mm_lock
> spinlock. Do you understand my point?

No, because we already have a sufficient spinlock to protect us.

--Stephen

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/

From: Linus Torvalds <torva...@transmeta.com>
Subject: Re: [patch] fixed both processes in D state and the /proc/ oopses 
[Re: [patch] Fixed the race that was oopsing Linux-2.2.0]
Date: 1999/01/28
Message-ID: <fa.nhlf7uv.1lnckgm@ifi.uio.no>#1/1
X-Deja-AN: 437999350
Original-Date: Thu, 28 Jan 1999 09:36:11 -0800 (PST)
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <Pine.LNX.3.95.990128093033.32418B-100000@penguin.transmeta.com>
References: <fa.itea3mv.150443l@ifi.uio.no>
To: Andrea Arcangeli <and...@e-mind.com>
X-Authentication-Warning: penguin.transmeta.com: torvalds owned process doing -bs
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Orcpt: rfc822;linux-kernel-outgoing-dig
Organization: Internet mailing list
MIME-Version: 1.0
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu



On Thu, 28 Jan 1999, Andrea Arcangeli wrote:
>
> Do you want to know why last night I added a spinlock around mmget/mmput
> without thinking twice?  Simply because mm->count was an atomic_t while it
> doesn't need to be an atomic_t in first place.

No. Your argument does not make any sense at all.

Go away, this is not the time to use magic to make kernel patches.

A "atomic_t" _often_ makes sense without having any spinlocks,
_especially_ when used as a reference count. Let me count the ways:

 - when you increase the count because you make a copy, you know that
   you're already a holder of the count, so you don't need any spinlocks
   to protect anything else: you _know_ the area is there.

 - when you decrease the count, you use "atomic_dec_and_test()" because
   you know that the count was > 0, and you know that only _one_ such
   decrementer will get a positive reply for the atomic_dec_and_test. The
   one that is successful doesn't need any locking, because he's now the
   only owner, and should just release everything. 

In short, it has to be atomic, and spinlocks never enter the picture at
all.

Your patches do not make sense. Period. The only thing that makes sense is
to revert the array.c thing to the old one that didn't try to be clever.

		Linus 


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/

From: Linus Torvalds <torva...@transmeta.com>
Subject: Re: [patch] fixed both processes in D state and the /proc/ oopses 
[Re: [patch] Fixed the race that was oopsing Linux-2.2.0]
Date: 1999/01/28
Message-ID: <fa.nk5n8ev.1k74l0n@ifi.uio.no>#1/1
X-Deja-AN: 438007402
Original-Date: Thu, 28 Jan 1999 09:54:07 -0800 (PST)
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <Pine.LNX.3.95.990128095147.32418F-100000@penguin.transmeta.com>
References: <fa.j410kuv.i4gfpu@ifi.uio.no>
To: "Stephen C. Tweedie" <s...@redhat.com>
X-Authentication-Warning: penguin.transmeta.com: torvalds owned process doing -bs
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Orcpt: rfc822;linux-kernel-outgoing-dig
Organization: Internet mailing list
MIME-Version: 1.0
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

On Thu, 28 Jan 1999, Stephen C. Tweedie wrote:

> 
> > Do you want to know why last night I added a spinlock around mmget/mmput
> > without thinking twice?  Simply because mm->count was an atomic_t while it
> > doesn't need to be an atomic_t in first place.
> 
> Agreed.

Incorrect, see my previous email. It may not be strictly necessary right
now due to us probably holding the kernel lock everywhere, but it is
conceptually necessary, and it is _not_ an argument for a spinlock.

The /proc code has to be fixed, but the easy fix is to just revert to the
old one as far as I can see. I shouldn't have accepted the /proc patches
in the first place, and I'm sorry I did.

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/

From: "Stephen C. Tweedie" <s...@redhat.com>
Subject: Re: [patch] fixed both processes in D state and the /proc/ oopses 
[Re: [patch] Fixed the race that was oopsing Linux-2.2.0]
Date: 1999/01/28
Message-ID: <fa.j6hejmv.nked1q@ifi.uio.no>#1/1
X-Deja-AN: 438014626
Original-Date: Thu, 28 Jan 1999 18:07:57 GMT
Sender: owner-linux-ker...@vger.rutgers.edu
Content-Transfer-Encoding: 7bit
Original-Message-Id: <199901281807.SAA03328@dax.scot.redhat.com>
References: <fa.nk5n8ev.1k74l0n@ifi.uio.no>
To: Linus Torvalds <torva...@transmeta.com>
Original-References: <199901281509.PAA02...@dax.scot.redhat.com> 
<Pine.LNX.3.95.990128095147.32418F-100...@penguin.transmeta.com>
Content-Type: text/plain; charset=us-ascii
X-Orcpt: rfc822;linux-kernel-outgoing-dig
Organization: Internet mailing list
MIME-Version: 1.0
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

Hi,

On Thu, 28 Jan 1999 09:54:07 -0800 (PST), Linus Torvalds
<torva...@transmeta.com> said:

> On Thu, 28 Jan 1999, Stephen C. Tweedie wrote:

>> > Do you want to know why last night I added a spinlock around mmget/mmput
>> > without thinking twice?  Simply because mm->count was an atomic_t while it
>> > doesn't need to be an atomic_t in first place.
>> Agreed.

> Incorrect, see my previous email. It may not be strictly necessary right
> now due to us probably holding the kernel lock everywhere, but it is
> conceptually necessary, and it is _not_ an argument for a spinlock.

Linus, we are in violent agreement: see my previous email. :)  I agree
with both you and Andrea that the atomic_t is not strictly necessary,
and agree vigorously that removing it is wrong because it will just make
the job of fine-graining the locking ever more harder.  As we relax the
kernel locks, the atomic_t becomes more and more important.

> The /proc code has to be fixed, but the easy fix is to just revert to the
> old one as far as I can see. I shouldn't have accepted the /proc patches
> in the first place, and I'm sorry I did.

Yep, but Andrea did point out what looks like at least one valid race:
sys_wait* on a zombie task can remove and deallocate the task_struct
without taking the global lock.  Reverting the diff is the right thing
for 2.2.1, but once we've done that we may need to look at either
keeping the task lock until we have finished with the task_struct in
array.c, or doing a memcpy on the task while we still have it locked.
That does seem to be a valid fix.

--Stephen

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/

From: Linus Torvalds <torva...@transmeta.com>
Subject: Re: [patch] fixed both processes in D state and the /proc/ oopses 
[Re: [patch] Fixed the race that was oopsing Linux-2.2.0]
Date: 1999/01/28
Message-ID: <fa.nl4n7ev.1g7ok0k@ifi.uio.no>#1/1
X-Deja-AN: 438014628
Original-Date: Thu, 28 Jan 1999 10:17:37 -0800 (PST)
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <Pine.LNX.3.95.990128101220.32418I-100000@penguin.transmeta.com>
References: <fa.j6hejmv.nked1q@ifi.uio.no>
To: "Stephen C. Tweedie" <s...@redhat.com>
X-Authentication-Warning: penguin.transmeta.com: torvalds owned process doing -bs
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Orcpt: rfc822;linux-kernel-outgoing-dig
Organization: Internet mailing list
MIME-Version: 1.0
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

On Thu, 28 Jan 1999, Stephen C. Tweedie wrote:
> 
> Yep, but Andrea did point out what looks like at least one valid race:
> sys_wait* on a zombie task can remove and deallocate the task_struct
> without taking the global lock.  Reverting the diff is the right thing
> for 2.2.1, but once we've done that we may need to look at either
> keeping the task lock until we have finished with the task_struct in
> array.c, or doing a memcpy on the task while we still have it locked.
> That does seem to be a valid fix.

I'd much rather just use some stale "struct task_struct" data.

What we _might_ do in /proc, is to just increment the usage count for the
(double) page that contains the task structure, so that even if a
release() does happen while we're using the page, the page won't get
re-used, and we can essentially look at all the info as it was when it was
released. 

Note that by the time it has become a zombie, it will have released all
the mm stuff anyway, so we don't even have any dangerous stale mm pointers
that we'd follow: we'd only be looking at that one structure. 

Voila, no memcpy(), no silly locks, no problems. 

Anyway, for 2.2.1 I don't even want to be clever. As it is (with the bogus
array.c race "fixes" removed), the page may get freed without any kernel
lock, and we may return _completely_ bogus information, but that is (a) 
extremely unlikely in the first place and (b) basically harmless and
pretty much impossible to exploit. 

			Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/

From: "Stephen C. Tweedie" <s...@redhat.com>
Subject: Re: [patch] fixed both processes in D state and the /proc/ oopses 
[Re: [patch] Fixed the race that was oopsing Linux-2.2.0]
Date: 1999/01/28
Message-ID: <fa.j51ekdv.h4ec9o@ifi.uio.no>#1/1
X-Deja-AN: 438022252
Original-Date: Thu, 28 Jan 1999 18:25:08 GMT
Sender: owner-linux-ker...@vger.rutgers.edu
Content-Transfer-Encoding: 7bit
Original-Message-Id: <199901281825.SAA03425@dax.scot.redhat.com>
References: <fa.nl4n7ev.1g7ok0k@ifi.uio.no>
To: Linus Torvalds <torva...@transmeta.com>
Original-References: <199901281807.SAA03...@dax.scot.redhat.com> 
<Pine.LNX.3.95.990128101220.32418I-100...@penguin.transmeta.com>
Content-Type: text/plain; charset=us-ascii
X-Orcpt: rfc822;linux-kernel-outgoing-dig
Organization: Internet mailing list
MIME-Version: 1.0
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

Hi,

On Thu, 28 Jan 1999 10:17:37 -0800 (PST), Linus Torvalds
<torva...@transmeta.com> said:

> I'd much rather just use some stale "struct task_struct" data.

The problem isn't the risk of using stale data: it is the risk of using
complete garbage if the task_struct page gets reused.  The procfs code
does check that tsk->mm is non-zero before following the pointers, but
if there is a non-zero address there then it _will_ be dereferenced
regardless.

> What we _might_ do in /proc, is to just increment the usage count for the
> (double) page that contains the task structure,

That would certainly take care of it.

--Stephen

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/

From: Linus Torvalds <torva...@transmeta.com>
Subject: Re: [patch] fixed both processes in D state and the /proc/ oopses 
[Re: [patch] Fixed the race that was oopsing Linux-2.2.0]
Date: 1999/01/28
Message-ID: <fa.ob9le8v.jke0ob@ifi.uio.no>#1/1
X-Deja-AN: 438033828
Original-Date: Thu, 28 Jan 1999 11:11:49 -0800 (PST)
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <Pine.LNX.3.95.990128110737.6130B-100000@penguin.transmeta.com>
References: <fa.k1tlmtv.1b46a3f@ifi.uio.no>
To: Alan Cox <a...@lxorguk.ukuu.org.uk>
X-Authentication-Warning: penguin.transmeta.com: torvalds owned process doing -bs
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Orcpt: rfc822;linux-kernel-outgoing-dig
Organization: Internet mailing list
MIME-Version: 1.0
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu



On Thu, 28 Jan 1999, Alan Cox wrote:
> 
> (c) you can check if the thing has disappeared after using it and clear
> the buffer if so.

Yes, but the problem is that when there _is_ stale data (unlikely), you
can actually do the wrong thing before.

Incrementing the page counter should fix all problems, but it's such a
subtle fix (even though it's essentially just a few one-liners) that I'm
not going to do it for 2.2.1, which I want to get out later today. 

Alan, the only patch I don't have that looks likely for 2.2.1 is the
IDE-SCSI thing. Did you have that somewhere?

Right now my 2.2.1 patches are:
 - the stupid off-by-one bug found by Ingo
 - __down_interruptible on alpha
 - move "esstype" to outside a #ifdef MODULE
 - NFSD rename/rmdir fixes
 - revert to old array.c
 - change comment about __PAGE_OFFSET
 - missing "vma = NULL" case for avl find_vma()

Holler now or forever hold your peace, because I'd like to get the thing
out in a few hours. 

		Linus


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/

From: Andrea Arcangeli <and...@e-mind.com>
Subject: Re: [patch] fixed both processes in D state and the /proc/ oopses 
[Re: [patch] Fixed the race that was oopsing Linux-2.2.0]
Date: 1999/01/28
Message-ID: <fa.irgjdmv.1b0aph8@ifi.uio.no>#1/1
X-Deja-AN: 438098542
Original-Date: Thu, 28 Jan 1999 23:33:03 +0100 (CET)
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <Pine.LNX.3.96.990128232118.797B-100000@laser.bogus>
References: <fa.j6hejmv.nked1q@ifi.uio.no>
To: "Stephen C. Tweedie" <s...@redhat.com>
X-Sender: and...@laser.bogus
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Orcpt: rfc822;linux-kernel-outgoing-dig
X-PgP-Public-Key-URL: http://e-mind.com/~andrea/aa.asc
Organization: Internet mailing list
MIME-Version: 1.0
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

On Thu, 28 Jan 1999, Stephen C. Tweedie wrote:

> Linus, we are in violent agreement: see my previous email. :)  I agree
> with both you and Andrea that the atomic_t is not strictly necessary,
> and agree vigorously that removing it is wrong because it will just make
> the job of fine-graining the locking ever more harder.  As we relax the
> kernel locks, the atomic_t becomes more and more important.

I'm afraid. I still think that it will never be needed even removing
lock_kernel(), because doing that we would need to make atomic the
decreasing of mm->count with current->mm = &init_mm, otherwise we would
not know if we can touch the current->mm of a process at any time. 

It's far from be like the semaphore case where we avoid the spinlock to
not race because we can order our moves differently if we are in down() or
in up(). 

But maybe I am missing something, but it's what I think though.

Andrea Arcangeli

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/

From: Andrea Arcangeli <and...@e-mind.com>
Subject: Re: [patch] fixed both processes in D state and the /proc/ oopses 
[Re: [patch] Fixed the race that was oopsing Linux-2.2.0]
Date: 1999/01/28
Message-ID: <fa.iogfefv.1e0io93@ifi.uio.no>#1/1
X-Deja-AN: 438114655
Original-Date: Thu, 28 Jan 1999 23:04:42 +0100 (CET)
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <Pine.LNX.3.96.990128205918.438B-100000@laser.bogus>
References: <fa.nk5n8ev.1k74l0n@ifi.uio.no>
To: Linus Torvalds <torva...@transmeta.com>
X-Sender: and...@laser.bogus
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Orcpt: rfc822;linux-kernel-outgoing-dig
X-PgP-Public-Key-URL: http://e-mind.com/~andrea/aa.asc
Organization: Internet mailing list
MIME-Version: 1.0
Reply-To: Andrea Arcangeli <and...@e-mind.com>
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

On Thu, 28 Jan 1999, Linus Torvalds wrote:

> Incorrect, see my previous email. It may not be strictly necessary right
> now due to us probably holding the kernel lock everywhere, but it is
> conceptually necessary, and it is _not_ an argument for a spinlock.

If you remove the kernel lock around do_exit() you _need_ my mm_lock
spinlock. You need it to make atomic the decreasing of mm->count and
current->mm = &init_mm. If the two instructions are not atomic you have
_no_ way to know if you can mmget() at any time the mm of a process.

I repeat in another way (just trying to avoid English mistakes): 
decreasing mm->count has to go in sync with updating current->mm,
otherwise you don't know if you can access the mm of a process (and so
also touching mm->count) because the mm of such process could be just been
deallocated (the process could be a zombie).

As far I can see the mm->count will _never_ have 1 reason to be atomic_t
even removing the function lock_kernel() from
/usr/src/linux/include/asm/smplock.h . 

Using an int for mm->count (isntead of atomic_t) is far from be something
of magic. Doing that change it means that I don't expect work by magic,
but I know that it will work fine. It's instead magic (out of the human
reason) how the kernel works now, if you think that we _need_ atomic_t
instead of int. 

It's hard for me to be quiet even if you asked me to go away because I
think you are plain wrong telling that we need mm->count atomic_t for the
future. 

Andrea Arcangeli

PS. You have not hurted me asking me to go away, because I _always_ do the
_best_ I _can_, and if my best is plain wrong I can't change this reality. 
What I care is to do the best I can. Maybe I should really go away and do
something in my sparetime where it's not requested to be clever... I hope
it's not the case though and that you wasn't serious.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/

From: Linus Torvalds <torva...@transmeta.com>
Subject: Re: [patch] fixed both processes in D state and the /proc/ oopses 
[Re: [patch] Fixed the race that was oopsing Linux-2.2.0]
Date: 1999/01/28
Message-ID: <fa.lfrtuiv.d1m6i5@ifi.uio.no>#1/1
X-Deja-AN: 438114658
Original-Date: Thu, 28 Jan 1999 14:53:15 -0800 (PST)
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <Pine.LNX.3.95.990128144808.956H-100000@penguin.transmeta.com>
References: <fa.irgjdmv.1b0aph8@ifi.uio.no>
To: Andrea Arcangeli <and...@e-mind.com>
X-Authentication-Warning: penguin.transmeta.com: torvalds owned process doing -bs
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Orcpt: rfc822;linux-kernel-outgoing-dig
Organization: Internet mailing list
MIME-Version: 1.0
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

On Thu, 28 Jan 1999, Andrea Arcangeli wrote:
> 
> I'm afraid. I still think that it will never be needed even removing
> lock_kernel(), because doing that we would need to make atomic the
> decreasing of mm->count with current->mm = &init_mm, otherwise we would
> not know if we can touch the current->mm of a process at any time. 

What?

We can _always_ touch current->mm, because it can _never_ go away from
under us: it will always have a non-zero count exactly because "current"
has a copy of it.

If you want to touch some _other_ process mm pointer, that's when it gets
interesting. Buyer beware.

> But maybe I am missing something, but it's what I think though.

You're missing the fact that whenever we own the mm, we know that NOBODY
else can do anything bad at all, because nobody else can ever think that
the count is zero. Because we hold a reference count to it.

The _only_ interesting case is when multiple threads do "exit()" at the
same time, and then one of them will be the last one - and that last one
will _know_ he is the last one because that's how atomic_dec_and_tes()
works. And once he knows, he no longer has to care about anybody else,
because he knows that nobody else can touch that mm space any more (or it
wouldn't have decremented the counter to zero in the first place). 

So not only does mm->count need to be atomic in the absense of other
locks, but that atomicity is obviously sufficient for allocation purposes. 

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/

From: Linus Torvalds <torva...@transmeta.com>
Subject: Re: [patch] fixed both processes in D state and the /proc/ oopses 
[Re: [patch] Fixed the race that was oopsing Linux-2.2.0]
Date: 1999/01/29
Message-ID: <fa.lhbru2v.ch87ib@ifi.uio.no>#1/1
X-Deja-AN: 438138149
Original-Date: Thu, 28 Jan 1999 16:17:42 -0800 (PST)
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <Pine.LNX.3.95.990128161635.956I-100000@penguin.transmeta.com>
References: <fa.iogfefv.1e0io93@ifi.uio.no>
To: Andrea Arcangeli <and...@e-mind.com>
X-Authentication-Warning: penguin.transmeta.com: torvalds owned process doing -bs
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Orcpt: rfc822;linux-kernel-outgoing-dig
Organization: Internet mailing list
MIME-Version: 1.0
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu



On Thu, 28 Jan 1999, Andrea Arcangeli wrote:
>
> If you remove the kernel lock around do_exit() you _need_ my mm_lock
> spinlock. You need it to make atomic the decreasing of mm->count and
> current->mm = &init_mm. If the two instructions are not atomic you have
> _no_ way to know if you can mmget() at any time the mm of a process.

Andrea, just go away.

The two do not _have_ to be atomic, they never had to, and they never
_will_ have to be atomic. You obviously haven't read all my email
explaining why they don't have to be atomic.

> I repeat in another way (just trying to avoid English mistakes): 
> decreasing mm->count has to go in sync with updating current->mm,

No it has not.

		Linus


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/

From: Andrea Arcangeli <and...@e-mind.com>
Subject: Re: [patch] fixed both processes in D state and the /proc/ oopses 
[Re: [patch] Fixed the race that was oopsing Linux-2.2.0]
Date: 1999/01/29
Message-ID: <fa.iuuq3uv.16g06bm@ifi.uio.no>#1/1
X-Deja-AN: 438264947
Original-Date: Fri, 29 Jan 1999 02:47:41 +0100 (CET)
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <Pine.LNX.3.96.990129015657.8557A-100000@laser.bogus>
References: <fa.lfrtuiv.d1m6i5@ifi.uio.no>
To: Linus Torvalds <torva...@transmeta.com>
X-Sender: and...@laser.bogus
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Orcpt: rfc822;linux-kernel-outgoing-dig
X-PgP-Public-Key-URL: http://e-mind.com/~andrea/aa.asc
Organization: Internet mailing list
MIME-Version: 1.0
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

On Thu, 28 Jan 1999, Linus Torvalds wrote:

> If you want to touch some _other_ process mm pointer, that's when it gets
> interesting. Buyer beware.

Infact this is the point. I really think you are missing something. I read
your explanation of why we only need atomic_t but it was not touching some
point I instead thought about.

Ok, I assume you are right. Please take this example: I am writing a nice
kernel module that will collect some nice stats from the kernel.

I don't have the big kernel lock held, I consider that also __exit_mm()
won't have the big kernel lock held.

To collect such stats I need to touch all mm_struct pointed by all tsk->mm
in the kernel.

How can do that without racing? (see below)

> You're missing the fact that whenever we own the mm, we know that NOBODY

Sure, once we have owned an mm with mmget() we are safe, but my __only__
point is: how can I run a mmget() being sure that I am _getting_ an mm
that is not just been deallocated from the process that was owning such mm
(and it was the _only_ onwer) that get killed in the meantime on the other
CPU? Now I am sure because __exit_mm() is running with the kernel lock
held.

Another way to tell the same: "how can I be sure that I am doing an
atomic_inc(&mm->count) on a mm->count that was just > 0, and more
important on an mm that is still allocated? "

Now everything is fine of course, because I only need to get the big
kernel lock (and for this reason atomic_t _right_now_ is not needed) but
as you pointed out, some time in the future we could kill the big kernel
lock to scale better in SMP. OK? At that time according to you we'll
_need_ (and we'll _only_ need) the mm->count atomic_t to play with the mm
of other process without risk of races. 

I return to the kernel module stat colletctor example:

To be sure that the kernel stack of the process will not go away under me
I need to held the tasklist_lock. Ok? So i'll do:

	read_lock(&tasklist_lock);
	tsk = find_task_by_pid(pid);
	if (tsk)
	{
		struct page * page = mem_map + MAP_NR(tsk);
		atomic_inc(&page->count);
	}
	read_unlock(&tasklist_lock);
	mdelay(10000000000000);

So now I can wait all time I want and nobody can free and reuse my task
struct and replace it with garbage under my eyes. OK?

Now I want to play with the tsk->mm of the tsk. OK?

I'll do:

	unlock_kernel();
	^^
	if (tsk->mm && tsk->mm != &init_mm)
	{
		mdelay(2000000000000000000);
		mmget();
	}

but between the check `tsk->mm != &init_mm' and mmget() in the other CPU
the task `tsk' could have run __exit_mm()!!! (you remeber, we don't have
the big kernel lock held in __exit_mm() anymore). So mmget() will reput
the mm->count to 1, but there won't be any kind of mm_struct there!!! We
are increasing a random kernel space doing atomic_inc(&mm->count)!!
The mm_sruct is gone away between tsk->mm != &init_mm and mmget(). 

This is the race I seen two night ago. I seen it because I seen some
report on the list that was Oopsing in kmem_cache_free() called by
mmput(). So I didn't thought two times before adding the mm_lock. Do you
see why now?  Obviously I didn't seen the lock_kernel() (or better I seen
it too late and I am been too lazy to remove the mm_lock and since it was
not harming I left it there, as an advance for the future). 

An Oops in mmput() could be explained very well by the race I am trying to
point out to you:

	task 1				task 2
	-----------			---------------
					mm->count == 1
	if (tsk->mm != &init_mm)
	{
		/* interrupt */
					__do_exit()
					{
						tsk->mm = &init_mm
						mmput(); (mm->count == 0)
					}
		mmget() (mm->count == 1 but there' isn't a mm_struct there!!!)
	}

I can't see how _only_ with the mm->count atomic_t we can avoid this
scenario (you said that in the last emails!!!!). Maybe I only need a
loooong sleep.

Excuse me if I am still here (not gone away yet) but such mmget()/mmput() 
race sound not fixable to me without using a mm_lock spinlock (I am
considering to not have the lock_kernel() in __exit_mm() of course).

Do you see the future race now? Do you see why I tell you that we can't be
race free in looking the mm of _another_ task without a spinlock if
__exit_mm() won't own the big kernel lock?

Obviously if we'll serialize mmput/mmget and the setting of the tsk->mm
using a spinlock, we won't need mm->count atomic_t (as now btw).

Do you agree with me now?

Andrea Arcangeli

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/

From: MOLNAR Ingo <mi...@chiara.csoma.elte.hu>
Subject: Re: [patch] fixed both processes in D state and the /proc/ oopses 
[Re: [patch] Fixed the race that was oopsing Linux-2.2.0]
Date: 1999/01/29
Message-ID: <fa.nh9l16v.1ikaj8i@ifi.uio.no>#1/1
X-Deja-AN: 438297317
Original-Date: Fri, 29 Jan 1999 12:20:43 +0100 (CET)
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <Pine.LNX.3.96.990129120839.22453C-100000@chiara.csoma.elte.hu>
References: <fa.iuuq3uv.16g06bm@ifi.uio.no>
To: Andrea Arcangeli <and...@e-mind.com>
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Orcpt: rfc822;linux-kernel-outgoing-dig
Organization: Internet mailing list
MIME-Version: 1.0
Reply-To: MOLNAR Ingo <mi...@chiara.csoma.elte.hu>
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

On Fri, 29 Jan 1999, Andrea Arcangeli wrote:

> Another way to tell the same: "how can I be sure that I am doing an
> atomic_inc(&mm->count) on a mm->count that was just > 0, and more
> important on an mm that is still allocated? "

you are misunderstanding how atomic_inc_and_test() works. The processor
guarantees this. This is the crux of SMP atomic operations. How otherwise
could we reliably build read-write spinlocks.

yes, there is no atomic_inc_and_test() yet. (it's a bit tricky to
implement but pretty much analogous to read-write locks, we probably need
to shift values down by one to get the 'just increased from -1 to 0' event
via the zero flag, and get the 'just decreased from 0 to -1' event via the
sign flag.) Also note that this is all fiction yet because we _are_
holding the kernel lock for these situations in 2.2.

-- mingo

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/

From: Andrea Arcangeli <and...@e-mind.com>
Subject: Re: [patch] fixed both processes in D state and the /proc/ oopses 
[Re: [patch] Fixed the race that was oopsing Linux-2.2.0]
Date: 1999/01/29
Message-ID: <fa.ingjduv.1f0aopd@ifi.uio.no>#1/1
X-Deja-AN: 438311254
Original-Date: Fri, 29 Jan 1999 13:08:28 +0100 (CET)
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <Pine.LNX.3.96.990129124407.639A-100000@laser.bogus>
References: <fa.nh9l16v.1ikaj8i@ifi.uio.no>
To: MOLNAR Ingo <mi...@chiara.csoma.elte.hu>
X-Sender: and...@laser.bogus
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Orcpt: rfc822;linux-kernel-outgoing-dig
X-PgP-Public-Key-URL: http://e-mind.com/~andrea/aa.asc
Organization: Internet mailing list
MIME-Version: 1.0
Reply-To: Andrea Arcangeli <and...@e-mind.com>
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

On Fri, 29 Jan 1999, MOLNAR Ingo wrote:

> yes, there is no atomic_inc_and_test() yet. (it's a bit tricky to

_Where_ do you want to run atomic_inc_and_test()? On random kernel data
where incidentally a long time ago there was the just deallocated
and reused (somewhere else) mm_struct?

If we incidentally access the mm_struct and we notice that the mm->count
is zero it menas we are just buggy. 

> sign flag.) Also note that this is all fiction yet because we _are_
> holding the kernel lock for these situations in 2.2.

Sure it's finction, but _all_ complains I get against my s/atomic_t/int/
was about the future.

Andrea Arcangeli


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/

From: MOLNAR Ingo <mi...@chiara.csoma.elte.hu>
Subject: Re: [patch] fixed both processes in D state and the /proc/ oopses 
[Re: [patch] Fixed the race that was oopsing Linux-2.2.0]
Date: 1999/01/29
Message-ID: <fa.ng9f1mv.1gkgjov@ifi.uio.no>#1/1
X-Deja-AN: 438326114
Original-Date: Fri, 29 Jan 1999 14:19:14 +0100 (CET)
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <Pine.LNX.3.96.990129132505.22453D-100000@chiara.csoma.elte.hu>
References: <fa.ingjduv.1f0aopd@ifi.uio.no>
To: Andrea Arcangeli <and...@e-mind.com>
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Orcpt: rfc822;linux-kernel-outgoing-dig
Organization: Internet mailing list
MIME-Version: 1.0
Reply-To: MOLNAR Ingo <mi...@chiara.csoma.elte.hu>
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

On Fri, 29 Jan 1999, Andrea Arcangeli wrote:

> _Where_ do you want to run atomic_inc_and_test()? On random kernel data

note that in 99% of the cases we need the counter only in the clone(),
exec() and exit() path, for these three cases we know implicitly that it's
a valid buffer. (because we hold a reference to it) [subsequently we dont
need any atomic_inc_and_test thing either for clone+exec+exit] An atomic
counter is just about perfect for those uses, even in the 'no kernel lock'
case. 

I only looked at your last (fairly large) patch, which does not seem to
have _any_ effect at all as far as bugfixes are concerned, except the
array.c change (which again turned out to have nothing to do with the
atomic vs. spinlock thing). 

for 'other process' uses (for cases when we want to dereference an
mm_struct pointer but do not hold a reference to it, eg. /proc or the VM
scanning logic) we will have to do something smart when we remove the
kernel lock in 2.3, but it should not increase the cost of the common case
(clone() + exit()) if possible.

-- mingo

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/

From: Andrea Arcangeli <and...@e-mind.com>
Subject: Re: [patch] fixed both processes in D state and the /proc/ oopses 
[Re: [patch] Fixed the race that was oopsing Linux-2.2.0]
Date: 1999/01/29
Message-ID: <fa.is0bdvv.18guopf@ifi.uio.no>#1/1
X-Deja-AN: 438339708
Original-Date: Fri, 29 Jan 1999 15:14:37 +0100 (CET)
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <Pine.LNX.3.96.990129144844.886A-100000@laser.bogus>
References: <fa.ng9f1mv.1gkgjov@ifi.uio.no>
To: MOLNAR Ingo <mi...@chiara.csoma.elte.hu>
X-Sender: and...@laser.bogus
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Orcpt: rfc822;linux-kernel-outgoing-dig
X-PgP-Public-Key-URL: http://e-mind.com/~andrea/aa.asc
Organization: Internet mailing list
MIME-Version: 1.0
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

On Fri, 29 Jan 1999, MOLNAR Ingo wrote:

> note that in 99% of the cases we need the counter only in the clone(),
> exec() and exit() path, for these three cases we know implicitly that it's
> a valid buffer. (because we hold a reference to it) [subsequently we dont
> need any atomic_inc_and_test thing either for clone+exec+exit] An atomic
> counter is just about perfect for those uses, even in the 'no kernel lock'
> case. 

Sure, if you look at my last email to Linus, you'll see that I am _only_
talking about getting the mm of a random process (not the current one!).

Probably I should comment better my patches, but I have really a big
leakage of spare time these days but I _don't_ want to decrease the kernel
hacking (even if Linus asked me to go away two times).

> I only looked at your last (fairly large) patch, which does not seem to
> have _any_ effect at all as far as bugfixes are concerned, except the
> array.c change (which again turned out to have nothing to do with the
> atomic vs. spinlock thing).

Infact, the only bugfix is array.c (as I just pointed out clearly in
bugtraq). Another reason I didn't either checked about lock_kernel() two
night ago before adding mm_lock, is that it was very late, I had a little
time and I seen some bugreport on the list that was oopsing in mmput (so I
thought "yeah, I seen the race!", I thought it was the mmput/current->mm =
&init_mm race). Then I also seen the mm->count as atomic_t so I thought it
was really the case. 

Ah note, there was overhead also in the overhead since checking for
mm->count == 0 in mmget was a nono ;).

> for 'other process' uses (for cases when we want to dereference an
> mm_struct pointer but do not hold a reference to it, eg. /proc or the VM
> scanning logic) we will have to do something smart when we remove the
> kernel lock in 2.3, but it should not increase the cost of the common case
> (clone() + exit()) if possible.

Ok this is a completly different story and I am the first that don't want
to continue it now, but I want to point out that atomic_t is __far__ to be
the only thing we'll nee to make the mm_struct browsing race free (as I
understood from Linus's word).

I am very very happy to see that I was not (completly ;) crazy.

Andrea Arcangeli

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/

From: Linus Torvalds <torva...@transmeta.com>
Subject: Re: [patch] fixed both processes in D state and the /proc/ oopses 
[Re: [patch] Fixed the race that was oopsing Linux-2.2.0]
Date: 1999/01/29
Message-ID: <fa.njl97ev.1mnik0k@ifi.uio.no>#1/1
X-Deja-AN: 438425282
Original-Date: Fri, 29 Jan 1999 10:24:16 -0800 (PST)
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <Pine.LNX.3.95.990129101917.12610G-100000@penguin.transmeta.com>
References: <fa.iuuq3uv.16g06bm@ifi.uio.no>
To: Andrea Arcangeli <and...@e-mind.com>
X-Authentication-Warning: penguin.transmeta.com: torvalds owned process doing -bs
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Orcpt: rfc822;linux-kernel-outgoing-dig
Organization: Internet mailing list
MIME-Version: 1.0
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu



On Fri, 29 Jan 1999, Andrea Arcangeli wrote:
> 
> > If you want to touch some _other_ process mm pointer, that's when it gets
> > interesting. Buyer beware.
> 
> Infact this is the point. I really think you are missing something. I read
> your explanation of why we only need atomic_t but it was not touching some
> point I instead thought about.
> 
> Ok, I assume you are right. Please take this example: I am writing a nice
> kernel module that will collect some nice stats from the kernel.

And that's where you have problems. You shouldn't do that, and that's why
/proc is such a nasty beast right now. 

If you want to look at other peoples processes, then the onus should be on
_you_ to do all the extra crap that normal processes do not need to do. 
That extra crap can be a number of things, but you shouldn't penalize the
normal path (which is to touch only your own mm space). 

For example, the thing I suspect we'll have to do in the long run for
/proc is:

 - get the process while holding the tasklist lock, and increment the page
   count so that even if the process exists, the page does not get unused.
 - get the kernel lock. Now we know that we're atomic with regard to
   __exit_mm()
 - look at tsk->mm: it it is not init_mm, you're now safe, because
   __exit_mm is called only with the kernel lock held. 

See? No spinlocks.

		Linus


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/