From: Andrea Arcangeli <and...@e-mind.com> Subject: [patch] Fixed the race that was oopsing Linux-2.2.0 Date: 1999/01/27 Message-ID: <fa.ipuq36v.16gg6jj@ifi.uio.no> X-Deja-AN: 437378208 Original-Date: Wed, 27 Jan 1999 04:02:09 +0100 (CET) Sender: owner-linux-ker...@vger.rutgers.edu Original-Message-ID: <Pine.LNX.3.96.990127035109.4014A-100000@laser.bogus> To: Linus Torvalds <torva...@transmeta.com> X-Sender: and...@laser.bogus Content-Type: TEXT/PLAIN; charset=US-ASCII X-Orcpt: rfc822;linux-kernel-outgoing-dig X-PgP-Public-Key-URL: http://e-mind.com/~andrea/aa.asc Organization: Internet mailing list MIME-Version: 1.0 Reply-To: Andrea Arcangeli <and...@e-mind.com> Newsgroups: fa.linux.kernel X-Loop: majord...@vger.rutgers.edu As first the memcpy in dst (removed from pre9 to 2.2.0 final) is needed because while get_stat and get_stats are running the process could gone away. If we don't want to memcpy we need the tasklist_lock locked (but the memcpy looks more obviously right to me, at least at this clock time...). The other problem that was harming all people is that mmget() was incrasing the counter unconditionally. But it could increase it even if it was just 0. So we could have an mm with count 1 (increased by /proc), but just dealloced by do_exit() because do_exit run between the check for mm != &init_mm and mmget() (in get_mm_and_lock() if I remeber well). Here the complete fix against 2.2.0 (the fix will be in 2.2.0_arca-3 soon): Patch Now it's really time to sleep for me.... Andrea Arcangeli - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.rutgers.edu Please read the FAQ at http://www.tux.org/lkml/
From: Andrea Arcangeli <and...@e-mind.com> Subject: Re: [patch] Fixed the race that was oopsing Linux-2.2.0 Date: 1999/01/27 Message-ID: <fa.isg3e6v.1802oh4@ifi.uio.no>#1/1 X-Deja-AN: 437482814 Original-Date: Wed, 27 Jan 1999 11:58:34 +0100 (CET) Sender: owner-linux-ker...@vger.rutgers.edu Original-Message-ID: <Pine.LNX.3.96.990127115458.181A-100000@laser.bogus> References: <fa.ipuq36v.16gg6jj@ifi.uio.no> To: Linus Torvalds <torva...@transmeta.com> X-Sender: and...@laser.bogus Content-Type: TEXT/PLAIN; charset=US-ASCII X-Orcpt: rfc822;linux-kernel-outgoing-dig X-PgP-Public-Key-URL: http://e-mind.com/~andrea/aa.asc Organization: Internet mailing list MIME-Version: 1.0 Newsgroups: fa.linux.kernel X-Loop: majord...@vger.rutgers.edu On Wed, 27 Jan 1999, Andrea Arcangeli wrote: > The other problem that was harming all people is that mmget() was > incrasing the counter unconditionally. But it could increase it even if it > was just 0. So we could have an mm with count 1 (increased by /proc), but Before somebody will notice this: my fix is not the right one (probably due lack to sleep ;). Incidentally it worked _fine_ because the race is so fast that the slab cache had not the time to reuse the mm_struct (the one where I hold the mm_lock even if it's just deallocated) for another task. I am fixing it right this time. The fix is to make atomic the tsk->mm = &init_mm and the mmput. A real fix is coming soon... Andrea Arcangeli - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.rutgers.edu Please read the FAQ at http://www.tux.org/lkml/
From: Andrea Arcangeli <and...@e-mind.com> Subject: [patch] fixed both processes in D state and the /proc/ oopses [Re: [patch] Fixed the race that was oopsing Linux-2.2.0] Date: 1999/01/27 Message-ID: <fa.ikfbtov.f166gf@ifi.uio.no> X-Deja-AN: 437503944 Original-Date: Wed, 27 Jan 1999 12:41:28 +0100 (CET) Sender: owner-linux-ker...@vger.rutgers.edu Original-Message-ID: <Pine.LNX.3.96.990127123207.15486A-100000@laser.bogus> References: <fa.isg3e6v.1802oh4@ifi.uio.no> To: Linus Torvalds <torva...@transmeta.com> X-Sender: and...@laser.bogus Content-Type: TEXT/PLAIN; charset=US-ASCII X-Orcpt: rfc822;linux-kernel-outgoing-dig X-PgP-Public-Key-URL: http://e-mind.com/~andrea/aa.asc Organization: Internet mailing list MIME-Version: 1.0 Reply-To: Andrea Arcangeli <and...@e-mind.com> Newsgroups: fa.linux.kernel X-Loop: majord...@vger.rutgers.edu On Wed, 27 Jan 1999, Andrea Arcangeli wrote: > On Wed, 27 Jan 1999, Andrea Arcangeli wrote: > > > The other problem that was harming all people is that mmget() was > > incrasing the counter unconditionally. But it could increase it even if it > > was just 0. So we could have an mm with count 1 (increased by /proc), but > > Before somebody will notice this: my fix is not the right one (probably > due lack to sleep ;). Incidentally it worked _fine_ because the race is so > fast that the slab cache had not the time to reuse the mm_struct (the one > where I hold the mm_lock even if it's just deallocated) for another task. > > I am fixing it right this time. The fix is to make atomic the tsk->mm = > &init_mm and the mmput. > > A real fix is coming soon... The good news is that with the help of Tom Holroyd (that applyed my sempahore-deadlock-detector and given me back the report) I now know that the many process in D state that everybody noticed are caused by the mmget/mmput race too (because the down(&mm->mmap_semp) was done on a mm_struct that just gone away from some time I think). So this my patch will fix both the mmap/mmput races and the process stuck in D state problem. Here the trace of the deadlocked process generated by my semaphore deadlock detector hack (trace produced by Tom): [__down_failed+100/184] [do_bottom_half+124/192] [generate_oops+0/64] [do_exit+952/992] [sprintf+4560/10652] [get_process_array+96/160] [get_stat+56/928] [d_alloc+64/448] [proc_root_lookup+140/480] [cached_lookup+28 /160] [do_follow_link+184/224] [lookup_dentry+524/704] [get_process_array+104/160] [array_read+268/608] [sys_read+256/320] [entSys+168/192] Here the program I writeen last night to reproduce the race. In clean 2.2.0 produce an Oops after max 2 seconds here: ---------------------------------------- #!/bin/sh # Copyright (C) 1999 Andrea Arcangeli # crash-2.2.0.sh while :; do free &>/dev/null & ps &>/dev/null & done ---------------------------------------- And here the _real_ fix against clean 2.2.0: Patch Tom, could you confirm me that this will fix also the popular process in D state? Andrea Arcangeli - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.rutgers.edu Please read the FAQ at http://www.tux.org/lkml/
From: Linus Torvalds <torva...@transmeta.com> Subject: Re: [patch] Fixed the race that was oopsing Linux-2.2.0 Date: 1999/01/27 Message-ID: <fa.nikt8ev.1nnulgh@ifi.uio.no>#1/1 X-Deja-AN: 437616789 Original-Date: Wed, 27 Jan 1999 10:09:07 -0800 (PST) Sender: owner-linux-ker...@vger.rutgers.edu Original-Message-ID: <Pine.LNX.3.95.990127100753.30467C-100000@penguin.transmeta.com> References: <fa.ipuq36v.16gg6jj@ifi.uio.no> To: Andrea Arcangeli <and...@e-mind.com> X-Authentication-Warning: penguin.transmeta.com: torvalds owned process doing -bs Content-Type: TEXT/PLAIN; charset=US-ASCII X-Orcpt: rfc822;linux-kernel-outgoing-dig Organization: Internet mailing list MIME-Version: 1.0 Newsgroups: fa.linux.kernel X-Loop: majord...@vger.rutgers.edu On Wed, 27 Jan 1999, Andrea Arcangeli wrote: > > As first the memcpy in dst (removed from pre9 to 2.2.0 final) is needed > because while get_stat and get_stats are running the process could gone > away. If we don't want to memcpy we need the tasklist_lock locked (but > the memcpy looks more obviously right to me, at least at this clock > time...). Don't bother sending me patches that do the old task "grab" operation. They won't be applied. By applying that patch, you just open yourself up to overrunning your stack, which is much worse than any normal oops. Nobody bothered to ask me why that crap was removed from 2.2.0, and apparently nobody even thought about it. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.rutgers.edu Please read the FAQ at http://www.tux.org/lkml/
From: Andrea Arcangeli <and...@e-mind.com> Subject: Re: [patch] Fixed the race that was oopsing Linux-2.2.0 Date: 1999/01/27 Message-ID: <fa.isvrd6v.19gqphb@ifi.uio.no> X-Deja-AN: 437675799 Original-Date: Wed, 27 Jan 1999 21:14:35 +0100 (CET) Sender: owner-linux-ker...@vger.rutgers.edu Original-Message-ID: <Pine.LNX.3.96.990127201355.981A-100000@laser.bogus> References: <fa.nikt8ev.1nnulgh@ifi.uio.no> To: Linus Torvalds <torva...@transmeta.com> X-Sender: and...@laser.bogus Content-Type: TEXT/PLAIN; charset=US-ASCII X-Orcpt: rfc822;linux-kernel-outgoing-dig X-PgP-Public-Key-URL: http://e-mind.com/~andrea/aa.asc Organization: Internet mailing list MIME-Version: 1.0 Reply-To: Andrea Arcangeli <and...@e-mind.com> Newsgroups: fa.linux.kernel X-Loop: majord...@vger.rutgers.edu On Wed, 27 Jan 1999, Linus Torvalds wrote: > Don't bother sending me patches that do the old task "grab" operation. > They won't be applied. By applying that patch, you just open yourself up > to overrunning your stack, which is much worse than any normal oops. Are you sure you are not been hurted by the munmap bug that was removing the pgd if no mmap was present and munmap was run? The /proc code looks like not too much recursive on the stack (and can't be less recursive for me than for you)... I had never problems here (but maybe I don't have enough irq handler running at the same time... ;). But yes, obviously I agree to not use the stack. So here is the fixed patch against 2.2.0. I checked and works fine too. Patch Andrea Arcangeli - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.rutgers.edu Please read the FAQ at http://www.tux.org/lkml/
From: "Stephen C. Tweedie" <s...@redhat.com> Subject: Re: [patch] fixed both processes in D state and the /proc/ oopses [Re: [patch] Fixed the race that was oopsing Linux-2.2.0] Date: 1999/01/27 Message-ID: <fa.j5h4jtv.gkkd9p@ifi.uio.no>#1/1 X-Deja-AN: 437700908 Original-Date: Wed, 27 Jan 1999 21:38:58 GMT Sender: owner-linux-ker...@vger.rutgers.edu Content-Transfer-Encoding: 7bit Original-Message-Id: <199901272138.VAA12114@dax.scot.redhat.com> References: <fa.imv3s8v.9ha50f@ifi.uio.no> To: Andrea Arcangeli <and...@e-mind.com> Original-References: <Pine.LNX.3.96.990127123207.15486A-100...@laser.bogus> <Pine.LNX.3.96.990127131315.19147A-100...@laser.bogus> Content-Type: text/plain; charset=us-ascii X-Orcpt: rfc822;linux-kernel-outgoing-dig Organization: Internet mailing list MIME-Version: 1.0 Newsgroups: fa.linux.kernel X-Loop: majord...@vger.rutgers.edu Hi, On Wed, 27 Jan 1999 13:15:25 +0100 (CET), Andrea Arcangeli <and...@e-mind.com> said: > - * Fixed a race between mmget() and mmput(): added the mm_lock spinlock in > - * the task_struct to serialize accesses to the tsk->mm field. > + * Fixed a race between mmget() and mmput(): added the mm_lock spinlock > + * to serialize accesses to the tsk->mm field. I don't buy it, because we've seen these on UP machines too. Besides, in all of the fork/exit/procfs code paths which look to be relevant to the reported oopses, we already hold the global kernel lock by the time we start fiddling with the mm references. Adding yet another spinlock should make no difference at all to the locking. I think the real problem is in the new procfs code in fs/proc/array.c calling mmget()/mmput() the way it does. The get_mm_and_lock() function tries to be safe, but in a couple of places we do not use it, instead doing something like: get_status() { tsk = grab_task(pid); task_mem() { down(&mm->mmap_sem); /* Walk the vma tree */ up(&mm->mmap_sem); } release_task(tsk); } grab_task() includes if (tsk && tsk->mm && tsk->mm != &init_mm) mmget(tsk->mm); and release_task does if (tsk != current && tsk->mm && tsk->mm != &init_mm) mmput(tsk->mm); In other words, we are grabbing a task, pinning the mm_struct, blocking for the mm semaphore and releasing the task's *current* mm_struct, which is not necessarily the same as it was before we blocked. In particular, if we catch the process during a fork+exec, then it is perfectly possible for tsk->mm to change here. Note that this was not a problem in the original version of this patch which just copied the task_struct in grab_task(), because that way we would always be releasing the same mm that we grabbed. Can anybody give a good reason why we need to block in array.c at all? I don't see why we need the down/up(&mm->mmap_sem) code, since we should already hold the global kernel lock and the vma tree should remain intact while we inspect it as long as we do not drop that lock. If we do keep that lock intact from beginning to end then we cannot run the risk of tsk->mm changing out from under our feet. --Stephen - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.rutgers.edu Please read the FAQ at http://www.tux.org/lkml/
From: Linus Torvalds <torva...@transmeta.com> Subject: Re: [patch] fixed both processes in D state and the /proc/ oopses [Re: [patch] Fixed the race that was oopsing Linux-2.2.0] Date: 1999/01/27 Message-ID: <fa.ntld9ev.1rnekgi@ifi.uio.no>#1/1 X-Deja-AN: 437700910 Original-Date: Wed, 27 Jan 1999 13:45:22 -0800 (PST) Sender: owner-linux-ker...@vger.rutgers.edu Original-Message-ID: <Pine.LNX.3.95.990127134448.30467Z-100000@penguin.transmeta.com> References: <fa.j5h4jtv.gkkd9p@ifi.uio.no> To: "Stephen C. Tweedie" <s...@redhat.com> X-Authentication-Warning: penguin.transmeta.com: torvalds owned process doing -bs Content-Type: TEXT/PLAIN; charset=US-ASCII X-Orcpt: rfc822;linux-kernel-outgoing-dig Organization: Internet mailing list MIME-Version: 1.0 Newsgroups: fa.linux.kernel X-Loop: majord...@vger.rutgers.edu On Wed, 27 Jan 1999, Stephen C. Tweedie wrote: > > In other words, we are grabbing a task, pinning the mm_struct, blocking > for the mm semaphore and releasing the task's *current* mm_struct, which > is not necessarily the same as it was before we blocked. In particular, > if we catch the process during a fork+exec, then it is perfectly > possible for tsk->mm to change here. Good point. I reverted part of the array.c patches in the released 2.2.0, it seems I should really have reverted them all. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.rutgers.edu Please read the FAQ at http://www.tux.org/lkml/
From: Andrea Arcangeli <and...@e-mind.com> Subject: Re: [patch] fixed both processes in D state and the /proc/ oopses [Re: [patch] Fixed the race that was oopsing Linux-2.2.0] Date: 1999/01/28 Message-ID: <fa.iq01d6v.1ag0ph5@ifi.uio.no>#1/1 X-Deja-AN: 437769423 Original-Date: Thu, 28 Jan 1999 02:02:38 +0100 (CET) Sender: owner-linux-ker...@vger.rutgers.edu Original-Message-ID: <Pine.LNX.3.96.990128001800.399A-100000@laser.bogus> References: <fa.j5h4jtv.gkkd9p@ifi.uio.no> To: "Stephen C. Tweedie" <s...@redhat.com> X-Sender: and...@laser.bogus Content-Type: TEXT/PLAIN; charset=US-ASCII X-Orcpt: rfc822;linux-kernel-outgoing-dig X-PgP-Public-Key-URL: http://e-mind.com/~andrea/aa.asc Organization: Internet mailing list MIME-Version: 1.0 Reply-To: Andrea Arcangeli <and...@e-mind.com> Newsgroups: fa.linux.kernel X-Loop: majord...@vger.rutgers.edu On Wed, 27 Jan 1999, Stephen C. Tweedie wrote: > > + * Fixed a race between mmget() and mmput(): added the mm_lock spinlock > > + * to serialize accesses to the tsk->mm field. > > I don't buy it, because we've seen these on UP machines too. Besides, Yes the comment/credits are completly bogus. Excuse me, I was too tired to understand this last night (and btw I thought it was not sure if it was happening also on UP). > in all of the fork/exit/procfs code paths which look to be relevant to > the reported oopses, we already hold the global kernel lock by the time > we start fiddling with the mm references. Adding yet another spinlock > should make no difference at all to the locking. I think this too. My new code made tons of sense to me and since when I finished all my work everything become rock solid, I posted the thing to the list. I don't like lock_kernel(), but yes I noticed too that lock_kernel() should be enough (I noticed it today, last night I had too much lack of sleep to think more about it). > get_status() { > tsk = grab_task(pid); > task_mem() { > down(&mm->mmap_sem); My reason to reinsert the memcpy() was different than your one. It's because sys_wait4 don't hold the kernel lock and does _only_ a spin_lock_irq(tasklist_lock) and then remove the process from the tasklist, so if we don't want to read_lock(tasklist_lock) all the time in array.c we must do the copy of the tsk, to be sure that the task_struct we are playing with, still exists. But holding the tasklist_lock all the time looked not safe to me due the down() in the array.c code... Maybe I've thought stupid/wrong things but with my whole patch applyed the kernel become rock solid and race-free. I'm sure of this. Otherwise I would have not posted so sure of myself ;). Can somebody tell me _exactly_ what the mmap_sem stays for? The kernel is ~always doing a down on the mmap_sem of the process itself. It's _useless_ that way. The only place the kernel is doing a down on another task seems ptrace.c and fs/proc/array.c, so does we have the mmap_sem only for handling correctly such two cases? And is true as I think that doing a down(current->mm->mmap_sem) is _useless_ if every other place on the kernel only does only the same? And finally I reask your question: can we at any time play with the ->mm, mm->vma, pgd, pmd, pte, of a process without helding any semaphore, only having the big kernel lock held? Andrea Arcangeli - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.rutgers.edu Please read the FAQ at http://www.tux.org/lkml/
From: Andrea Arcangeli <and...@e-mind.com> Subject: Re: [patch] fixed both processes in D state and the /proc/ oopses [Re: [patch] Fixed the race that was oopsing Linux-2.2.0] Date: 1999/01/28 Message-ID: <fa.itea3mv.150443l@ifi.uio.no> X-Deja-AN: 437780451 Original-Date: Thu, 28 Jan 1999 03:50:39 +0100 (CET) Sender: owner-linux-ker...@vger.rutgers.edu Original-Message-ID: <Pine.LNX.3.96.990128023440.8338A-100000@laser.bogus> References: <fa.iq01d6v.1ag0ph5@ifi.uio.no> To: "Stephen C. Tweedie" <s...@redhat.com> X-Sender: and...@laser.bogus Content-Type: TEXT/PLAIN; charset=US-ASCII X-Orcpt: rfc822;linux-kernel-outgoing-dig X-PgP-Public-Key-URL: http://e-mind.com/~andrea/aa.asc Organization: Internet mailing list MIME-Version: 1.0 Reply-To: Andrea Arcangeli <and...@e-mind.com> Newsgroups: fa.linux.kernel X-Loop: majord...@vger.rutgers.edu Do you want to know why last night I added a spinlock around mmget/mmput without thinking twice? Simply because mm->count was an atomic_t while it doesn't need to be an atomic_t in first place. Obviously if it mm->count need to be an atomic_t, we strictly _need_ also my new mm_lock spinlock in 2.2.1. So you don't buy my code, but now, I don't buy both all /proc mmget/mmput sutff and the mm->count atomic_t. I also avoided all the mmget/mmput stuff in /proc since it _has_ to run just serialized by the mmget/mmput semaphore (otherwise we need my mm_lock). I also removed the semaphore stuff because it seems to me (and to you btw ;) that _all_ places in the mm that does a down(current->mm->mmap_sem), does always then a lock_kernel(). I also removed all the memcpy, we only need the read_lock(tasklist_lock) held in SMP because otherwise wait4() could remove the stack of the process under our eyes as just pointed out in the last email. Not doing in 2.2.1 the mm->count s/atomic_t/int/ due worry of races will mean that array.c in 2.2.1 will be not safe enough without my mm_lock spinlock. Do you understand my point? I rediffed everything against clean 2.2.0, and I am now running it without any problem, seems rock solid from a /proc race point of view. Patch If you see a race in this my new patch, please let me know and probably you'll give me a good reason to reinsert mm_lock ;) Andrea Arcangeli - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.rutgers.edu Please read the FAQ at http://www.tux.org/lkml/
From: "Stephen C. Tweedie" <s...@redhat.com> Subject: Re: [patch] fixed both processes in D state and the /proc/ oopses [Re: [patch] Fixed the race that was oopsing Linux-2.2.0] Date: 1999/01/28 Message-ID: <fa.j310kuv.j4gfps@ifi.uio.no>#1/1 X-Deja-AN: 437955781 Original-Date: Thu, 28 Jan 1999 15:05:41 GMT Sender: owner-linux-ker...@vger.rutgers.edu Content-Transfer-Encoding: 7bit Original-Message-Id: <199901281505.PAA02861@dax.scot.redhat.com> References: <fa.iq01d6v.1ag0ph5@ifi.uio.no> To: Andrea Arcangeli <and...@e-mind.com> Original-References: <199901272138.VAA12...@dax.scot.redhat.com> <Pine.LNX.3.96.990128001800.399A-100...@laser.bogus> Content-Type: text/plain; charset=us-ascii X-Orcpt: rfc822;linux-kernel-outgoing-dig Organization: Internet mailing list MIME-Version: 1.0 Newsgroups: fa.linux.kernel X-Loop: majord...@vger.rutgers.edu Hi, On Thu, 28 Jan 1999 02:02:38 +0100 (CET), Andrea Arcangeli <and...@e-mind.com> said: > I think this too. My new code made tons of sense to me and since when I > finished all my work everything become rock solid, Luck! Fwiw, I was unable to reproduce the problem at all even on 2.2.0 with your test script. >> get_status() { >> tsk = grab_task(pid); >> task_mem() { >> down(&mm->mmap_sem); No spinlock change will fix this race (although the memcpy will). > My reason to reinsert the memcpy() was different than your one. It's > because sys_wait4 don't hold the kernel lock and does _only_ a > spin_lock_irq(tasklist_lock) and then remove the process from the > tasklist, OK. > Maybe I've thought stupid/wrong things but with my whole patch applyed the > kernel become rock solid and race-free. I'm sure of this. Otherwise I > would have not posted so sure of myself ;). No, because the fork/exec race is still obviously present in get_stat, and because SMP-only synchronisation fixes cannot fix a problem seen on UP machines. > Can somebody tell me _exactly_ what the mmap_sem stays for? Any modifications or blocking lookups to the mmap structures, and all places where we add a new page into the process page tables. That includes mmap operations and page faults. > The kernel is ~always doing a down on the mmap_sem of the process > itself. It's _useless_ that way. The only place the kernel is doing > a down on another task seems ptrace.c and fs/proc/array.c, so does we > have the mmap_sem only for handling correctly such two cases? Not at all. The whole point of the semaphore is to protect the shared mm when we have multiple threads all mmaping and page faulting independently within the same mm_struct. That's why the semaphore is in the mm_struct, not in the task_struct. > And finally I reask your question: can we at any time play with the ->mm, > mm-> vma, pgd, pmd, pte, of a process without helding any semaphore, only > having the big kernel lock held? Yes. Basically, changing an existing pte needs the kernel lock. Adding or modifying (but not removing) a new pte or modifying the vma tree needs the mm semaphore. We can unmap ptes in swap_out() without the semaphore but with only the kernel lock. We can map new anonymous pages without the kernel lock but with only the semaphore. Everything else needs both. --Stephen - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.rutgers.edu Please read the FAQ at http://www.tux.org/lkml/
From: "Stephen C. Tweedie" <s...@redhat.com> Subject: Re: [patch] fixed both processes in D state and the /proc/ oopses [Re: [patch] Fixed the race that was oopsing Linux-2.2.0] Date: 1999/01/28 Message-ID: <fa.j410kuv.i4gfpu@ifi.uio.no>#1/1 X-Deja-AN: 437959449 Original-Date: Thu, 28 Jan 1999 15:09:19 GMT Sender: owner-linux-ker...@vger.rutgers.edu Content-Transfer-Encoding: 7bit Original-Message-Id: <199901281509.PAA02883@dax.scot.redhat.com> References: <fa.itea3mv.150443l@ifi.uio.no> To: Andrea Arcangeli <and...@e-mind.com> Original-References: <Pine.LNX.3.96.990128001800.399A-100...@laser.bogus> <Pine.LNX.3.96.990128023440.8338A-100...@laser.bogus> Content-Type: text/plain; charset=us-ascii X-Orcpt: rfc822;linux-kernel-outgoing-dig Organization: Internet mailing list MIME-Version: 1.0 Newsgroups: fa.linux.kernel X-Loop: majord...@vger.rutgers.edu Hi, On Thu, 28 Jan 1999 03:50:39 +0100 (CET), Andrea Arcangeli <and...@e-mind.com> said: > Do you want to know why last night I added a spinlock around mmget/mmput > without thinking twice? Simply because mm->count was an atomic_t while it > doesn't need to be an atomic_t in first place. Agreed. > So you don't buy my code, but now, I don't buy both all /proc mmget/mmput > sutff and the mm->count atomic_t. Agreed. mm->count might as well remain atomic because that will help when we come to apply finer grained locking to the mm, but as far as I am concerned we may as well drop pretty much all of the mmget/mmput stuff. The only race I can still see is the possibility of sys_wait* removing the task struct while we run on SMP. > I also removed all the memcpy, we only need the read_lock(tasklist_lock) > held in SMP because otherwise wait4() could remove the stack of the > process under our eyes as just pointed out in the last email. Yep, fine, as long as we keep the tasklist_lock right until the end of our use of the task struct. > Not doing in 2.2.1 the mm->count s/atomic_t/int/ due worry of races will > mean that array.c in 2.2.1 will be not safe enough without my mm_lock > spinlock. Do you understand my point? No, because we already have a sufficient spinlock to protect us. --Stephen - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.rutgers.edu Please read the FAQ at http://www.tux.org/lkml/
From: Linus Torvalds <torva...@transmeta.com> Subject: Re: [patch] fixed both processes in D state and the /proc/ oopses [Re: [patch] Fixed the race that was oopsing Linux-2.2.0] Date: 1999/01/28 Message-ID: <fa.nhlf7uv.1lnckgm@ifi.uio.no>#1/1 X-Deja-AN: 437999350 Original-Date: Thu, 28 Jan 1999 09:36:11 -0800 (PST) Sender: owner-linux-ker...@vger.rutgers.edu Original-Message-ID: <Pine.LNX.3.95.990128093033.32418B-100000@penguin.transmeta.com> References: <fa.itea3mv.150443l@ifi.uio.no> To: Andrea Arcangeli <and...@e-mind.com> X-Authentication-Warning: penguin.transmeta.com: torvalds owned process doing -bs Content-Type: TEXT/PLAIN; charset=US-ASCII X-Orcpt: rfc822;linux-kernel-outgoing-dig Organization: Internet mailing list MIME-Version: 1.0 Newsgroups: fa.linux.kernel X-Loop: majord...@vger.rutgers.edu On Thu, 28 Jan 1999, Andrea Arcangeli wrote: > > Do you want to know why last night I added a spinlock around mmget/mmput > without thinking twice? Simply because mm->count was an atomic_t while it > doesn't need to be an atomic_t in first place. No. Your argument does not make any sense at all. Go away, this is not the time to use magic to make kernel patches. A "atomic_t" _often_ makes sense without having any spinlocks, _especially_ when used as a reference count. Let me count the ways: - when you increase the count because you make a copy, you know that you're already a holder of the count, so you don't need any spinlocks to protect anything else: you _know_ the area is there. - when you decrease the count, you use "atomic_dec_and_test()" because you know that the count was > 0, and you know that only _one_ such decrementer will get a positive reply for the atomic_dec_and_test. The one that is successful doesn't need any locking, because he's now the only owner, and should just release everything. In short, it has to be atomic, and spinlocks never enter the picture at all. Your patches do not make sense. Period. The only thing that makes sense is to revert the array.c thing to the old one that didn't try to be clever. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.rutgers.edu Please read the FAQ at http://www.tux.org/lkml/
From: Linus Torvalds <torva...@transmeta.com> Subject: Re: [patch] fixed both processes in D state and the /proc/ oopses [Re: [patch] Fixed the race that was oopsing Linux-2.2.0] Date: 1999/01/28 Message-ID: <fa.nk5n8ev.1k74l0n@ifi.uio.no>#1/1 X-Deja-AN: 438007402 Original-Date: Thu, 28 Jan 1999 09:54:07 -0800 (PST) Sender: owner-linux-ker...@vger.rutgers.edu Original-Message-ID: <Pine.LNX.3.95.990128095147.32418F-100000@penguin.transmeta.com> References: <fa.j410kuv.i4gfpu@ifi.uio.no> To: "Stephen C. Tweedie" <s...@redhat.com> X-Authentication-Warning: penguin.transmeta.com: torvalds owned process doing -bs Content-Type: TEXT/PLAIN; charset=US-ASCII X-Orcpt: rfc822;linux-kernel-outgoing-dig Organization: Internet mailing list MIME-Version: 1.0 Newsgroups: fa.linux.kernel X-Loop: majord...@vger.rutgers.edu On Thu, 28 Jan 1999, Stephen C. Tweedie wrote: > > > Do you want to know why last night I added a spinlock around mmget/mmput > > without thinking twice? Simply because mm->count was an atomic_t while it > > doesn't need to be an atomic_t in first place. > > Agreed. Incorrect, see my previous email. It may not be strictly necessary right now due to us probably holding the kernel lock everywhere, but it is conceptually necessary, and it is _not_ an argument for a spinlock. The /proc code has to be fixed, but the easy fix is to just revert to the old one as far as I can see. I shouldn't have accepted the /proc patches in the first place, and I'm sorry I did. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.rutgers.edu Please read the FAQ at http://www.tux.org/lkml/
From: "Stephen C. Tweedie" <s...@redhat.com> Subject: Re: [patch] fixed both processes in D state and the /proc/ oopses [Re: [patch] Fixed the race that was oopsing Linux-2.2.0] Date: 1999/01/28 Message-ID: <fa.j6hejmv.nked1q@ifi.uio.no>#1/1 X-Deja-AN: 438014626 Original-Date: Thu, 28 Jan 1999 18:07:57 GMT Sender: owner-linux-ker...@vger.rutgers.edu Content-Transfer-Encoding: 7bit Original-Message-Id: <199901281807.SAA03328@dax.scot.redhat.com> References: <fa.nk5n8ev.1k74l0n@ifi.uio.no> To: Linus Torvalds <torva...@transmeta.com> Original-References: <199901281509.PAA02...@dax.scot.redhat.com> <Pine.LNX.3.95.990128095147.32418F-100...@penguin.transmeta.com> Content-Type: text/plain; charset=us-ascii X-Orcpt: rfc822;linux-kernel-outgoing-dig Organization: Internet mailing list MIME-Version: 1.0 Newsgroups: fa.linux.kernel X-Loop: majord...@vger.rutgers.edu Hi, On Thu, 28 Jan 1999 09:54:07 -0800 (PST), Linus Torvalds <torva...@transmeta.com> said: > On Thu, 28 Jan 1999, Stephen C. Tweedie wrote: >> > Do you want to know why last night I added a spinlock around mmget/mmput >> > without thinking twice? Simply because mm->count was an atomic_t while it >> > doesn't need to be an atomic_t in first place. >> Agreed. > Incorrect, see my previous email. It may not be strictly necessary right > now due to us probably holding the kernel lock everywhere, but it is > conceptually necessary, and it is _not_ an argument for a spinlock. Linus, we are in violent agreement: see my previous email. :) I agree with both you and Andrea that the atomic_t is not strictly necessary, and agree vigorously that removing it is wrong because it will just make the job of fine-graining the locking ever more harder. As we relax the kernel locks, the atomic_t becomes more and more important. > The /proc code has to be fixed, but the easy fix is to just revert to the > old one as far as I can see. I shouldn't have accepted the /proc patches > in the first place, and I'm sorry I did. Yep, but Andrea did point out what looks like at least one valid race: sys_wait* on a zombie task can remove and deallocate the task_struct without taking the global lock. Reverting the diff is the right thing for 2.2.1, but once we've done that we may need to look at either keeping the task lock until we have finished with the task_struct in array.c, or doing a memcpy on the task while we still have it locked. That does seem to be a valid fix. --Stephen - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.rutgers.edu Please read the FAQ at http://www.tux.org/lkml/
From: Linus Torvalds <torva...@transmeta.com> Subject: Re: [patch] fixed both processes in D state and the /proc/ oopses [Re: [patch] Fixed the race that was oopsing Linux-2.2.0] Date: 1999/01/28 Message-ID: <fa.nl4n7ev.1g7ok0k@ifi.uio.no>#1/1 X-Deja-AN: 438014628 Original-Date: Thu, 28 Jan 1999 10:17:37 -0800 (PST) Sender: owner-linux-ker...@vger.rutgers.edu Original-Message-ID: <Pine.LNX.3.95.990128101220.32418I-100000@penguin.transmeta.com> References: <fa.j6hejmv.nked1q@ifi.uio.no> To: "Stephen C. Tweedie" <s...@redhat.com> X-Authentication-Warning: penguin.transmeta.com: torvalds owned process doing -bs Content-Type: TEXT/PLAIN; charset=US-ASCII X-Orcpt: rfc822;linux-kernel-outgoing-dig Organization: Internet mailing list MIME-Version: 1.0 Newsgroups: fa.linux.kernel X-Loop: majord...@vger.rutgers.edu On Thu, 28 Jan 1999, Stephen C. Tweedie wrote: > > Yep, but Andrea did point out what looks like at least one valid race: > sys_wait* on a zombie task can remove and deallocate the task_struct > without taking the global lock. Reverting the diff is the right thing > for 2.2.1, but once we've done that we may need to look at either > keeping the task lock until we have finished with the task_struct in > array.c, or doing a memcpy on the task while we still have it locked. > That does seem to be a valid fix. I'd much rather just use some stale "struct task_struct" data. What we _might_ do in /proc, is to just increment the usage count for the (double) page that contains the task structure, so that even if a release() does happen while we're using the page, the page won't get re-used, and we can essentially look at all the info as it was when it was released. Note that by the time it has become a zombie, it will have released all the mm stuff anyway, so we don't even have any dangerous stale mm pointers that we'd follow: we'd only be looking at that one structure. Voila, no memcpy(), no silly locks, no problems. Anyway, for 2.2.1 I don't even want to be clever. As it is (with the bogus array.c race "fixes" removed), the page may get freed without any kernel lock, and we may return _completely_ bogus information, but that is (a) extremely unlikely in the first place and (b) basically harmless and pretty much impossible to exploit. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.rutgers.edu Please read the FAQ at http://www.tux.org/lkml/
From: "Stephen C. Tweedie" <s...@redhat.com> Subject: Re: [patch] fixed both processes in D state and the /proc/ oopses [Re: [patch] Fixed the race that was oopsing Linux-2.2.0] Date: 1999/01/28 Message-ID: <fa.j51ekdv.h4ec9o@ifi.uio.no>#1/1 X-Deja-AN: 438022252 Original-Date: Thu, 28 Jan 1999 18:25:08 GMT Sender: owner-linux-ker...@vger.rutgers.edu Content-Transfer-Encoding: 7bit Original-Message-Id: <199901281825.SAA03425@dax.scot.redhat.com> References: <fa.nl4n7ev.1g7ok0k@ifi.uio.no> To: Linus Torvalds <torva...@transmeta.com> Original-References: <199901281807.SAA03...@dax.scot.redhat.com> <Pine.LNX.3.95.990128101220.32418I-100...@penguin.transmeta.com> Content-Type: text/plain; charset=us-ascii X-Orcpt: rfc822;linux-kernel-outgoing-dig Organization: Internet mailing list MIME-Version: 1.0 Newsgroups: fa.linux.kernel X-Loop: majord...@vger.rutgers.edu Hi, On Thu, 28 Jan 1999 10:17:37 -0800 (PST), Linus Torvalds <torva...@transmeta.com> said: > I'd much rather just use some stale "struct task_struct" data. The problem isn't the risk of using stale data: it is the risk of using complete garbage if the task_struct page gets reused. The procfs code does check that tsk->mm is non-zero before following the pointers, but if there is a non-zero address there then it _will_ be dereferenced regardless. > What we _might_ do in /proc, is to just increment the usage count for the > (double) page that contains the task structure, That would certainly take care of it. --Stephen - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.rutgers.edu Please read the FAQ at http://www.tux.org/lkml/
From: Linus Torvalds <torva...@transmeta.com> Subject: Re: [patch] fixed both processes in D state and the /proc/ oopses [Re: [patch] Fixed the race that was oopsing Linux-2.2.0] Date: 1999/01/28 Message-ID: <fa.ob9le8v.jke0ob@ifi.uio.no>#1/1 X-Deja-AN: 438033828 Original-Date: Thu, 28 Jan 1999 11:11:49 -0800 (PST) Sender: owner-linux-ker...@vger.rutgers.edu Original-Message-ID: <Pine.LNX.3.95.990128110737.6130B-100000@penguin.transmeta.com> References: <fa.k1tlmtv.1b46a3f@ifi.uio.no> To: Alan Cox <a...@lxorguk.ukuu.org.uk> X-Authentication-Warning: penguin.transmeta.com: torvalds owned process doing -bs Content-Type: TEXT/PLAIN; charset=US-ASCII X-Orcpt: rfc822;linux-kernel-outgoing-dig Organization: Internet mailing list MIME-Version: 1.0 Newsgroups: fa.linux.kernel X-Loop: majord...@vger.rutgers.edu On Thu, 28 Jan 1999, Alan Cox wrote: > > (c) you can check if the thing has disappeared after using it and clear > the buffer if so. Yes, but the problem is that when there _is_ stale data (unlikely), you can actually do the wrong thing before. Incrementing the page counter should fix all problems, but it's such a subtle fix (even though it's essentially just a few one-liners) that I'm not going to do it for 2.2.1, which I want to get out later today. Alan, the only patch I don't have that looks likely for 2.2.1 is the IDE-SCSI thing. Did you have that somewhere? Right now my 2.2.1 patches are: - the stupid off-by-one bug found by Ingo - __down_interruptible on alpha - move "esstype" to outside a #ifdef MODULE - NFSD rename/rmdir fixes - revert to old array.c - change comment about __PAGE_OFFSET - missing "vma = NULL" case for avl find_vma() Holler now or forever hold your peace, because I'd like to get the thing out in a few hours. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.rutgers.edu Please read the FAQ at http://www.tux.org/lkml/
From: Andrea Arcangeli <and...@e-mind.com> Subject: Re: [patch] fixed both processes in D state and the /proc/ oopses [Re: [patch] Fixed the race that was oopsing Linux-2.2.0] Date: 1999/01/28 Message-ID: <fa.irgjdmv.1b0aph8@ifi.uio.no>#1/1 X-Deja-AN: 438098542 Original-Date: Thu, 28 Jan 1999 23:33:03 +0100 (CET) Sender: owner-linux-ker...@vger.rutgers.edu Original-Message-ID: <Pine.LNX.3.96.990128232118.797B-100000@laser.bogus> References: <fa.j6hejmv.nked1q@ifi.uio.no> To: "Stephen C. Tweedie" <s...@redhat.com> X-Sender: and...@laser.bogus Content-Type: TEXT/PLAIN; charset=US-ASCII X-Orcpt: rfc822;linux-kernel-outgoing-dig X-PgP-Public-Key-URL: http://e-mind.com/~andrea/aa.asc Organization: Internet mailing list MIME-Version: 1.0 Newsgroups: fa.linux.kernel X-Loop: majord...@vger.rutgers.edu On Thu, 28 Jan 1999, Stephen C. Tweedie wrote: > Linus, we are in violent agreement: see my previous email. :) I agree > with both you and Andrea that the atomic_t is not strictly necessary, > and agree vigorously that removing it is wrong because it will just make > the job of fine-graining the locking ever more harder. As we relax the > kernel locks, the atomic_t becomes more and more important. I'm afraid. I still think that it will never be needed even removing lock_kernel(), because doing that we would need to make atomic the decreasing of mm->count with current->mm = &init_mm, otherwise we would not know if we can touch the current->mm of a process at any time. It's far from be like the semaphore case where we avoid the spinlock to not race because we can order our moves differently if we are in down() or in up(). But maybe I am missing something, but it's what I think though. Andrea Arcangeli - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.rutgers.edu Please read the FAQ at http://www.tux.org/lkml/
From: Andrea Arcangeli <and...@e-mind.com> Subject: Re: [patch] fixed both processes in D state and the /proc/ oopses [Re: [patch] Fixed the race that was oopsing Linux-2.2.0] Date: 1999/01/28 Message-ID: <fa.iogfefv.1e0io93@ifi.uio.no>#1/1 X-Deja-AN: 438114655 Original-Date: Thu, 28 Jan 1999 23:04:42 +0100 (CET) Sender: owner-linux-ker...@vger.rutgers.edu Original-Message-ID: <Pine.LNX.3.96.990128205918.438B-100000@laser.bogus> References: <fa.nk5n8ev.1k74l0n@ifi.uio.no> To: Linus Torvalds <torva...@transmeta.com> X-Sender: and...@laser.bogus Content-Type: TEXT/PLAIN; charset=US-ASCII X-Orcpt: rfc822;linux-kernel-outgoing-dig X-PgP-Public-Key-URL: http://e-mind.com/~andrea/aa.asc Organization: Internet mailing list MIME-Version: 1.0 Reply-To: Andrea Arcangeli <and...@e-mind.com> Newsgroups: fa.linux.kernel X-Loop: majord...@vger.rutgers.edu On Thu, 28 Jan 1999, Linus Torvalds wrote: > Incorrect, see my previous email. It may not be strictly necessary right > now due to us probably holding the kernel lock everywhere, but it is > conceptually necessary, and it is _not_ an argument for a spinlock. If you remove the kernel lock around do_exit() you _need_ my mm_lock spinlock. You need it to make atomic the decreasing of mm->count and current->mm = &init_mm. If the two instructions are not atomic you have _no_ way to know if you can mmget() at any time the mm of a process. I repeat in another way (just trying to avoid English mistakes): decreasing mm->count has to go in sync with updating current->mm, otherwise you don't know if you can access the mm of a process (and so also touching mm->count) because the mm of such process could be just been deallocated (the process could be a zombie). As far I can see the mm->count will _never_ have 1 reason to be atomic_t even removing the function lock_kernel() from /usr/src/linux/include/asm/smplock.h . Using an int for mm->count (isntead of atomic_t) is far from be something of magic. Doing that change it means that I don't expect work by magic, but I know that it will work fine. It's instead magic (out of the human reason) how the kernel works now, if you think that we _need_ atomic_t instead of int. It's hard for me to be quiet even if you asked me to go away because I think you are plain wrong telling that we need mm->count atomic_t for the future. Andrea Arcangeli PS. You have not hurted me asking me to go away, because I _always_ do the _best_ I _can_, and if my best is plain wrong I can't change this reality. What I care is to do the best I can. Maybe I should really go away and do something in my sparetime where it's not requested to be clever... I hope it's not the case though and that you wasn't serious. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.rutgers.edu Please read the FAQ at http://www.tux.org/lkml/
From: Linus Torvalds <torva...@transmeta.com> Subject: Re: [patch] fixed both processes in D state and the /proc/ oopses [Re: [patch] Fixed the race that was oopsing Linux-2.2.0] Date: 1999/01/28 Message-ID: <fa.lfrtuiv.d1m6i5@ifi.uio.no>#1/1 X-Deja-AN: 438114658 Original-Date: Thu, 28 Jan 1999 14:53:15 -0800 (PST) Sender: owner-linux-ker...@vger.rutgers.edu Original-Message-ID: <Pine.LNX.3.95.990128144808.956H-100000@penguin.transmeta.com> References: <fa.irgjdmv.1b0aph8@ifi.uio.no> To: Andrea Arcangeli <and...@e-mind.com> X-Authentication-Warning: penguin.transmeta.com: torvalds owned process doing -bs Content-Type: TEXT/PLAIN; charset=US-ASCII X-Orcpt: rfc822;linux-kernel-outgoing-dig Organization: Internet mailing list MIME-Version: 1.0 Newsgroups: fa.linux.kernel X-Loop: majord...@vger.rutgers.edu On Thu, 28 Jan 1999, Andrea Arcangeli wrote: > > I'm afraid. I still think that it will never be needed even removing > lock_kernel(), because doing that we would need to make atomic the > decreasing of mm->count with current->mm = &init_mm, otherwise we would > not know if we can touch the current->mm of a process at any time. What? We can _always_ touch current->mm, because it can _never_ go away from under us: it will always have a non-zero count exactly because "current" has a copy of it. If you want to touch some _other_ process mm pointer, that's when it gets interesting. Buyer beware. > But maybe I am missing something, but it's what I think though. You're missing the fact that whenever we own the mm, we know that NOBODY else can do anything bad at all, because nobody else can ever think that the count is zero. Because we hold a reference count to it. The _only_ interesting case is when multiple threads do "exit()" at the same time, and then one of them will be the last one - and that last one will _know_ he is the last one because that's how atomic_dec_and_tes() works. And once he knows, he no longer has to care about anybody else, because he knows that nobody else can touch that mm space any more (or it wouldn't have decremented the counter to zero in the first place). So not only does mm->count need to be atomic in the absense of other locks, but that atomicity is obviously sufficient for allocation purposes. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.rutgers.edu Please read the FAQ at http://www.tux.org/lkml/
From: Linus Torvalds <torva...@transmeta.com> Subject: Re: [patch] fixed both processes in D state and the /proc/ oopses [Re: [patch] Fixed the race that was oopsing Linux-2.2.0] Date: 1999/01/29 Message-ID: <fa.lhbru2v.ch87ib@ifi.uio.no>#1/1 X-Deja-AN: 438138149 Original-Date: Thu, 28 Jan 1999 16:17:42 -0800 (PST) Sender: owner-linux-ker...@vger.rutgers.edu Original-Message-ID: <Pine.LNX.3.95.990128161635.956I-100000@penguin.transmeta.com> References: <fa.iogfefv.1e0io93@ifi.uio.no> To: Andrea Arcangeli <and...@e-mind.com> X-Authentication-Warning: penguin.transmeta.com: torvalds owned process doing -bs Content-Type: TEXT/PLAIN; charset=US-ASCII X-Orcpt: rfc822;linux-kernel-outgoing-dig Organization: Internet mailing list MIME-Version: 1.0 Newsgroups: fa.linux.kernel X-Loop: majord...@vger.rutgers.edu On Thu, 28 Jan 1999, Andrea Arcangeli wrote: > > If you remove the kernel lock around do_exit() you _need_ my mm_lock > spinlock. You need it to make atomic the decreasing of mm->count and > current->mm = &init_mm. If the two instructions are not atomic you have > _no_ way to know if you can mmget() at any time the mm of a process. Andrea, just go away. The two do not _have_ to be atomic, they never had to, and they never _will_ have to be atomic. You obviously haven't read all my email explaining why they don't have to be atomic. > I repeat in another way (just trying to avoid English mistakes): > decreasing mm->count has to go in sync with updating current->mm, No it has not. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.rutgers.edu Please read the FAQ at http://www.tux.org/lkml/
From: Andrea Arcangeli <and...@e-mind.com> Subject: Re: [patch] fixed both processes in D state and the /proc/ oopses [Re: [patch] Fixed the race that was oopsing Linux-2.2.0] Date: 1999/01/29 Message-ID: <fa.iuuq3uv.16g06bm@ifi.uio.no>#1/1 X-Deja-AN: 438264947 Original-Date: Fri, 29 Jan 1999 02:47:41 +0100 (CET) Sender: owner-linux-ker...@vger.rutgers.edu Original-Message-ID: <Pine.LNX.3.96.990129015657.8557A-100000@laser.bogus> References: <fa.lfrtuiv.d1m6i5@ifi.uio.no> To: Linus Torvalds <torva...@transmeta.com> X-Sender: and...@laser.bogus Content-Type: TEXT/PLAIN; charset=US-ASCII X-Orcpt: rfc822;linux-kernel-outgoing-dig X-PgP-Public-Key-URL: http://e-mind.com/~andrea/aa.asc Organization: Internet mailing list MIME-Version: 1.0 Newsgroups: fa.linux.kernel X-Loop: majord...@vger.rutgers.edu On Thu, 28 Jan 1999, Linus Torvalds wrote: > If you want to touch some _other_ process mm pointer, that's when it gets > interesting. Buyer beware. Infact this is the point. I really think you are missing something. I read your explanation of why we only need atomic_t but it was not touching some point I instead thought about. Ok, I assume you are right. Please take this example: I am writing a nice kernel module that will collect some nice stats from the kernel. I don't have the big kernel lock held, I consider that also __exit_mm() won't have the big kernel lock held. To collect such stats I need to touch all mm_struct pointed by all tsk->mm in the kernel. How can do that without racing? (see below) > You're missing the fact that whenever we own the mm, we know that NOBODY Sure, once we have owned an mm with mmget() we are safe, but my __only__ point is: how can I run a mmget() being sure that I am _getting_ an mm that is not just been deallocated from the process that was owning such mm (and it was the _only_ onwer) that get killed in the meantime on the other CPU? Now I am sure because __exit_mm() is running with the kernel lock held. Another way to tell the same: "how can I be sure that I am doing an atomic_inc(&mm->count) on a mm->count that was just > 0, and more important on an mm that is still allocated? " Now everything is fine of course, because I only need to get the big kernel lock (and for this reason atomic_t _right_now_ is not needed) but as you pointed out, some time in the future we could kill the big kernel lock to scale better in SMP. OK? At that time according to you we'll _need_ (and we'll _only_ need) the mm->count atomic_t to play with the mm of other process without risk of races. I return to the kernel module stat colletctor example: To be sure that the kernel stack of the process will not go away under me I need to held the tasklist_lock. Ok? So i'll do: read_lock(&tasklist_lock); tsk = find_task_by_pid(pid); if (tsk) { struct page * page = mem_map + MAP_NR(tsk); atomic_inc(&page->count); } read_unlock(&tasklist_lock); mdelay(10000000000000); So now I can wait all time I want and nobody can free and reuse my task struct and replace it with garbage under my eyes. OK? Now I want to play with the tsk->mm of the tsk. OK? I'll do: unlock_kernel(); ^^ if (tsk->mm && tsk->mm != &init_mm) { mdelay(2000000000000000000); mmget(); } but between the check `tsk->mm != &init_mm' and mmget() in the other CPU the task `tsk' could have run __exit_mm()!!! (you remeber, we don't have the big kernel lock held in __exit_mm() anymore). So mmget() will reput the mm->count to 1, but there won't be any kind of mm_struct there!!! We are increasing a random kernel space doing atomic_inc(&mm->count)!! The mm_sruct is gone away between tsk->mm != &init_mm and mmget(). This is the race I seen two night ago. I seen it because I seen some report on the list that was Oopsing in kmem_cache_free() called by mmput(). So I didn't thought two times before adding the mm_lock. Do you see why now? Obviously I didn't seen the lock_kernel() (or better I seen it too late and I am been too lazy to remove the mm_lock and since it was not harming I left it there, as an advance for the future). An Oops in mmput() could be explained very well by the race I am trying to point out to you: task 1 task 2 ----------- --------------- mm->count == 1 if (tsk->mm != &init_mm) { /* interrupt */ __do_exit() { tsk->mm = &init_mm mmput(); (mm->count == 0) } mmget() (mm->count == 1 but there' isn't a mm_struct there!!!) } I can't see how _only_ with the mm->count atomic_t we can avoid this scenario (you said that in the last emails!!!!). Maybe I only need a loooong sleep. Excuse me if I am still here (not gone away yet) but such mmget()/mmput() race sound not fixable to me without using a mm_lock spinlock (I am considering to not have the lock_kernel() in __exit_mm() of course). Do you see the future race now? Do you see why I tell you that we can't be race free in looking the mm of _another_ task without a spinlock if __exit_mm() won't own the big kernel lock? Obviously if we'll serialize mmput/mmget and the setting of the tsk->mm using a spinlock, we won't need mm->count atomic_t (as now btw). Do you agree with me now? Andrea Arcangeli - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.rutgers.edu Please read the FAQ at http://www.tux.org/lkml/
From: MOLNAR Ingo <mi...@chiara.csoma.elte.hu> Subject: Re: [patch] fixed both processes in D state and the /proc/ oopses [Re: [patch] Fixed the race that was oopsing Linux-2.2.0] Date: 1999/01/29 Message-ID: <fa.nh9l16v.1ikaj8i@ifi.uio.no>#1/1 X-Deja-AN: 438297317 Original-Date: Fri, 29 Jan 1999 12:20:43 +0100 (CET) Sender: owner-linux-ker...@vger.rutgers.edu Original-Message-ID: <Pine.LNX.3.96.990129120839.22453C-100000@chiara.csoma.elte.hu> References: <fa.iuuq3uv.16g06bm@ifi.uio.no> To: Andrea Arcangeli <and...@e-mind.com> Content-Type: TEXT/PLAIN; charset=US-ASCII X-Orcpt: rfc822;linux-kernel-outgoing-dig Organization: Internet mailing list MIME-Version: 1.0 Reply-To: MOLNAR Ingo <mi...@chiara.csoma.elte.hu> Newsgroups: fa.linux.kernel X-Loop: majord...@vger.rutgers.edu On Fri, 29 Jan 1999, Andrea Arcangeli wrote: > Another way to tell the same: "how can I be sure that I am doing an > atomic_inc(&mm->count) on a mm->count that was just > 0, and more > important on an mm that is still allocated? " you are misunderstanding how atomic_inc_and_test() works. The processor guarantees this. This is the crux of SMP atomic operations. How otherwise could we reliably build read-write spinlocks. yes, there is no atomic_inc_and_test() yet. (it's a bit tricky to implement but pretty much analogous to read-write locks, we probably need to shift values down by one to get the 'just increased from -1 to 0' event via the zero flag, and get the 'just decreased from 0 to -1' event via the sign flag.) Also note that this is all fiction yet because we _are_ holding the kernel lock for these situations in 2.2. -- mingo - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.rutgers.edu Please read the FAQ at http://www.tux.org/lkml/
From: Andrea Arcangeli <and...@e-mind.com> Subject: Re: [patch] fixed both processes in D state and the /proc/ oopses [Re: [patch] Fixed the race that was oopsing Linux-2.2.0] Date: 1999/01/29 Message-ID: <fa.ingjduv.1f0aopd@ifi.uio.no>#1/1 X-Deja-AN: 438311254 Original-Date: Fri, 29 Jan 1999 13:08:28 +0100 (CET) Sender: owner-linux-ker...@vger.rutgers.edu Original-Message-ID: <Pine.LNX.3.96.990129124407.639A-100000@laser.bogus> References: <fa.nh9l16v.1ikaj8i@ifi.uio.no> To: MOLNAR Ingo <mi...@chiara.csoma.elte.hu> X-Sender: and...@laser.bogus Content-Type: TEXT/PLAIN; charset=US-ASCII X-Orcpt: rfc822;linux-kernel-outgoing-dig X-PgP-Public-Key-URL: http://e-mind.com/~andrea/aa.asc Organization: Internet mailing list MIME-Version: 1.0 Reply-To: Andrea Arcangeli <and...@e-mind.com> Newsgroups: fa.linux.kernel X-Loop: majord...@vger.rutgers.edu On Fri, 29 Jan 1999, MOLNAR Ingo wrote: > yes, there is no atomic_inc_and_test() yet. (it's a bit tricky to _Where_ do you want to run atomic_inc_and_test()? On random kernel data where incidentally a long time ago there was the just deallocated and reused (somewhere else) mm_struct? If we incidentally access the mm_struct and we notice that the mm->count is zero it menas we are just buggy. > sign flag.) Also note that this is all fiction yet because we _are_ > holding the kernel lock for these situations in 2.2. Sure it's finction, but _all_ complains I get against my s/atomic_t/int/ was about the future. Andrea Arcangeli - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.rutgers.edu Please read the FAQ at http://www.tux.org/lkml/
From: MOLNAR Ingo <mi...@chiara.csoma.elte.hu> Subject: Re: [patch] fixed both processes in D state and the /proc/ oopses [Re: [patch] Fixed the race that was oopsing Linux-2.2.0] Date: 1999/01/29 Message-ID: <fa.ng9f1mv.1gkgjov@ifi.uio.no>#1/1 X-Deja-AN: 438326114 Original-Date: Fri, 29 Jan 1999 14:19:14 +0100 (CET) Sender: owner-linux-ker...@vger.rutgers.edu Original-Message-ID: <Pine.LNX.3.96.990129132505.22453D-100000@chiara.csoma.elte.hu> References: <fa.ingjduv.1f0aopd@ifi.uio.no> To: Andrea Arcangeli <and...@e-mind.com> Content-Type: TEXT/PLAIN; charset=US-ASCII X-Orcpt: rfc822;linux-kernel-outgoing-dig Organization: Internet mailing list MIME-Version: 1.0 Reply-To: MOLNAR Ingo <mi...@chiara.csoma.elte.hu> Newsgroups: fa.linux.kernel X-Loop: majord...@vger.rutgers.edu On Fri, 29 Jan 1999, Andrea Arcangeli wrote: > _Where_ do you want to run atomic_inc_and_test()? On random kernel data note that in 99% of the cases we need the counter only in the clone(), exec() and exit() path, for these three cases we know implicitly that it's a valid buffer. (because we hold a reference to it) [subsequently we dont need any atomic_inc_and_test thing either for clone+exec+exit] An atomic counter is just about perfect for those uses, even in the 'no kernel lock' case. I only looked at your last (fairly large) patch, which does not seem to have _any_ effect at all as far as bugfixes are concerned, except the array.c change (which again turned out to have nothing to do with the atomic vs. spinlock thing). for 'other process' uses (for cases when we want to dereference an mm_struct pointer but do not hold a reference to it, eg. /proc or the VM scanning logic) we will have to do something smart when we remove the kernel lock in 2.3, but it should not increase the cost of the common case (clone() + exit()) if possible. -- mingo - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.rutgers.edu Please read the FAQ at http://www.tux.org/lkml/
From: Andrea Arcangeli <and...@e-mind.com> Subject: Re: [patch] fixed both processes in D state and the /proc/ oopses [Re: [patch] Fixed the race that was oopsing Linux-2.2.0] Date: 1999/01/29 Message-ID: <fa.is0bdvv.18guopf@ifi.uio.no>#1/1 X-Deja-AN: 438339708 Original-Date: Fri, 29 Jan 1999 15:14:37 +0100 (CET) Sender: owner-linux-ker...@vger.rutgers.edu Original-Message-ID: <Pine.LNX.3.96.990129144844.886A-100000@laser.bogus> References: <fa.ng9f1mv.1gkgjov@ifi.uio.no> To: MOLNAR Ingo <mi...@chiara.csoma.elte.hu> X-Sender: and...@laser.bogus Content-Type: TEXT/PLAIN; charset=US-ASCII X-Orcpt: rfc822;linux-kernel-outgoing-dig X-PgP-Public-Key-URL: http://e-mind.com/~andrea/aa.asc Organization: Internet mailing list MIME-Version: 1.0 Newsgroups: fa.linux.kernel X-Loop: majord...@vger.rutgers.edu On Fri, 29 Jan 1999, MOLNAR Ingo wrote: > note that in 99% of the cases we need the counter only in the clone(), > exec() and exit() path, for these three cases we know implicitly that it's > a valid buffer. (because we hold a reference to it) [subsequently we dont > need any atomic_inc_and_test thing either for clone+exec+exit] An atomic > counter is just about perfect for those uses, even in the 'no kernel lock' > case. Sure, if you look at my last email to Linus, you'll see that I am _only_ talking about getting the mm of a random process (not the current one!). Probably I should comment better my patches, but I have really a big leakage of spare time these days but I _don't_ want to decrease the kernel hacking (even if Linus asked me to go away two times). > I only looked at your last (fairly large) patch, which does not seem to > have _any_ effect at all as far as bugfixes are concerned, except the > array.c change (which again turned out to have nothing to do with the > atomic vs. spinlock thing). Infact, the only bugfix is array.c (as I just pointed out clearly in bugtraq). Another reason I didn't either checked about lock_kernel() two night ago before adding mm_lock, is that it was very late, I had a little time and I seen some bugreport on the list that was oopsing in mmput (so I thought "yeah, I seen the race!", I thought it was the mmput/current->mm = &init_mm race). Then I also seen the mm->count as atomic_t so I thought it was really the case. Ah note, there was overhead also in the overhead since checking for mm->count == 0 in mmget was a nono ;). > for 'other process' uses (for cases when we want to dereference an > mm_struct pointer but do not hold a reference to it, eg. /proc or the VM > scanning logic) we will have to do something smart when we remove the > kernel lock in 2.3, but it should not increase the cost of the common case > (clone() + exit()) if possible. Ok this is a completly different story and I am the first that don't want to continue it now, but I want to point out that atomic_t is __far__ to be the only thing we'll nee to make the mm_struct browsing race free (as I understood from Linus's word). I am very very happy to see that I was not (completly ;) crazy. Andrea Arcangeli - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.rutgers.edu Please read the FAQ at http://www.tux.org/lkml/
From: Linus Torvalds <torva...@transmeta.com> Subject: Re: [patch] fixed both processes in D state and the /proc/ oopses [Re: [patch] Fixed the race that was oopsing Linux-2.2.0] Date: 1999/01/29 Message-ID: <fa.njl97ev.1mnik0k@ifi.uio.no>#1/1 X-Deja-AN: 438425282 Original-Date: Fri, 29 Jan 1999 10:24:16 -0800 (PST) Sender: owner-linux-ker...@vger.rutgers.edu Original-Message-ID: <Pine.LNX.3.95.990129101917.12610G-100000@penguin.transmeta.com> References: <fa.iuuq3uv.16g06bm@ifi.uio.no> To: Andrea Arcangeli <and...@e-mind.com> X-Authentication-Warning: penguin.transmeta.com: torvalds owned process doing -bs Content-Type: TEXT/PLAIN; charset=US-ASCII X-Orcpt: rfc822;linux-kernel-outgoing-dig Organization: Internet mailing list MIME-Version: 1.0 Newsgroups: fa.linux.kernel X-Loop: majord...@vger.rutgers.edu On Fri, 29 Jan 1999, Andrea Arcangeli wrote: > > > If you want to touch some _other_ process mm pointer, that's when it gets > > interesting. Buyer beware. > > Infact this is the point. I really think you are missing something. I read > your explanation of why we only need atomic_t but it was not touching some > point I instead thought about. > > Ok, I assume you are right. Please take this example: I am writing a nice > kernel module that will collect some nice stats from the kernel. And that's where you have problems. You shouldn't do that, and that's why /proc is such a nasty beast right now. If you want to look at other peoples processes, then the onus should be on _you_ to do all the extra crap that normal processes do not need to do. That extra crap can be a number of things, but you shouldn't penalize the normal path (which is to touch only your own mm space). For example, the thing I suspect we'll have to do in the long run for /proc is: - get the process while holding the tasklist lock, and increment the page count so that even if the process exists, the page does not get unused. - get the kernel lock. Now we know that we're atomic with regard to __exit_mm() - look at tsk->mm: it it is not init_mm, you're now safe, because __exit_mm is called only with the kernel lock held. See? No spinlocks. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.rutgers.edu Please read the FAQ at http://www.tux.org/lkml/