From: torva...@transmeta.com (Linus Torvalds)
Subject: Linux-2.1.100...
Date: 1998/05/08
Message-ID: <Pine.LNX.3.95.980507192644.21545A-100000@penguin.transmeta.com>#1/1
X-Deja-AN: 351312390
Approved: g...@greenie.muc.de
Sender: muc.de!l-linux-kernel-owner
Newsgroups: muc.lists.linux-kernel



Ok, I just released 2.1.100, which does:
 - fix an ugly lockup on SMP that could fairly easily happen if you used
   your floppies.
 - various irq/apic fixes - this should get us back to booting on the
   machines that had problems with the earlier versions.
 - capabilities stuff - get rid of many suser() calls to instead use the
   more fine-grained capabilities.
 - IDE driver updates
 - Coda FS update
 - various network fixes from David (the oops in the TCP hashing code is
   fixed, among others)

As has already been found out by earlier testers, pppd has problems with
newer kernels. The problems are:
 - using "strcmp()" to do numeric comparisons. That's a no-no, and pppd
   needs to be fixed (patches have floated around). It breaks because a
   string comparison thinks 100 is smaller than 16.
 - doing a route on a downed device doesn't work in recent kernels (sanely
   enough). Again, a patch to pppd is available.
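
The strcmp() pitfall above can be illustrated with a minimal sketch. It
assumes the strings being compared are kernel release strings of the form
"major.minor.patch"; the helper names parse_version() and version_cmp() are
hypothetical and are not taken from pppd or from any of the patches
mentioned.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: parse "major.minor.patch" into three ints.
 * Returns 1 if all three fields were read, 0 otherwise. Fields that
 * fail to parse keep whatever value the caller put in v[]. */
static int parse_version(const char *s, int v[3])
{
    return sscanf(s, "%d.%d.%d", &v[0], &v[1], &v[2]) == 3;
}

/* Compare two release strings field by field, as numbers.
 * Returns <0, 0 or >0 in the style of strcmp(). Missing or malformed
 * fields are treated as 0 (an assumption made for this sketch). */
int version_cmp(const char *a, const char *b)
{
    int va[3] = {0, 0, 0};
    int vb[3] = {0, 0, 0};
    int i;

    parse_version(a, va);
    parse_version(b, vb);

    for (i = 0; i < 3; i++) {
        if (va[i] != vb[i])
            return va[i] < vb[i] ? -1 : 1;
    }
    return 0;
}
```

With these definitions, strcmp("2.1.100", "2.1.16") is negative because the
character '0' sorts before '6', while version_cmp("2.1.100", "2.1.16") is
positive because 100 is compared with 16 as a number.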

Do people still have problems with lockups or bootup under SMP with this? 

		Linus



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.rutgers.edu

From: ste...@eecs.umich.edu (Steve Hsieh)
Subject: Re: Linux-2.1.100...
Date: 1998/05/08
Message-ID: <Pine.LNX.3.96.980508002933.16483C-100000@bigfoot.eecs.umich.edu>#1/1
X-Deja-AN: 351332097
Approved: g...@greenie.muc.de
Sender: muc.de!l-linux-kernel-owner
References: <Pine.LNX.3.95.980507192644.21545A-100000@penguin.transmeta.com>
Newsgroups: muc.lists.linux-kernel


> Do people still have problems with lockups or bootup under SMP with this? 
> 
> 		Linus


Yes, on my quad PPro Alder, I still get lockups with 2.1.100.
The kernel is still running, but I can't start any new processes.
Existing ones hang if they require disk access.  Existing shells and logins
are still active, but will hang as soon as you try to do something.
Alt-SysRq still works, although I don't know how to interpret any of that
info either.






From: torva...@transmeta.com (Linus Torvalds)
Subject: Re: Linux-2.1.100...
Date: 1998/05/08
Message-ID: <Pine.LNX.3.95.980507224726.22916B-100000@penguin.transmeta.com>#1/1
X-Deja-AN: 351342903
Approved: g...@greenie.muc.de
Sender: muc.de!l-linux-kernel-owner
References: <Pine.LNX.3.96.980508002933.16483C-100000@bigfoot.eecs.umich.edu>
Newsgroups: muc.lists.linux-kernel




On Fri, 8 May 1998, Steve Hsieh wrote:
> 
> Yes, on my quad ppro alder, I still get lockups with 2.1.100.
> The kernel is still running, but I can't start any new processes.
> Existing ones hang if they require disk access.  Existing shells, login
> are still active, but will hang as soon as you try to do something.
> alt-sysreq still works, although I don't know how to interpret any of that
> info either.

Ok, it appears that we have a bad case of lost interrupts, where
everything gets stuck in disk-wait. Does a Ctrl+Scroll Lock show processes
in "D" state?

		Linus



From: ste...@eecs.umich.edu (Steve Hsieh)
Subject: Re: Linux-2.1.100...[bad case of lost interrupts]
Date: 1998/05/08
Message-ID: <Pine.LNX.3.96.980508183951.13743C-100000@bigfoot.eecs.umich.edu>#1/1
X-Deja-AN: 351686140
Approved: g...@greenie.muc.de
Sender: muc.de!l-linux-kernel-owner
References: <Pine.LNX.3.95.980507224726.22916B-100000@penguin.transmeta.com>
Newsgroups: muc.lists.linux-kernel


On Thu, 7 May 1998, Linus Torvalds wrote:

> On Fri, 8 May 1998, Steve Hsieh wrote:
> > 
> > Yes, on my quad ppro alder, I still get lockups with 2.1.100.
> > The kernel is still running, but I can't start any new processes.
> > Existing ones hang if they require disk access.  Existing shells, login
> > are still active, but will hang as soon as you try to do something.
> > alt-sysreq still works, although I don't know how to interpret any of that
> > info either.
> 
> Ok, it appears that we have a bad case of lost interrupts, where
> everything gets stuck in disk-wait. Does a Ctrl+ScrolLock show processes
> in "D" state? 

Yes, I see quite a few processes stuck in "D" state.  It was first
'update', and then after that every process I tried got stuck in "D" until
I ran out of windows that worked. :)





From: dledf...@dialnet.net (Doug Ledford)
Subject: Re: Linux-2.1.100...[bad case of lost interrupts]
Date: 1998/05/08
Message-ID: <3553971C.F5B7109A@dialnet.net>#1/1
X-Deja-AN: 351688834
Approved: g...@greenie.muc.de
Sender: muc.de!l-linux-kernel-owner
References: <Pine.LNX.3.96.980508183951.13743C-100000@bigfoot.eecs.umich.edu>
Newsgroups: muc.lists.linux-kernel


Steve Hsieh wrote:
> 
> On Thu, 7 May 1998, Linus Torvalds wrote:
> 
> > On Fri, 8 May 1998, Steve Hsieh wrote:
> > >
> > > Yes, on my quad ppro alder, I still get lockups with 2.1.100.
> > > The kernel is still running, but I can't start any new processes.
> > > Existing ones hang if they require disk access.  Existing shells, login
> > > are still active, but will hang as soon as you try to do something.
> > > alt-sysreq still works, although I don't know how to interpret any of that
> > > info either.
> >
> > Ok, it appears that we have a bad case of lost interrupts, where
> > everything gets stuck in disk-wait. Does a Ctrl+ScrolLock show processes
> > in "D" state?
> 
> Yes, I see quite a few processes stuck in "D" state.

He has processes stuck in a "D" state, but I don't think lost interrupts are
his problem.  He has four PPro processors in an Alder MB, and three aic7xxx
controllers on three different IRQs.  The aic7xxx driver doesn't care if we
lose an interrupt as long as one will come along later to alleviate the
problem.  IOW, we clear a complete queue on each interrupt, regardless of
the number of entries in that complete queue.  Besides, the aic7xxx PCI
hardware uses level sensitive interrupts, and if we don't turn those
interrupts off then we simply get more interrupts even without further
completion events pending.  We actually depend on this behavior on the PCI
cards to detect PCI bus parity problems as well.

I would be more suspicious of one of two things.  Either some area in the
mid level SCSI code has been overlooked and allows something along the
lines of the enable_IOapic_irq() problem to occur (specifically, that while
inside of a spin lock, we can attempt a recursive entry on the spin lock).
Or something is causing our local cli() state to get lost while in the spin
lock, so that we take an interrupt from another IRQ on the same processor
that also wants to grab the io_request_lock, resulting in deadlock.
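
The "clear the complete queue on each interrupt" behavior described above
can be sketched roughly as follows. This is a simplified model, not the
aic7xxx driver's code: struct complete_queue, hw_complete() and
irq_handler() are hypothetical names, and the fixed-size ring buffer is an
assumption made only for illustration.

```c
#include <stddef.h>

#define QUEUE_SIZE 16   /* assumed size; a power of two keeps indexing simple */

/* Hypothetical model of a controller's "complete queue": the hardware
 * appends tags of finished commands, the interrupt handler removes them. */
struct complete_queue {
    int tags[QUEUE_SIZE];
    size_t head;    /* next entry the handler will read */
    size_t tail;    /* next slot the hardware will write */
};

/* Hardware side: record that a command finished.
 * Returns 1 on success, 0 if the queue is full. */
int hw_complete(struct complete_queue *q, int tag)
{
    if (q->tail - q->head == QUEUE_SIZE)
        return 0;
    q->tags[q->tail % QUEUE_SIZE] = tag;
    q->tail++;
    return 1;
}

/* Handler side: drain every pending entry, not just one per interrupt.
 * Returns the number of completions processed. */
size_t irq_handler(struct complete_queue *q)
{
    size_t done = 0;

    while (q->head != q->tail) {
        /* A real driver would finish the command for this tag here. */
        (void)q->tags[q->head % QUEUE_SIZE];
        q->head++;
        done++;
    }
    return done;
}
```

In this model, if three commands complete but only one interrupt is seen,
that single irq_handler() call still processes all three entries, which is
why one lost interrupt does no harm as long as a later one arrives.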

-- 

 Doug Ledford  <dledf...@dialnet.net>
  Opinions expressed are my own, but
     they should be everybody's.


From: torva...@transmeta.com (Linus Torvalds)
Subject: Re: Linux-2.1.100...[bad case of lost interrupts]
Date: 1998/05/08
Message-ID: <Pine.LNX.3.95.980508163902.25393I-100000@penguin.transmeta.com>#1/1
X-Deja-AN: 351688800
Approved: g...@greenie.muc.de
Sender: muc.de!l-linux-kernel-owner
References: <3553971C.F5B7109A@dialnet.net>
Newsgroups: muc.lists.linux-kernel




On Fri, 8 May 1998, Doug Ledford wrote:
> 
> He has processes stuck in a "D" state, but I don't think lost interrupts are
> his problem.  He has 4 PPro processor in an Alder MB, and three aic7xxx
> controllers on three different IRQs.  The aic7xxx driver doesn't care if we
> loose an interrupt as long as one will come along later to alleviate the
> problem.

Note that this problem sounds like either of:
 - egcs problem. Make sure to compile with a standard gcc or at least with
   a plain -O2 (check whether your gcc options file contains additional
   default options). There is one confirmed report of a kernel that didn't
   boot when built with certain egcs options even though it works fine
   with others.
 - something makes us lose the io-apic completely for a certain interrupt.
   I don't see anything that could do that, but the behaviour sounds like
   we just simply no longer get interrupts from the controller - at all.

For example, let's assume that you have an interrupt on irq16 through a
PCI device, and 2.1.100 for some reason doesn't ACK it. You'll still
continue to get interrupts for higher priority events, but not for that
irq or for anything lower. I don't think this is what 2.1.100 does,
because the priorities are the reverse of what this would indicate, but
there may be something we've missed.
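
The missing-ACK scenario above can be modelled with a small sketch. It is a
toy model of priority masking, not kernel or APIC code: struct irq_ctrl,
can_deliver(), deliver() and ack() are hypothetical names, and the
numbering (level 0 is the highest priority) is an assumption that does not
necessarily match how any real controller orders its interrupts.

```c
#include <stdbool.h>

#define NUM_LEVELS 16   /* assumed number of priority levels */

/* Hypothetical model of a controller's in-service state.
 * Level 0 is the highest priority, NUM_LEVELS - 1 the lowest. */
struct irq_ctrl {
    bool in_service[NUM_LEVELS];
};

/* An interrupt at `level` can be delivered only if no interrupt of the
 * same or a higher priority is still in service (not yet ACKed). */
bool can_deliver(const struct irq_ctrl *c, int level)
{
    int i;

    if (level < 0 || level >= NUM_LEVELS)
        return false;
    for (i = 0; i <= level; i++) {
        if (c->in_service[i])
            return false;
    }
    return true;
}

/* Try to deliver an interrupt; on success it is marked in service. */
bool deliver(struct irq_ctrl *c, int level)
{
    if (!can_deliver(c, level))
        return false;
    c->in_service[level] = true;
    return true;
}

/* ACK an interrupt, which unblocks that level and everything below it. */
void ack(struct irq_ctrl *c, int level)
{
    if (level >= 0 && level < NUM_LEVELS)
        c->in_service[level] = false;
}
```

In this model, delivering level 5 and never calling ack() for it leaves
levels 5 and below blocked, while level 2 is still delivered normally.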

However, at this point I do know that the compiler makes a difference, so
I'd ask everybody to make sure they are using gcc-2.7.2 if they have the
problem.

		Linus

