From: Aaron Lehmann <aar...@vitelus.com>
Subject: The stability crisis
Date: 1999/06/29
Message-ID: <fa.jvgopmv.b04ihs@ifi.uio.no>#1/1
X-Deja-AN: 495369892
Original-Date: Tue, 29 Jun 1999 22:08:30 +0000 (   )
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <Pine.LNX.4.05.9906292204410.603-100000@vitelus.com>
To: torva...@transmeta.com, a...@lxorguk.ukuu.org.uk, linux-ker...@vger.rutgers.edu
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Orcpt: rfc822;linux-kernel-outgoing-dig
Organization: Internet mailing list
MIME-Version: 1.0
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

Linux 2.0.36 was a very stable kernel. I have never experienced a crash
with it. However, this does not at all hold true for the 2.2.x series.

During the initial stage of the 2.2 series, it was pretty darn stable. I
got about 60 days of uptime out of 2.2.1 until a power failure or a need
to mess with hardware or something. (Actually, now I think it was a hard
lockup). Back then we knew that 2.2 was not at all as stable as 2.0.36,
but we knew it would mature.

WRONG!

Linus waited a few months to open the 2.3 branch. A lot of untested
patches were making it into the 2.2 series! People like me breathed a sigh
of relief when Linus opened up the 2.3 branch. Now we knew that all of the
patches would go into 2.3 and 2.2 would become mature and stable like
2.0.36.

But that was only half right. Linus decided to hasten the release of 2.4
to "in the fall", and all of the developers jumped onto the 2.3 kernel,
leaving us with a stable kernel which is totally inadequate.

2.2.10 is by far less stable than any operating system I have used
excluding MacOS. During the past _week_ I have had three oopsen using
kernel 2.2.9 and 2.2.10. I have never had an oops before this week with
the exception of Linux on platforms where the ports are excusably immature
and on unstable hardware. Once I found a small bug with a friend in 2.0.x
that caused an oops but it wasn't anything major. It was fixed
immediately.

All the attention has shifted to 2.3. Most people as well as benchmarkers
are using 2.2.10. Helloo??? This is a perfect time for Microsoft to spread
FUD since the "stable" branch of Linux is far less stable than even
Windows NT. THIS IS NOT GOOD FOR LINUX OR THE PEOPLE WHO USE IT! Something
needs to be done about this fast. I recommend that 2.2.10 be made rock
solid. Most features and new device drivers can wait until fall with 2.4.
Of course, 2.4 should be made and kept very stable as a 2.5 or 2.9 is
opened up immediately.

I hate to bitch about stuff like this but if I were to try to write kernel
code I would probably just add more fatal bugs :).

Maybe Alan Cox should volunteer to maintain 2.2 :). He did a great job
with 2.0.

And all kernel hackers out there, PLEASE help make 2.2 more stable.

Speed is a problem that has been dealt with a lot lately, due to the
numerous benchmarks. I believe that this is also a priority, but
secondary to stability, at least at this level of instability.


Thanks.


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/

From: Linus Torvalds <torva...@transmeta.com>
Subject: Re: The stability crisis
Date: 1999/06/29
Message-ID: <fa.lgrpp1v.b1auij@ifi.uio.no>#1/1
X-Deja-AN: 495379219
Original-Date: Tue, 29 Jun 1999 15:46:52 -0700 (PDT)
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <Pine.LNX.4.10.9906291530430.821-100000@penguin.transmeta.com>
References: <fa.jvgopmv.b04ihs@ifi.uio.no>
To: Aaron Lehmann <aar...@vitelus.com>
X-Authentication-Warning: penguin.transmeta.com: torvalds owned process doing -bs
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Orcpt: rfc822;linux-kernel-outgoing-dig
Organization: Internet mailing list
MIME-Version: 1.0
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu



So why didn't you even include a ksymoops version of the crash? Or a good
hardware description? People do try to follow it, but it's not as if I've
seen very good reports even from people who say it's obviously bad. And
others are completely unable to reproduce the problem, so..

Right now the problem is (a) lack of good data and (b) the fact that there
were very few changes between 2.2.7 (which many claim is stable) and 2.2.9
(which many claim is broken). The major changes were actually just reverts
of 2.2.8 (which _was_ badly broken due to fs) - the majority by far is
actually ARM, Sparc, PPC and alpha merges..

SMP?

MTRR enabled?

gcc version?

Quotas?

		Linus
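
The questions above (SMP, MTRR, gcc version, quotas) can be answered
mechanically. What follows is only a rough sketch, not a tool anyone in
this thread used: the function names are made up for illustration, and it
assumes a Linux system where /proc/version, /proc/mtrr and /proc/mounts
exist in their usual form. It does not replace a ksymoops-decoded oops.

```python
import platform
import subprocess


def read_text(path):
    """Read a file, returning an empty string if it is missing or unreadable."""
    try:
        with open(path) as f:
            return f.read()
    except OSError:
        return ""


def quota_mounts(mounts_text):
    """Return mount points whose options include usrquota or grpquota."""
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # not a well-formed /proc/mounts line
        options = fields[3].split(",")
        if "usrquota" in options or "grpquota" in options:
            found.append(fields[1])
    return found


def gcc_version():
    """First line of `gcc --version`, or a placeholder if gcc cannot be run."""
    try:
        out = subprocess.run(["gcc", "--version"], capture_output=True, text=True)
    except OSError:
        return "gcc not found"
    return out.stdout.splitlines()[0] if out.stdout else "unknown"


def build_report():
    """Collect the facts asked for above into a five-line plain-text report."""
    # /proc/version also names the compiler the running kernel was built with.
    version = read_text("/proc/version").strip() or platform.release()
    mtrr = read_text("/proc/mtrr").strip()
    quotas = quota_mounts(read_text("/proc/mounts"))
    return "\n".join([
        "kernel: " + version,
        "smp:    " + ("yes" if "SMP" in platform.version() else "no"),
        "mtrr:   " + ("ranges configured" if mtrr else "no ranges found"),
        "gcc:    " + gcc_version(),
        "quotas: " + (", ".join(quotas) if quotas else "none"),
    ])


if __name__ == "__main__":
    print(build_report())
```

Note that `gcc --version` reports the compiler installed now, which is not
necessarily the one the running kernel was built with; the "kernel:" line
taken from /proc/version is the better guide for that.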



From: Aaron Lehmann <aar...@vitelus.com>
Subject: Re: The stability crisis
Date: 1999/06/29
Message-ID: <fa.jvguqev.f06j9o@ifi.uio.no>#1/1
X-Deja-AN: 495392018
Original-Date: Tue, 29 Jun 1999 23:04:10 +0000 (   )
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <Pine.LNX.4.05.9906292250220.821-100000@vitelus.com>
References: <fa.lgrpp1v.b1auij@ifi.uio.no>
To: Linus Torvalds <torva...@transmeta.com>
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Orcpt: rfc822;linux-kernel-outgoing-dig
Organization: Internet mailing list
MIME-Version: 1.0
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

I really wish I could report my oopses, but this is a production box and I
can't just let it sit there while I write down an oops. The syslog
doesn't catch the OOPS except for sometimes the first few lines. Uptime is
very important to me, as you have probably noticed from my rant. I don't
want to use a serial console because I don't have another machine in the
vicinity of 20 feet that would be capable of easily logging kernel
messages.

I've heard about a new patch that lets the kernel dump oopsen to a floppy,
and I'll try it. It scares me that I might accidentally leave a floppy in
the drive that actually has data.

As I said in a previous message to linux-kernel, I'd be happy to maintain
a bug database if that would be within my realm of comprehension (I don't 
know very much about the kernel internals...).

The machine is a Cyrix 6x86MX (no SMP) running RedHat 5.1 with most of the
packages at either 5.2 or 6.0 versions. MTRR is enabled in the kernel but
I haven't used it for anything yet so I would assume that it is not
causing problems. I don't run X. No quotas.


[aaronl@vitelus aaronl]$ gcc --version
egcs-2.91.66


On Tue, 29 Jun 1999, Linus Torvalds wrote:

> 
> 
> So why didn't you even include a ksymoops version of the crash? Or a good
> hardware description? People do try to follow it, but it's not as if I've
> seen very good reports even from people who say it's obviously bad. And
> others are completely unable to reproduce the problem, so..
> 
> Right now the problem is (a) lack of good data and (b) the fact that there
> were very few changes between 2.2.7 (which many claim is stable) and 2.2.9
> (which many claim is broken). The major changes were actually just reverts
> of 2.2.8 (which _was_ badly broken due to fs) - the majority by far is
> actually ARM, Sparc, PPC and alpha merges..
> 
> SMP?
> 
> MTRR enabled?
> 
> gcc version?
> 
> Quotas?
> 
> 		Linus
> 



From: Mark Hull-Richter <ma...@procom.com>
Subject: Re: The stability crisis
Date: 1999/06/30
Message-ID: <fa.c9l1jlv.n7i0p4@ifi.uio.no>#1/1
X-Deja-AN: 495627048
Original-Date: Wed, 30 Jun 1999 07:56:31 -0700
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <377A301F.BA573561@procom.com>
References: <fa.lgrpp1v.b1auij@ifi.uio.no>
To: linux-ker...@vger.rutgers.edu
Original-References: <Pine.LNX.4.10.9906291530430.821-100...@penguin.transmeta.com>
X-Accept-Language: en
Content-Type: multipart/mixed; boundary="------------90942A2696FE5ABBEFF44B52"
X-Orcpt: rfc822;linux-kernel-outgoing-dig
Organization: Procom Technology, Inc.
MIME-Version: 1.0
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

I think the problem may have something to do with the size of the oops
information and how much is left on the screen.  I am doing Alpha
development here, and virtually every oops I get a) double oopses and b)
leaves no traces in dmesg or the log.  In this situation, the first oops
is long gone before I can even see it, and the second one is more than
half gone by the time it's done, which means what I see on the screen is
close to useless.  At present this is not a sufficiently critical issue
for us that we need to dive in and debug them, and when it becomes one I
suspect we'll wire up the serial line for a more persistent tracking
device (like a serial printer, a la Ingo's suggestion elsewhere on this
list).
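
For the serial-line approach, the machine on the other end of the cable
only needs to copy whatever arrives to stable storage. A minimal sketch of
such a logger follows, under stated assumptions: the function name and the
log file name are invented for illustration, /dev/ttyS0 is just an example
device, and the port is assumed to be already configured (speed, raw mode)
before the script starts.

```python
import io
import os
import time


def log_console(stream, log_file, clock=time.time):
    """Copy lines from a console byte stream to log_file, timestamping each.

    Every line is flushed (and fsync'ed where possible) as soon as it is
    read, so whatever arrived before a lockup is already on disk.
    Returns the number of lines written.
    """
    count = 0
    for raw in stream:
        text = raw.decode("ascii", errors="replace").rstrip("\r\n")
        stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(clock()))
        log_file.write(f"{stamp} {text}\n")
        log_file.flush()
        try:
            os.fsync(log_file.fileno())
        except (OSError, io.UnsupportedOperation):
            pass  # in-memory streams have no file descriptor to sync
        count += 1
    return count


if __name__ == "__main__":
    # Example device and file names only; adjust for the actual setup.
    with open("/dev/ttyS0", "rb", buffering=0) as port, \
            open("console.log", "a") as log:
        log_console(port, log)
```

Because each line is written out immediately, a double oops that scrolls
off the screen would still be preserved in full, first oops included.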

Just my $.02, and only for a part of the issues Linus notes.

Linus Torvalds wrote:
> 
> So why didn't you even include a ksymoops version of the crash? Or a good
> hardware description? People do try to follow it, but it's not as if I've
> seen very good reports even from people who say it's obviously bad. And
> others are completely unable to reproduce the problem, so..
> 
> Right now the problem is (a) lack of good data and (b) the fact that there
> were very few changes between 2.2.7 (which many claim is stable) and 2.2.9
> (which many claim is broken). The major changes were actually just reverts
> of 2.2.8 (which _was_ badly broken due to fs) - the majority by far is
> actually ARM, Sparc, PPC and alpha merges..
> 
> SMP?
> 
> MTRR enabled?
> 
> gcc version?
> 
> Quotas?
> 
>                 Linus




From: Matthew Vanecek <mev0...@unt.edu>
Subject: Re: The stability crisis
Date: 1999/07/01
Message-ID: <fa.dnc08jv.12umbu@ifi.uio.no>#1/1
X-Deja-AN: 496078407
Original-Date: Thu, 01 Jul 1999 11:04:42 -0500
Sender: owner-linux-ker...@vger.rutgers.edu
Content-Transfer-Encoding: 7bit
Original-Message-ID: <377B919A.897A81A4@unt.edu>
References: <fa.c9l1jlv.n7i0p4@ifi.uio.no>
To: linux-ker...@vger.rutgers.edu
Original-References: <Pine.LNX.4.10.9906291530430.821-100...@penguin.transmeta.com> 
<377A301F.BA573...@procom.com>
X-Accept-Language: en
Content-Type: text/plain; charset=us-ascii
X-Orcpt: rfc822;linux-kernel-outgoing-dig
Organization: University of North Texas
MIME-Version: 1.0
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

Mark Hull-Richter wrote:
> 
> I think the problem may have something to do with the size of the oops
> information and how much is left on the screen.  I am doing Alpha
> development here, and virtually every oops I get a) double oopses and b)
> leaves no traces in dmesg or the log.  In this situation, the first oops
> is long gone before I can even see it, and the second one is more than
> half gone by the time it's done, which means what I see on the screen is
> close to useless.  At present this is not a sufficiently critical issue
> for us that we need to dive in and debug them, and when it becomes one I
> suspect we'll wire up the serial line for a more persistent tracking
> device (like a serial printer, a la Ingo's suggestion elsewhere on this
> list).
> 
> Just my $.02, and only for a part of the issues Linus notes.
> 
> Linus Torvalds wrote:
> >
> > So why didn't you even include a ksymoops version of the crash? Or a good
> > hardware description? People do try to follow it, but it's not as if I've
> > seen very good reports even from people who say it's obviously bad. And



That's pretty much been my experience. 2.2.10 crashes on a regular
basis.  Certainly more than previous kernels.  Why? Who's to say?  It
leaves behind no information.  Nothing in the logs, nothing in dmesg
(which changes with each boot up, anyhow).  There's no way to try the
magic SysRq key, as the keyboard is completely locked up.  I can't
telnet/ssh/ftp/ping the box, as it evidently stops processing all
network requests.  In short, there is absolutely no indication, not the
slightest oops or byte left over, to even begin to give the inkling of a
clue about why the system crashed.  So how do you debug that?  I don't
even know how to cause the crash; usually I get up in the morning, or
come home from work, to find the machine all locked up.

This particular machine is an AMD K6-2/350 *not* OC'ed, 64M Ram, Asus
P5A mobo, Buslogic BT-932 with a 4.5 Seagate, 24x Panasonic, and a JVC
2010 clone CDR, with a NetGear FA310TX (new one) nic, and an SiS 6326
video card.  I use Redhat 6.0, fixed.  Currently, I have X (SVGA),
Window Maker 0.60.0 (I upgraded, in case it was WM causing lockups),
Samba 2.0.4b, knfsd 1.3.2, and my gnome is custom compiled.

OTOH, my masq machine doesn't crash.  It's a little 486/120 w/32M Ram,
mobo unknown, running a D-Link 220 NIC, an SIIG (Promise chipset) EIDE
controller card, an Orchid Fahrenheit video card (rarely used), and a
Zoom modem.  With the exception of X programs, it's got the same
software as my workstation, RH 6.0, samba, knfsd, etc.  Just no X.

I suffer, and hope that 2.2.11 will solve my problems with my WS.
-- 
Matthew Vanecek
Course of Study: http://www.unt.edu/bcis
Visit my Website at http://people.unt.edu/~mev0003
For answers type: perl -e 'print
$i=pack(c5,(41*2),sqrt(7056),(unpack(c,H)-2),oct(115),10);'
*****************************************************************
For 93 million miles, there is nothing between the sun and my shadow
except me. I'm always getting in the way of something...


From: br...@worldcontrol.com
Subject: Re: The stability crisis
Date: 1999/07/02
Message-ID: <fa.i6n72nv.6k65h6@ifi.uio.no>#1/1
X-Deja-AN: 496368903
Original-Date: Fri, 2 Jul 1999 02:54:25 -0700
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <19990702025425.B1097@top.worldcontrol.com>
References: <fa.lgrpp1v.b1auij@ifi.uio.no>
To: linux-ker...@vger.rutgers.edu
Original-References: <Pine.LNX.4.05.9906292204410.603-100...@vitelus.com> 
<Pine.LNX.4.10.9906291530430.821-100...@penguin.transmeta.com>
Content-Type: text/plain; charset=us-ascii
X-Orcpt: rfc822;linux-kernel-outgoing-dig
Organization: Internet mailing list
Mime-Version: 1.0
User-Agent: Mutt/0.96.2i
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

On Tue, Jun 29, 1999 at 03:46:52PM -0700, Linus Torvalds wrote:
> So why didn't you even include a ksymoops version of the crash? Or a good
> hardware description? People do try to follow it, but it's not as if I've
> seen very good reports even from people who say it's obviously bad. And
> others are completely unable to reproduce the problem, so..
> 
> Right now the problem is (a) lack of good data and (b) the fact that there
> were very few changes between 2.2.7 (which many claim is stable) and 2.2.9
> (which many claim is broken). The major changes were actually just reverts
> of 2.2.8 (which _was_ badly broken due to fs) - the majority by far is
> actually ARM, Sparc, PPC and alpha merges..
> 
> SMP?
> 
> MTRR enabled?
> 
> gcc version?
> 
> Quotas?
> 
> 		Linus

I'm not the person Linus was addressing, but I've had plenty of
oopses with 2.2.1 - 2.2.10 and have not sent any in.  

So far as I know there are only two ways to capture the data related to
an oops.  Write it down with a pencil, or capture it via a serial port
on another machine.  The first seems too prone to errors, and the second
just isn't realistic for me and my cluster of machines.  Too many
serial cables going every which way.  Or maybe I'm just lazy.

I have a setup which oopses in 5 minutes to a few days when compiled
with SMP support.  The identical source compiled without SMP runs
forever as far as I can tell.

Since all things have previously been discussed on this list, I'm going
to let my Linux "newbieism" show by asking for a feature which has
undoubtedly been asked for before and has undoubtedly been shot
down for very legitimate reasons.

I would like my oops'ing systems to send the oops to another system via
an ethernet interface.  How about a UDP packet?  Nice connectionless
protocol. Compile the MAC/IP address into the kernel.  Oops occurs,
build the UDP packet with the measly 2K oops message in it and send.
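
To make the idea concrete, here is a userspace sketch of both ends of such
an exchange. It is an illustration only and says nothing about how the
sending side could be done inside a crashing kernel. All names here
(send_oops, receive_oops, open_collector) and port 6666 are hypothetical,
chosen just for this example.

```python
import socket

MAX_OOPS_BYTES = 2048  # the "measly 2K" mentioned above


def send_oops(text, addr):
    """Send an oops report as a single UDP datagram (fire and forget)."""
    payload = text.encode("ascii", errors="replace")[:MAX_OOPS_BYTES]
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, addr)
    return len(payload)


def open_collector(host="0.0.0.0", port=0):
    """Bind a UDP socket for the collecting machine; port 0 picks a free port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def receive_oops(sock):
    """Block until one datagram arrives; return (sender_ip, text)."""
    data, (host, _port) = sock.recvfrom(MAX_OOPS_BYTES)
    return host, data.decode("ascii", errors="replace")


if __name__ == "__main__":
    # Collector loop: print every report together with the sender's address.
    collector = open_collector(port=6666)  # hypothetical port number
    while True:
        sender, report = receive_oops(collector)
        print(f"--- oops from {sender} ---")
        print(report)
```

One caveat with the numbers: a full 2K datagram is larger than the
1500-byte Ethernet MTU, so it would be split into IP fragments. Keeping
each packet to 1472 bytes of payload or less (1500 minus the 20-byte IPv4
header and the 8-byte UDP header) avoids fragmentation, at the cost of
splitting a long oops across several packets. UDP also gives no delivery
guarantee, so a lost packet means a lost report.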

-- 
Brian Litzinger <br...@litzinger.com>
