From: hedrick@dumas.rutgers.edu (Charles Hedrick)
Newsgroups: alt.os.linux
Subject: another dead filesystem and that fsck can't fix
Date: 2 Feb 92 20:28:48 GMT
Organization: Rutgers Univ., New Brunswick, N.J.

I'm a victim of what is probably the same problem somebody reported a
bit ago: I have a directory that fsck complains doesn't have . and ..
at the beginning.  Whenever I try to look at anything in it, the kernel
panics, claiming it's trying to deallocate 0.  fsck reports an error,
but doesn't do anything about it.

I can probably rebuild the file system, but it's a pain.

By the way, it's now pretty clear that there's a timing problem (a
race or something) in the file system or hd code.  Basically whenever
I am doing file system I/O in two jobs at the same time (e.g. on two
screens), I lose.  Examples: copying a large file from MSDOS to linux,
using mread.  At the same time I log in on a different screen, which
does a bit of I/O (the login program, init files for bash).  The
system hung.  Or extracting from a large tar file and simultaneously
doing ls and du to see how things are progressing.  My disk is fairly
fast (it's one of the new Conner IDE disks, which I believe is 8 msec
average seek).  Perhaps it turns up race conditions not seen with
slower disks.
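
The tar-plus-ls/du case above can be approximated with a short script.
This is only a rough sketch: the scratch directory, file count and file
sizes are made up for illustration, it assumes a shell that has mktemp,
and there is no guarantee it triggers the hang.  It runs a tar copy as
one job while a second job keeps doing ls and du on the destination,
then checks that the copy came out intact.

```shell
#!/bin/sh
# Hypothetical reproduction sketch; paths and sizes are assumptions.
set -eu

work=$(mktemp -d)
mkdir "$work/src" "$work/dst"

# Create some test data to copy: 16 files of 64 KB each.
i=0
while [ "$i" -lt 16 ]; do
    dd if=/dev/zero of="$work/src/file$i" bs=1024 count=64 2>/dev/null
    i=$((i + 1))
done

# Second job: keep polling the destination while the copy runs.
(
    while :; do
        ls -l "$work/dst" >/dev/null 2>&1
        du -s "$work/dst" >/dev/null 2>&1
    done
) &
poller=$!

# First job: copy the tree through a tar pipe.
( cd "$work/src" && tar cf - . ) | ( cd "$work/dst" && tar xf - )

# Stop the poller and compare source and destination.
kill "$poller" 2>/dev/null || true
wait "$poller" 2>/dev/null || true

if diff -r "$work/src" "$work/dst" >/dev/null; then
    result=ok
else
    result=mismatch
fi
echo "$result"

rm -rf "$work"
```

If the script finishes and prints "ok", the copy completed and matched.
A hang or a mismatch would point at the same kind of problem.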

This makes the system sort of dangerous to use, given that fsck won't
fix it.  Even a way to manually remove the directory would be welcome.

From: rad@merlin.think.com (Bob Doolittle)
Newsgroups: alt.os.linux
Subject: Re: another dead filesystem and that fsck can't fix
Date: 13 Feb 92 13:50:51 GMT
Organization: Thinking Machines Corporation, Cambridge Mass., USA
NNTP-Posting-Host: merlin.think.com
In-reply-to: zuazaga@ucunix.san.uc.edu's message of 4 Feb 92 19:56:05 GMT


In article <Feb.2.15.28.47.1992.19090@dumas.rutgers.edu> 
hedrick@dumas.rutgers.edu (Charles Hedrick) writes:
>By the way, it's now pretty clear that there's a timing problem (a
>race or something) in the file system or hd code.  Basically whenever
>I am doing file system I/O in two jobs at the same time (e.g. on two
>screens), I lose.

As I said in an earlier posting, when I tried to copy partitions via:
"tar cvf - foo bar | (cd blech; tar xf -)"
it hung my system as well.  It copied a few files, then the disk stopped
being accessed and everything just sat there.  Sounds like the same
problem.

What tools are folks using to debug kernel problems?  There is no adb or
even ps yet, so what do you do?  dd from /dev/mem and disassemble?  kernel
printfs (eek!)?

Enquiring minds need to know...

-Bob

--

--------------------------------------------------------------------------------
Bob Doolittle					   Thinking Machines Corporation
(617) 234-2734						        245 First Street
rad@think.com						     Cambridge, MA 02142
--------------------------------------------------------------------------------

From: torvalds@klaava.Helsinki.FI (Linus Benedict Torvalds)
Newsgroups: alt.os.linux
Subject: Re: another dead filesystem and that fsck can't fix
Date: 14 Feb 92 11:00:31 GMT
Organization: University of Helsinki

In article <RAD.92Feb13145051@merlin.think.com> rad@merlin.think.com 
(Bob Doolittle) writes:
>
>As I said in an earlier posting, when I tried to copy partitions via:
>"tar cvf - foo bar | (cd blech; tar xf -)"
>it hung my system as well.  It copied a few files, then the disk stopped
>being accessed and everything just sat there.  Sounds like the same
>problem.

Well, yes. I'm still hoping it's the out-of-memory bug (which I have
corrected), but I'm looking into the fs as well :(.

>What tools are folks using to debug kernel problems?  There is no adb or
>even ps yet, so what do you do?  dd from /dev/mem and disassemble?  kernel
>printfs (eek!)?

Tools? We don't need no ...  :) Printk's in the kernel are the standard
"debugging" trick.  If somebody comes up with something better, feel
free to post: I'm not too happy about it either, but it's simple. 

This doesn't just extend to the kernel: debugging user programs isn't
exactly easy under linux either :( - I've resorted to things like

$ od -hx executable | less

to find errors after a program crash... That's what the debugging info
printed after exceptions is there for. Oh, well..

		Linus

From: bir7@leland.Stanford.EDU (Ross Biro)
Newsgroups: alt.os.linux
Subject: Re: another dead filesystem and that fsck can't fix
Date: 15 Feb 92 20:47:54 GMT
Organization: DSG, Stanford University, CA 94305, USA

In article <1992Feb14.110031.2731@klaava.Helsinki.FI> 
torvalds@klaava.Helsinki.FI (Linus Benedict Torvalds) writes:

>This doesn't just extend to the kernel: debugging user programs isn't
>exactly easy under linux either :( - I've resorted to things like

	GDB is almost usable.  The current status is that it can
set break points, check memory (variables), look at source, and restart
breakpoints that were coded into the executable.  Currently it cannot
restart breakpoints which were set from within GDB.  I'm working on it.

	Ross Biro bir7@leland.stanford.edu

From: joel@wam.umd.edu (Joel M. Hoffman)
Newsgroups: alt.os.linux
Subject: Re: [file system problem or memory problem?]
Date: 16 Feb 92 21:59:24 GMT
Organization: University of Maryland at College Park
Nntp-Posting-Host: rac2.wam.umd.edu

Many people have reported that Linux crashes during disk-intensive
operations, and have speculated that it's either a file system problem
(unlikely) or a mem. management problem (more likely, they say).  Is
it possible that it's a hard-drive problem?  

I know that on my machine (a 386 at 25MHz, IDE drive), DJGPP (GCC for
DOS) occasionally reports a ``Not Ready Error Reading Drive C:'' which
of course is preposterous.  It's a fixed disk and is always ready.
GNU Emacs (DEMACS) also crashes sometimes during disk access,
presumably because it's getting the not ready error and doesn't know
what to do about it.  Does the kernel check for this?

I don't really know what can be done about the problem.  I know that
(with DJGPP) by the time the error message pops up, it's too late.
The machine has crashed.  But perhaps this need not be so.

Alas, like so many other problems, I have yet to find an exact way of
replicating the problem.  Editing a 4MB binary file with DEMACS
usually does it, though....

From: bir7@leland.Stanford.EDU (Ross Biro)
Newsgroups: alt.os.linux
Subject: Re: [file system problem or memory problem?]
Date: 17 Feb 92 06:53:09 GMT
Organization: DSG, Stanford University, CA 94305, USA

In article <1992Feb16.215924.3334@wam.umd.edu> joel@wam.umd.edu 
(Joel M. Hoffman) writes:
>Many people have reported that Linux crashes during disk-intensive
>operations, and have speculated that it's either a file system problem
>(unlikely) or a mem. management problem (more likely, they say).  Is
>it possible that it's a hard-drive problem?  
>

	Another data point.  I have a 386/20 with 8 meg and a 330 meg ESDI
hard drive.  I think there is a hardware problem related to the hard
drive: ESIX would periodically hang with the disk-access light on, and
sometimes complain about NMIs.  These things would always happen
when the hard drive was under intensive use.  Linux has crashed with
the hard drive light on a few times, and with it off many times.  One
time the crash happened when I had about 800 pages of free memory.  I
have never had a problem under DOS.  Perhaps other people are
experiencing similar hardware problems.  I know the systems are
similar.

	Ross Biro bir7@leland.stanford.edu

From: joel@wam.umd.edu (Joel M. Hoffman)
Newsgroups: alt.os.linux
Subject: Re: [file system problem or memory problem?]
Date: 17 Feb 92 13:42:57 GMT
Organization: University of Maryland at College Park
Nntp-Posting-Host: rac2.wam.umd.edu

In article <1992Feb17.065309.7827@morrow.stanford.edu> 
bir7@leland.Stanford.EDU (Ross Biro) writes:
>In article <1992Feb16.215924.3334@wam.umd.edu> joel@wam.umd.edu (Joel M. Hoffman) writes:
>>Many people have reported that Linux crashes during disk-intensive
>>operations, and have speculated that it's either a file system problem
>>(unlikely) or a mem. management problem (more likely, they say).  Is
>>it possible that it's a hard-drive problem?  
>>
>
>	Another data point.  I have a 386/20 with 8 meg and a 330 meg ESDI
>hard drive.  I think there is a hardware problem related to the hard
>drive: ESIX would periodically hang with the disk-access light on, and
>sometimes complain about NMIs.  These things would always happen
>when the hard drive was under intensive use.  Linux has crashed with
>the hard drive light on a few times, and with it off many times.  One
>time the crash happened when I had about 800 pages of free memory.  I
>have never had a problem under DOS.  Perhaps other people are

One more point of clarification.  When my system would crash, the hard
drive light would also stay on.  And I never experienced the problem
in real mode, only in protected mode. 

-Joel