Segfaults

Mon Feb 14 14:14:17 PST 2000

On Mon, 14 Feb 2000, Devin Carraway wrote:
> Subject: Re: Segfaults
> On Mon, Feb 14, 2000 at 08:37:56AM -0800, ME wrote:
> [...]
> > (I have seen many of the following lead to seg faults depending upon the
> > application being used, and hardware combination. There are web pages "out
> > there" to help you diagnose these.)
> > 
> > A short list:
> > Over-heating of hardware: memory, cpu, bus
> > Over-clocked CPU beyond operating limits
> > Over-clocking of motherboard speed
> > Library files trashed
> > Applications being run are buggy
> > Applications are corrupted
> > Applications or services attempt to write outside kernel allocated memory
> > space.
> > System over-heating
> 
> 	The CPU was a PPro 180, so AFAIK it wasn't overclocked; heat I can't
> attest to -- overheating of any of the listed components could induce this
> sort of thing.  The segfaults occurred on one particular instance, where the
> box would segfault on any command -- my speculation is that the kernel gets
> the idea that the memory is full or bad, or something goes wrong with the
> alloc routines, causing allocs to fail; fork() entails some alloc, so init
> or the shell can't start anything new, resulting in a largely unrecoverable
> mess.

I did have some problems with a dist of debian like this. It was a problem
with the library in the dist being corrupt, and a problem with their
compiled version of bash. (Probably not the same problem, but...)

When I compiled a static binary of bash, and replaced the dynamically linked
version of bash, all worked well. (Copied bash from a different Linux box.)

I later grabbed another dist, clobbered everything, and started again.
The new copy worked fine.

It might be worth a try...

Another thing to try:
Let the machine cool down over night.
When booting the next day, choose to boot with the lilo boot
option "init=/bin/bash".

Run a few commands. Be careful to run "sync" if you remount the fs as rw,
or else memory cached data will not be flushed to disk. I would suggest
leaving the fs ro for testing.

If that immediately causes seg faults, it would point to bash, or library
damage. If not, then leave it running for a while, and check back every 15
minutes or so and run a command or two. If problems only start after it
has been left on for a while, then hardware failure due to heat is more
likely.
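
If you would rather not babysit the box, the "check back every 15 minutes"
routine could be scripted. This is only a rough sketch, and it assumes Python
is installed and runnable on that machine, which may well not be true of a
half-finished install. The function names and the defaults are made up for
illustration; nothing here comes from Debian or lilo.

```python
import signal
import subprocess
import time


def died_of_segfault(argv):
    """Run argv once; return True if the child was killed by SIGSEGV."""
    result = subprocess.run(
        argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    # On POSIX, a negative returncode -N means the child was killed by signal N.
    return result.returncode == -signal.SIGSEGV


def watch(argv, interval_seconds=15 * 60, rounds=8):
    """Run argv every interval_seconds, up to `rounds` times.

    Returns the seconds elapsed when the first segfault was seen,
    or None if every run finished without one.
    """
    start = time.monotonic()
    for i in range(rounds):
        if died_of_segfault(argv):
            return time.monotonic() - start
        if i < rounds - 1:
            time.sleep(interval_seconds)
    return None
```

Something like watch(["ls", "/bin"]) would then run "ls /bin" every 15
minutes for about two hours. A segfault on the very first run points back at
bash or the libraries; one that only shows up after an hour or so fits the
heat theory better.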

Also, if there is a hardware heating problem which occurs after a certain
amount of time being left on, this heating in memory or of the CPU may
cause corruption of data *during* the installation. Use "cmp" to compare
the installed files against the unarchived files from the dist, and see if
there is a difference in the binaries.
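
Running "cmp" by hand over a whole tree gets tedious, so here is a rough
sketch of the same byte-by-byte comparison done in bulk. It assumes Python is
available (possibly on a second, known-good machine with the suspect disk
mounted) and that a clean copy of the dist files has been unpacked somewhere.
The function name and both directory arguments are hypothetical.

```python
import filecmp
import os


def differing_files(installed_root, reference_root):
    """Return relative paths of files that differ from the reference tree.

    A file counts as differing if it is missing under installed_root or if
    its bytes do not match the copy under reference_root.
    """
    bad = []
    for dirpath, _dirnames, filenames in os.walk(reference_root):
        for name in filenames:
            ref_path = os.path.join(dirpath, name)
            rel = os.path.relpath(ref_path, reference_root)
            inst_path = os.path.join(installed_root, rel)
            if not os.path.isfile(inst_path):
                bad.append(rel)
            # shallow=False forces a byte-by-byte comparison, like cmp.
            elif not filecmp.cmp(ref_path, inst_path, shallow=False):
                bad.append(rel)
    return sorted(bad)
```

An empty list means every reference file matched. A handful of mismatched
binaries that changes from one install attempt to the next would make me
suspect memory or CPU heat rather than a bad dist.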

> 	On two prior attempts the unit first panicked repeatedly, and the
> second time oopsed, gpfed or panicked (can't recall which).  This was
> fairly late in a Debian Slink install using a 2.0.38 kernel, everything
> having gone swimmingly up to that stage.

Maybe compiling a different kernel and copy of bash on a different machine,
and then copying them over using the Debian EBD to start up, or using
loadlin to boot the newly compiled kernel, will help?

> 	It's remotely possible that it's simply a problem with this specific
> PPro board (Tyan Titan Pro; I've got one and never saw a problem, but that
> was in SMP, Nancy's is UP), in which case it'd be worth attempting an
> install with something 2.2-based (e.g. redhat) to see if that breaks.  If
> not, then Debian's potato bootdisks are in testing now and might be usable.

Yeah, I think I would try a newer version of Debian, or Red Hat. If they
show the same problems, then it is more likely to be a hardware problem.

Your description of what was tried sounds a lot like how I would have
started my diagnosis. I agree with your intelligent assessment of what
was described.

(It is refreshing dealing with fellow Linux users: they seem to have a
pre-disposition to solving problems, and they do it well. Thanks :)

-ME