[Tfug] Persistent Linux and X Crashes. How to track down?
Adrian
choprboy at dakotacom.net
Wed Aug 2 11:23:23 MST 2006
On Wednesday 02 August 2006 10:42, Chad Woolley wrote:
> It was crashed again this morning.
>
> You know, I do think it might be heat. That has some correlation with
> the crashes. My "office" is the storeroom out back, so I keep it cool
> during the day (window AC), but it gets hot at night (with 3 computers
> on 24/7). All fans work. I have a huge CPU fan, multiple case fans,
> and was very careful with my thermal paste, but it could be another
> component besides the CPU.
Well... as others said, a full set of logs might help track down the problem.
In particular, look for anything that signifies a device error (e.g. hard
disk) or possibly an APIC/bus error. However, even though you said you swapped
RAM, my first thought would be bad RAM or a bad memory bus... Random crashes
are not something a normal Linux box does. Many of my boxes (running
DB/web/mail) run for months at a time under continuous multi-user load, only
dying when the power goes out (which unfortunately seems to happen several
times a year). Out of the dozens of boxes I have administered, only 2 have
given me problems: 1) an old devel box with a haxored SCSI bus and drivers
that I used for extracting data off broken disks, and 2) a laptop with bad
memory that would occasionally, but not consistently or predictably, flip a
bit without cause.
I would suggest that you grab a copy of Memtest86 and run it on the machine.
Just grab the bootable ISO and burn a CD. Boot the machine from it, and when
it starts, change the configuration to "All tests"... then let it run for a
couple of hours. If it is a memory/bus error, Memtest86 should find it.
[snip]
> Do any of you know if there's a relatively cheap product that has
> temperature sensors to capture data, and plugs into a usb/serial port?
> Then I can do trend analysis on the actual temperatures of various
> components, and see if high temperatures correlate with the crashes.
> I'm sure I could build something from scratch, but I don't want to
> take that much time.
>
Better than that... you can probably use the machine itself to tell you its
temperature. You should have a "sensors" or "lm_sensors" package (probably
already installed, but not configured) that can read the temperature sensors
built into the motherboard at various components. After configuring it
(run sensors-detect as root), a quick script could periodically dump the
temperatures to a file that you could review later, looking for trends.
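
A minimal sketch of such a script, in Python. It assumes the "sensors"
command is installed and configured, and that it prints lines roughly like
"CPU Temp:  +48.0°C"; the labels and exact layout vary by motherboard and
driver, so the regular expression may need adjusting. The function names, the
log file name, and the 60-second interval are all made up for illustration.

```python
import re
import subprocess
import time
from datetime import datetime

# Matches lines such as "CPU Temp:    +48.0°C  (high = +80.0°C)".
# Assumption: temperatures are printed as "<label>: <number>°C" (or " C").
# Fan (RPM) and voltage (V) lines do not end in C and are skipped.
TEMP_LINE = re.compile(
    r"^([^:\n]+):[ \t]*\+?(-?\d+(?:\.\d+)?)[ \t]*°?C", re.MULTILINE
)


def parse_temperatures(sensors_output):
    """Return {label: degrees Celsius} for each temperature line found."""
    return {
        label.strip(): float(value)
        for label, value in TEMP_LINE.findall(sensors_output)
    }


def format_line(reading, stamp):
    """Build one tab-separated log line: timestamp, then label=value pairs."""
    fields = [f"{label}={value:.1f}" for label, value in sorted(reading.items())]
    return "\t".join([stamp] + fields)


def log_forever(path="temps.log", interval_seconds=60):
    """Append one timestamped reading to `path` every `interval_seconds`."""
    while True:
        result = subprocess.run(
            ["sensors"], capture_output=True, text=True, check=True
        )
        stamp = datetime.now().isoformat(timespec="seconds")
        with open(path, "a") as log:
            log.write(format_line(parse_temperatures(result.stdout), stamp) + "\n")
        time.sleep(interval_seconds)
```

Calling log_forever() at the bottom of the file and leaving it running in the
background should give you one line per minute. Since each line is flushed to
disk when the file is closed, the last entry before a crash shows how hot
things were shortly before the box went down.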
Adrian