From: Chris Puttick (c.puttick_at_oxfordarch.co.uk)
Date: Mon Oct 09 2006 - 14:24:19 CEST
Date: Mon, 9 Oct 2006 13:24:19 +0100
Message-ID: <C9218AFE67211F40913B591EF1337D6B01898137@servermail2.janus2.com>
From: "Chris Puttick" <c.puttick@oxfordarch.co.uk>
Subject: [suse-sles-e] SLES9 Sun X4100 system freeze
Hi all
We are having a problem with a critical SLES server: once in a while it stops. It is a full stop without warning: no error messages, no preceding slowdown, no resources hitting limits. The device still pings on its main interface but not on the secondaries, which I assume is related to the out-of-band management capabilities of the hardware, but it does not output anything to VGA or respond to keyboard, SSH, telnet, etc.
The hardware is a Sun X4100 (dual AMD dual-core, 8 GB RAM), running SLES 9 x64 SP3 and VMware Server with 6 or so VMs (see how critical this one is...). The failures have been irregular, with fairly lengthy periods between them: the last two were 6 days apart, and the previous one was a few weeks before that. The only consistent issue we can identify is the last entries in the messages log immediately before the failure:
Sep 30 10:42:34 BRILL kernel: hda: irq timeout: status=0xd0 { Busy }
Sep 30 10:42:34 BRILL kernel: hda: irq timeout: error=0xd0LastFailedSense 0x0d
Sep 30 10:43:04 BRILL kernel: hda: ATAPI reset timed-out, status=0x80
Sep 30 10:43:34 BRILL kernel: ide0: reset timed-out, status=0x80
Sep 30 10:43:34 BRILL kernel: hda: status timeout: status=0x80 { Busy }
Sep 30 10:43:34 BRILL kernel: hda: status timeout: error=0x80LastFailedSense 0x08
Sep 30 10:43:34 BRILL kernel: hda: drive not ready for command
Sep 30 10:44:04 BRILL kernel: hda: ATAPI reset timed-out, status=0x80
Oct 6 09:32:47 BRILL kernel: hda: irq timeout: status=0xd0 { Busy }
Oct 6 09:32:47 BRILL kernel: hda: irq timeout: error=0xd0LastFailedSense 0x0d
Oct 6 09:33:17 BRILL kernel: hda: ATAPI reset timed-out, status=0x80
Oct 6 09:33:47 BRILL kernel: ide0: reset timed-out, status=0x80
Oct 6 09:33:47 BRILL kernel: hda: status timeout: status=0x80 { Busy }
Oct 6 09:33:47 BRILL kernel: hda: status timeout: error=0x80LastFailedSense 0x08
Oct 6 09:33:47 BRILL kernel: hda: drive not ready for command
Oct 6 09:34:17 BRILL kernel: hda: ATAPI reset timed-out, status=0x80
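For anyone wanting to check their own logs for the same pattern, here is a minimal Python sketch that pulls hda/ide0/ATAPI kernel messages out of a syslog-style file and groups them into incidents. The function name, the 10-minute grouping gap and the year argument are my own assumptions (classic syslog timestamps carry no year), not anything taken from the system above.

```python
import re
from datetime import datetime, timedelta

# Matches classic syslog kernel lines such as:
#   Sep 30 10:42:34 BRILL kernel: hda: irq timeout: status=0xd0 { Busy }
LINE_RE = re.compile(
    r"^(?P<stamp>[A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s+"
    r"(?P<host>\S+)\s+kernel:\s+(?P<msg>.*)$"
)
# Messages of interest: the drive, its IDE channel, or ATAPI resets.
DEVICE_RE = re.compile(r"\b(hda|ide0|ATAPI)\b")


def find_incidents(lines, year=2006, gap=timedelta(minutes=10)):
    """Group matching kernel messages into incidents.

    A new incident starts whenever a matching message arrives more than
    `gap` after the previous matching message. `year` is supplied by the
    caller because the log timestamps do not include one.
    """
    incidents = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m or not DEVICE_RE.search(m.group("msg")):
            continue
        when = datetime.strptime(
            f"{year} {m.group('stamp')}", "%Y %b %d %H:%M:%S"
        )
        if incidents and when - incidents[-1][-1][0] <= gap:
            incidents[-1].append((when, m.group("msg")))
        else:
            incidents.append([(when, m.group("msg"))])
    return incidents
```

Typical use would be to pass an open log file (for example the messages log, wherever it lives on your system) to find_incidents() and print the first and last timestamp of each group, which makes it easy to line the incidents up against the times the machine froze.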
/dev/hda is the DVD-ROM drive. However, the drive was empty on both occasions and there was no (user) reason for it to be accessed. The log was last rotated on 20/09/2006, and these are the only entries in the log referencing hda or ATAPI. The other possibly related information is the output of mount, which shows two devices mounted as optical drives: the DVD drive proper (/dev/hda) and an AMI Virtual CDROM (/dev/sr0):
/dev/sda2 on / type reiserfs (rw,acl,user_xattr)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
tmpfs on /dev/shm type tmpfs (rw)
devpts on /dev/pts type devpts (rw,mode=0620,gid=5)
/dev/sr0 on /media/cdrom type subfs (ro,nosuid,nodev,fs=cdfss,procuid,iocharset=utf8)
/dev/hda on /media/dvd type subfs (ro,nosuid,nodev,fs=cdfss,procuid,iocharset=utf8)
usbfs on /proc/bus/usb type usbfs (rw)
/dev/sr0 on /media/usb-AmericanMegatrendsInc-VirtualCdromDevice:0:0:0 type subfs (ro,nosuid,nodev,fs=cdfss,procuid,iocharset=utf8)
/dev/sdj on /eqlsan type ext2 (rw)
There is and has been no disc actually in the drive for over a month.
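To make the subfs situation easier to see, the mount output can be grouped by device. This is only a sketch: the function name and the optional filesystem filter are illustrative choices of mine, and the regular expression assumes the plain "DEVICE on MOUNTPOINT type FSTYPE (OPTIONS)" layout shown above.

```python
import re
from collections import defaultdict

# One line of `mount` output: DEVICE on MOUNTPOINT type FSTYPE (OPTIONS)
MOUNT_RE = re.compile(
    r"^(?P<dev>\S+) on (?P<mnt>.+) type (?P<fs>\S+) \((?P<opts>[^)]*)\)$"
)


def mounts_by_device(mount_output, fstype=None):
    """Map each device to the list of mount points it appears on.

    If `fstype` is given, only entries of that filesystem type are kept
    (e.g. "subfs" for the removable-media entries).
    """
    table = defaultdict(list)
    for line in mount_output.splitlines():
        m = MOUNT_RE.match(line.strip())
        if m and (fstype is None or m.group("fs") == fstype):
            table[m.group("dev")].append(m.group("mnt"))
    return dict(table)
```

Run against the output above with fstype="subfs", this reports /dev/hda on /media/dvd and /dev/sr0 on both /media/cdrom and the /media/usb-AmericanMegatrendsInc-VirtualCdromDevice:0:0:0 mount point.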
All suggestions for the source of the problem and/or a solution are warmly welcomed!
Regards
Chris
Chris Puttick
CIO
Oxford Archaeology: Exploring the Human Journey
Direct: +44 (0)1865 263 818
Switchboard: +44 (0)1865 263 800
Mobile: +44 (0)7914 402 907
http://thehumanjourney.net