
Tracking down instability and recovering FreeBSD systems

by kaeru last modified 2008-01-17 12:11

gambit, our main server, is finally stable again after several days of instability that followed a memory upgrade to 2GB. The nature of the problems makes me strongly suspect a hardware issue in the storage path (possibly the chipset's memory or disk controllers). This server was bought on a tight budget a few years back. It is extremely rare for a FreeBSD point release to show random instability when it isn't under heavy load, unless you're doing something silly like compiling the kernel with experimental features and CFLAGS of -O88 -f14m1337.

At the data centre we found that gmirror was causing a kernel trap 9 on reboot once the server had started having stability issues and random processes were dumping core. I'm not sure why yet at this stage. On a full disk mirror setup, the gmirror module is loaded from /boot/loader.conf, so to change that you will need to boot into fixit mode from the CD.
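As a rough sketch of that step: the loader entry in question is normally `geom_mirror_load="YES"`, and from the fixit shell it can be commented out once the root file system is mounted. The helper name `disable_gmirror_load` and the `/mnt` mount point in the usage line are assumptions for illustration, not something taken from this server.

```shell
#!/bin/sh
# Sketch: comment out the gmirror loader entry in a loader.conf file.
# disable_gmirror_load is a hypothetical helper name.
# Example use from a fixit shell, assuming root is mounted on /mnt:
#   disable_gmirror_load /mnt/boot/loader.conf
disable_gmirror_load() {
    conf="$1"
    # Keep a copy so the change can be reverted later.
    cp "$conf" "$conf.bak" || return 1
    # Prefix any active geom_mirror_load line with '#'; other lines pass through.
    sed 's/^[[:space:]]*geom_mirror_load=/#&/' "$conf.bak" > "$conf"
}
```

The original file is kept next to it as `loader.conf.bak`, so copying it back restores the mirror at the next boot.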

For those not familiar with rescuing FreeBSD systems, the first install disc has a live fixit file system, which you can access from the sysinstall installation menu. It gives you a variety of recovery tools and network access, so you can dig around, mount file systems (including external drives) and back up vital data before you try to recover the system.

Disabling gmirror led to a reboot loop, even with a correct fstab. If you want to get the system back to a basic install, including the GENERIC kernel, choose upgrade and map your mount points. If you haven't already backed it up, it will back up your /etc to /var/tmp/etc. This got the machine booting normally again. A quick recompile of the kernel for firewall options, and the server is back up again without gmirror.
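For reference, a "correct fstab" after disabling gmirror means the entries point at the underlying disk instead of the mirror device. A minimal sketch of that rewrite follows; the helper name `fstab_unmirror`, the mirror name `gm0` and the disk name `ad4` are placeholders, not the names used on this server. As noted above, this alone did not stop the reboot loop here.

```shell
#!/bin/sh
# Sketch: rewrite fstab entries from a gmirror device to the underlying disk.
# fstab_unmirror is a hypothetical helper; gm0 and ad4 below are placeholders.
# Example use from a fixit shell, assuming root is mounted on /mnt:
#   fstab_unmirror /mnt/etc/fstab gm0 ad4
fstab_unmirror() {
    fstab="$1"
    mirror="$2"
    disk="$3"
    # Keep a copy so the mirror entries can be restored.
    cp "$fstab" "$fstab.bak" || return 1
    # Only lines whose device field starts with /dev/mirror/<name> are changed.
    sed "s|^/dev/mirror/${mirror}|/dev/${disk}|" "$fstab.bak" > "$fstab"
}
```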

So far it looks like it's running fine, with no issues such as random crashes of processes.

Puzzling instability

It has been most puzzling why upgrading to 2GB of RAM (from 1GB) would suddenly cause instability. The key suspect would of course be the RAM, but overnight testing with memtest86 revealed no errors. Everything has been set up as before, and with that setup the server never crashed except because of the USB drive (which has since been removed).

The kernel panics pointed to a possible issue with gmirror, but testing outside the data centre, including multiple reboots and resets, did not result in any unrecoverable errors or kernel panics. Removing gmirror did solve the problems, but that isn't a scientific explanation.

I did further testing at home on an even more heavily loaded development server. This server has 2x80GB and 2x250GB gmirrored drives and 5 md-mounted image files for jails. I ran a stress test of the file system: multiple reads/writes through port updates, a gnome build, locate updatedb, file search, make buildworld -j4, and normal use (file server for music etc.), all simultaneously. No problems. That is to be expected, as probably thousands of people are using gmirror in production systems.
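The test above was done with real workloads. For anyone who wants a quick synthetic check first, here is a much smaller stand-in, not the method used above: several workers write random files in parallel, copy them, and compare the copies. The function name `stress_fs` and its arguments are made up for this sketch, and since the comparison may be served from the buffer cache, a clean run only says the basic write/copy path is sane.

```shell
#!/bin/sh
# Sketch: parallel write/copy/compare loop on a target directory.
# stress_fs is a hypothetical helper: stress_fs DIR WORKERS ROUNDS
# Returns 0 if no worker saw a mismatch, non-zero otherwise.
stress_fs() {
    dir="$1"
    workers="$2"
    rounds="$3"
    mkdir -p "$dir" || return 1
    rm -f "$dir/errors.log"
    i=1
    while [ "$i" -le "$workers" ]; do
        (
            r=1
            while [ "$r" -le "$rounds" ]; do
                src="$dir/w$i.r$r.src"
                dst="$dir/w$i.r$r.dst"
                # Write 1MB of random data, copy it, then compare byte for byte.
                dd if=/dev/urandom of="$src" bs=65536 count=16 2>/dev/null
                cp "$src" "$dst"
                if ! cmp -s "$src" "$dst"; then
                    echo "MISMATCH worker=$i round=$r" >> "$dir/errors.log"
                fi
                rm -f "$src" "$dst"
                r=$((r + 1))
            done
        ) &
        i=$((i + 1))
    done
    # Wait for every background worker before checking the log.
    wait
    [ ! -s "$dir/errors.log" ]
}
```

Something like `stress_fs /var/tmp/stress 4 100` on the mirrored file system, run alongside normal use, is the intended way to call it; the path and counts are arbitrary.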

The only similar thing I've seen was caused by a faulty network card (hardware). At this stage, I suspect the same kind of fault in the SiS 760MG chipset motherboard, which doesn't even support ECC memory.

As long as it stays stable, I'm going to hold off on getting a new server for now, with the second drive holding full backups. The server has very little load even when serving multiple Zope/Plone sites and virtual servers. The plan is to move to a proper quad-core Opteron with 4-8GB of RAM from Dell, HP or Sun, coinciding with the release of FreeBSD 7.1 later this year.

