How to fix file system corruption occurring under disk load

Recently one of our servers has been suffering from random file system corruption. Fortunately, so far e2fsck has been able to fix the problem, but it was causing some concern nonetheless.

When we analysed the problem, we found that it occurred when the system's disks were put under heavy load. For example, we wrote a script to continually create a .tar.gz archive and then check its CRC. After only a short time, CRC errors would be reported.
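A test of this kind can be sketched in Python as follows. This is a minimal illustration rather than the original script: the function names, the payload size and the iteration count are all assumptions made for the example. It writes random data, archives it as a .tar.gz, reads the whole gzip stream back so that the stored CRC-32 is checked, and also compares a SHA-256 digest of the extracted data against the original.

```python
import gzip
import hashlib
import os
import tarfile
import tempfile
import zlib

PAYLOAD_NAME = "payload.bin"  # hypothetical file name used inside the archive


def make_payload(directory, size=1 << 20):
    """Write `size` random bytes to a file; return (path, SHA-256 hex digest)."""
    data = os.urandom(size)
    path = os.path.join(directory, PAYLOAD_NAME)
    with open(path, "wb") as fh:
        fh.write(data)
    return path, hashlib.sha256(data).hexdigest()


def create_archive(payload_path, archive_path):
    """Pack the payload into a gzip-compressed tar archive."""
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(payload_path, arcname=PAYLOAD_NAME)


def verify_archive(archive_path, expected_digest):
    """Return True if the gzip CRC check passes and the contents match."""
    try:
        # Reading to end of stream makes the gzip module verify the CRC-32.
        with gzip.open(archive_path, "rb") as gz:
            while gz.read(1 << 16):
                pass
        # Independently compare the extracted bytes with the original digest.
        with tarfile.open(archive_path, "r:gz") as tar:
            member = tar.extractfile(PAYLOAD_NAME)
            digest = hashlib.sha256(member.read()).hexdigest()
    except (OSError, EOFError, zlib.error, tarfile.TarError, KeyError):
        return False
    return digest == expected_digest


def run_stress(iterations=20, size=1 << 20):
    """Repeat the create/verify cycle; return the number of failed checks."""
    failures = 0
    with tempfile.TemporaryDirectory() as workdir:
        payload, digest = make_payload(workdir, size)
        archive = os.path.join(workdir, "test.tar.gz")
        for i in range(iterations):
            create_archive(payload, archive)
            if not verify_archive(archive, digest):
                failures += 1
                print(f"iteration {i}: archive failed verification")
    return failures


if __name__ == "__main__":
    print("failed checks:", run_stress())
```

On a healthy machine every check should pass. To load the disks, the temporary directory would need to be on the file system under suspicion (for example by setting TMPDIR), and the payload size and iteration count would need to be raised well beyond these small defaults. An archive read back straight after writing will usually be served from the kernel's cache rather than from the disk, so a failure here does not by itself show which component is at fault.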

We use Linux software RAID-1 with ext3 file systems, and have always found this combination to be rock-solid. The disks themselves weren't showing any of the usual signs of failure, such as seek errors in kern.log or noisy operation, so our suspicions turned to our Promise TX-2 SATA controller. Replacing the controller, however, did not help.

In the end it turned out the problem was caused by faulty memory. It showed up as file system corruption because the kernel uses system memory to cache the file system, so data damaged in memory can end up being written back to disk. Running memcheck for a period of time confirmed that one of the DIMMs was faulty, and replacing it has corrected the problem.

So, if you're having problems with file system corruption, first design a test that reliably recreates the problem, then strip out all but one DIMM and check again. Repeat with each DIMM in turn until the faulty one is identified.


Article from Andy's Debian HOWTOs (http://www.besy.co.uk/debian/debian)

 
debian/how_to_fix_random_crashes_under_disk_load.txt · Last modified: 2008/08/01 23:56 (external edit) · [Old revisions]