Update 2012-02

In Debian, the official Linux kernel 3.2 comes with the memtest feature enabled.


Starting from version 2.6.26, the Linux kernel has an amazing but little-known feature - MEMTEST. Let's explore what it is and what it is good for.

Memtest, if properly activated, will test RAM before allocation and isolate corrupted memory regions (until the next restart). In a sense it is similar to the old badram patch, but it works automatically.

Why might we need this feature? Well, memory corruption is quite common. A broken memory module will eventually corrupt data or (if you're lucky) cause instability in the operating system, typically manifested as random errors.

Silent data corruption may go undetected for years. It is hard to diagnose because some servers cannot be stopped for 24 hours for routine memory testing. Even if testing is possible and you suspect problems, the damage may already have been done.

I have a 1 GiB memory module (taken from an old server) with just 1 byte broken - a good test subject.

First I ran memtest with only this module installed:

Then I booted the Debian GNU/Linux operating system and ran mprime, which identified an error within 2 hours of testing:

OK, the error is reproducible - let's see how mprime performs with MEMTEST activated in the kernel. First I built a custom kernel (the Debian way).

sudo aptitude install kernel-package zlib1g-dev libncurses5-dev fakeroot linux-source-2.6.32
sudo adduser my_username src  # this is to work in /usr/src as user
                              # it may be necessary to log off and log in again
cd /usr/src
tar xjvf linux-source-2.6.32.tar.bz2
cd linux-source-2.6.32
cp /boot/config-$(uname -r) .config

Edit .config to introduce 'CONFIG_MEMTEST=y' or run

make menuconfig 

And choose "Processor type and features" --> "Memtest"
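Whichever route you take, it may be worth confirming that the option actually landed in the configuration before starting a long build. The following is only a minimal sketch: memtest_config_state is a hypothetical helper name, and it assumes the option is stored as the literal line CONFIG_MEMTEST=y.

```shell
# Report whether a kernel config file has CONFIG_MEMTEST built in.
# memtest_config_state is a hypothetical helper, not a standard tool.
memtest_config_state() {
    config_file=$1
    # The option is assumed to appear as a line reading exactly CONFIG_MEMTEST=y.
    if grep -q '^CONFIG_MEMTEST=y$' "$config_file"; then
        echo enabled
    else
        echo disabled
    fi
}
```

From the source tree this could be called as 'memtest_config_state .config', or as 'memtest_config_state /boot/config-$(uname -r)' to inspect the kernel that is currently running.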

fakeroot make-kpkg --append-to-version=-memtest --revision=2.6.32a --initrd kernel_headers kernel_image
sudo dpkg -i linux-*memtest*.deb

Now, before restarting with the new kernel, the memtest feature has to be activated by a boot-time parameter in the boot loader:

memtest (to run all test patterns) or

memtest=N (where N is number of test patterns to apply - 1...17 or 0 to disable)

As you can learn from reading arch/x86/mm/memtest.c, an extensive set of patterns that tests every bit would be 16, so I simply passed 'memtest' to run all 17 tests.

(Being curious, the first time I ran mprime with memtest=4, but the error was still there, which shows that a small number of patterns may not be enough to isolate the problem.)

So I edited /etc/default/grub to set GRUB_CMDLINE_LINUX="memtest" and ran 'sudo update-grub' (you may want to check the regenerated /boot/grub/grub.cfg).
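After rebooting, one way to confirm that the parameter really reached the kernel is to look at the command line the kernel was started with. The sketch below is an illustration rather than a definitive tool: memtest_from_cmdline is a hypothetical helper name, and it simply splits the string on whitespace, so it does not handle quoted arguments.

```shell
# Print the memtest setting found in a kernel command line string:
# "all" for a bare 'memtest', the number N for 'memtest=N',
# and nothing if the parameter is absent.
# memtest_from_cmdline is a hypothetical helper, not a standard tool.
memtest_from_cmdline() {
    for word in $1; do
        case $word in
            memtest)   echo all ;;
            memtest=*) echo "${word#memtest=}" ;;
        esac
    done
}
```

On a running system it could be used as 'memtest_from_cmdline "$(cat /proc/cmdline)"'. An empty result would suggest that the boot loader entry was not regenerated or that a different entry was booted.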

Then (after rebooting into the memtest-enabled kernel) I saw evidence of memtest activity in dmesg:

[    0.000000] early_memtest: # of tests: 17...
[    0.000000]   0000008000 - 000009fc00 pattern 4c494e5558726c7a ...
[    0.000000] early_memtest: wipe out test pattern from memory
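To see whether memtest actually set any memory aside, the kernel log can be filtered for the lines it prints when a region fails a pattern. This is a hedged sketch: it assumes the kernel reports such regions with the text "bad mem addr", which is how I read arch/x86/mm/memtest.c in kernels of this era, so check the source of your own kernel if nothing matches. count_bad_regions is a hypothetical helper name.

```shell
# Count kernel log lines on stdin that report a reserved bad memory region.
# Assumption: the kernel logs such regions with the text "bad mem addr".
# count_bad_regions is a hypothetical helper, not a standard tool.
count_bad_regions() {
    # grep -c prints 0 and exits non-zero when nothing matches; '|| true'
    # keeps the helper usable in scripts that run with 'set -e'.
    grep -c 'bad mem addr' || true
}
```

A typical call would be 'dmesg | count_bad_regions'. Replacing the helper body with a plain grep for the same text shows the address ranges themselves.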

I ran the mprime test for a week with no problems whatsoever. This was fantastic; I knew this would be real protection against RAM corruption for my storage system. But what if I took this testing to the extreme by trying another memory module, so badly damaged that it has 1161 errors:

On a normal kernel, mprime fails in less than half an hour with this RAM. So I booted the memtest-enabled kernel, and mprime failed after 21 hours of testing.

This was unexpected - how could it happen if all the damaged RAM had been isolated by extensive 17-pattern testing of every bit? I ran memtest again and it found 1162 errors, 1 more than before. Evidently another byte of memory degraded during the 21-hour mprime run. The error wasn't isolated because a memtest-enabled kernel tests memory once at boot, before allocation, so the problematic region had already been tested and had passed. Still, thanks to memtest, mprime ran more than 40 times longer before it was hit by a single error rather than by all 1161.

I find this feature quite useful on computers processing important data where data corruption is not an option, especially on a file server. You may not notice data or memory corruption, but with the memtest kernel feature activated the system will be much safer. For over 8 months since this experiment I have had memtest-enabled kernels running on two desktops and one server - surprisingly, I can't feel any difference in performance between memtest and non-memtest kernels.

Unfortunately, real-life usage of this feature is limited because the kernel has to be rebuilt.

Conclusion:

MEMTEST is a great Linux kernel feature, useful for minimising damage from corrupted memory modules. With no noticeable overhead it may reduce or eliminate data corruption and significantly improve the stability and robustness of mission-critical systems.