Sometimes we use ordinary hardware to conduct tests. It helps us understand the various kinds of failures that may be less likely to occur with industrial-quality hardware.

When we began experimenting with NILFS2, the computer dedicated to the task regularly hung for no apparent reason. It soon became clear that the sudden hangs followed heavy disk IO, a pattern typical of NILFS2 because its background cleaner runs most of the time. nilfs_cleanerd often works continuously for days, rewriting all the data in the partition to reclaim free space.

Various system components were suspected, and over a period of several months nearly everything was replaced without any effect on stability. The system continued to hang after every few hours of heavy IO. We estimate that we experienced over 60 hangs in which the system stopped responding without logging anything suspicious beforehand.

Finally, the on-board SATA controllers were switched from AHCI to native mode. That restored stability, and the system was able to sustain continuous disk IO for weeks. However, some interesting records started to appear in /var/log/messages during disk IO operations:

kernel: [...] ata10: hard resetting link

Further investigation revealed that poor-quality SATA cables were the cause of those log entries. After the cables were replaced with high-quality SilverStone CP03 SATA cables, no more problems were reported.
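
To find out which port, and therefore which cable, is affected, one could tally log entries like the one above per ATA port. The following is only a minimal sketch in Python, assuming syslog lines of the form shown above; the function name count_link_resets and the regular expression are illustrative and were not part of our original investigation.

```python
import re
from collections import Counter

# Matches kernel messages such as "kernel: [...] ata10: hard resetting link"
# and captures the port name ("ata10"). The pattern is an assumption based
# on the single log line quoted in the text.
RESET_RE = re.compile(r"kernel:.*?\b(ata\d+(?:\.\d+)?): hard resetting link")


def count_link_resets(lines):
    """Return a Counter mapping ATA port name to its number of hard link resets."""
    counts = Counter()
    for line in lines:
        match = RESET_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

The function accepts any iterable of lines, so an open handle on /var/log/messages can be passed to it directly (the log path differs between distributions). Calling most_common() on the result lists the ports with the most resets first.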

Conclusion:

One of the benefits of software RAID over hardware RAID is better stability and reporting, as a controller in AHCI mode may hide problems from the operating system. Although industrial-quality hardware may not exhibit the problem, the Linux kernel gracefully reported the errors and continued to work properly when the hardware silently failed.