Thursday, March 6, 2008

Reliable NAS on Linux is HARD!

Setting up a high-end cluster? That should be easy.


When we first started developing CTDB and clustered Samba, we thought that if we just got CTDB and Samba working, everything else would be a breeze.
Boy, were we wrong.
Getting all the components of Linux to work reliably, and figuring out HOW to configure Linux and its subsystems so that they work reliably, is one of the most difficult tasks, and one we spend a lot of time on in the SOFS team (IBM Scale Out File Services).
This is important because our customers want to know that they are running a configuration that works and that has been qualified. It is even more important because a naive implementation using stock default Linux configurations will likely have "issues".


NEVER assume that anything is mature or that it works. In particular, DON'T assume it if your data depends on it.

You must TEST, TEST, TEST, TEST, and finally TEST some more that everything works and that every component can handle a high load on your high-performance system.

Don't even try just slapping something together if you intend to store any business-critical data on it.
Make sure that ALL components and ALL configurations you use are tested and qualified for your usage pattern!
(If you use SOFS you can sleep better at night because we have already done all these tests and qualifications for you.)

What have we experienced?

HBA drivers. An HBA driver may look mature and may look solid, but do you know whether the driver's developer tests and qualifies it for use with YOUR cluster filesystem?
In our case we found that you must be VERY careful and change the default configuration of the HBA and SCSI subsystems to match the usage patterns of your cluster filesystem. If you keep the defaults, bad things happen; a sketch of the kind of tuning involved follows below.
(You don't want to learn about these problems when you are in production. At that stage it is too late.)
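
To make that concrete, here is a minimal Python sketch of the sort of per-device tuning we mean. The sysfs attributes (/sys/block/sd*/device/timeout and queue_depth) are standard Linux interfaces, but the values below are made up for illustration; the numbers you actually need must come out of qualifying YOUR HBA against YOUR cluster filesystem.

    import glob
    import pathlib

    # Illustrative only: run as root.  These sysfs knobs are real Linux
    # interfaces, but the values are hypothetical placeholders.
    SCSI_TIMEOUT_S = 45   # hypothetical per-command timeout, in seconds
    QUEUE_DEPTH = 16      # hypothetical per-device queue depth

    for dev in glob.glob("/sys/block/sd*/device"):
        d = pathlib.Path(dev)
        (d / "timeout").write_text(str(SCSI_TIMEOUT_S))
        qd = d / "queue_depth"
        if qd.exists():   # not every device exposes a queue depth
            qd.write_text(str(QUEUE_DEPTH))
        print(f"{dev}: timeout={SCSI_TIMEOUT_S}s, queue_depth={QUEUE_DEPTH}")

On a real system you would persist qualified settings through udev rules or the HBA driver's module options rather than a one-off script like this.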



Linux kernel and real-time signals/async I/O. I don't really know how well tested the stock kernels are with respect to high-stress use of these features. I DO know that, with the default config settings in stock kernels, it is reasonably easy to bring the entire real-time-signal layer down in such a manner that you need a full-blown system reboot to recover.
Not fun.
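
To give a feel for what that kind of stress looks like, here is a toy Python sketch (NOT our actual test suite) that deliberately pushes past the per-user queue of pending real-time signals, governed by RLIMIT_SIGPENDING on Linux, and counts how many signals actually survive. All the numbers are illustrative.

    import os
    import resource
    import signal

    soft, hard = resource.getrlimit(resource.RLIMIT_SIGPENDING)
    print(f"RLIMIT_SIGPENDING: soft={soft}, hard={hard}")

    RTSIG = signal.SIGRTMIN
    signal.pthread_sigmask(signal.SIG_BLOCK, {RTSIG})  # blocked RT signals queue up

    # Aim well past the soft queue limit so some signals cannot be queued.
    N = int(soft) * 2 if soft != resource.RLIM_INFINITY else 100_000

    pid = os.fork()
    if pid == 0:
        ppid = os.getppid()
        for _ in range(N):
            try:
                os.kill(ppid, RTSIG)
            except OSError:
                pass  # the kernel may refuse the send once the queue is full
        os._exit(0)
    os.waitpid(pid, 0)

    received = 0
    while signal.sigtimedwait({RTSIG}, 0.1) is not None:  # drain the queue
        received += 1
    print(f"sent {N}, dequeued {received}, lost {N - received}")

On a default configuration you can expect the "lost" count to be non-zero; an application that assumes every queued signal arrives is in for a surprise.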


Cluster filesystems and coherent file locking.

Most cluster filesystems out there seem never to have really been tested under high lock contention, where many, many processes take byte-range locks on the same file at the same time.
We use GPFS, and we use a customized GPFS configuration that is qualified for SOFS.
Don't use the defaults! Bad things will happen.
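
As an illustration of the kind of contention involved (a toy sketch, not the SOFS qualification suite), the following Python program forks a herd of workers that fight over byte-range locks on one shared file. The path, worker count, and iteration counts are arbitrary; point PATH at a file on the filesystem under test and run it on several nodes at once.

    import fcntl
    import multiprocessing
    import os
    import struct
    import time

    PATH = "/gpfs/locktest.dat"   # hypothetical mount of the filesystem under test
    WORKERS = 64
    ITERATIONS = 10_000
    RANGES = 128                  # distinct 8-byte ranges the workers contend on

    def worker(wid: int) -> None:
        fd = os.open(PATH, os.O_RDWR | os.O_CREAT, 0o644)
        for i in range(ITERATIONS):
            off = (i % RANGES) * 8
            fcntl.lockf(fd, fcntl.LOCK_EX, 8, off, os.SEEK_SET)  # blocking byte-range lock
            os.pwrite(fd, struct.pack("q", wid), off)            # touch the locked range
            fcntl.lockf(fd, fcntl.LOCK_UN, 8, off, os.SEEK_SET)
        os.close(fd)

    if __name__ == "__main__":
        start = time.time()
        procs = [multiprocessing.Process(target=worker, args=(w,)) for w in range(WORKERS)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        total = WORKERS * ITERATIONS
        print(f"{total} lock/write/unlock cycles in {time.time() - start:.1f}s")

If the filesystem's distributed lock manager has never seen this pattern before, this is where you find out.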


Kernel oplocks and leases.
This is another area where you need to be very careful and configure things exactly right.
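
For readers unfamiliar with the mechanism: on Linux, kernel oplocks are built on file leases (F_SETLEASE), where the kernel notifies the lease holder with a signal when a conflicting open happens. Here is a hedged, Linux-only Python sketch of that primitive; the path and choice of signal are arbitrary, and the way Samba actually wires this up is considerably more involved.

    import fcntl
    import os
    import signal

    PATH = "/tmp/lease-demo"      # illustrative path; leases are Linux-only
    fd = os.open(PATH, os.O_RDONLY | os.O_CREAT, 0o644)

    # Ask for lease-break notifications on a real-time signal instead of SIGIO.
    fcntl.fcntl(fd, fcntl.F_SETSIG, signal.SIGRTMIN)
    signal.signal(signal.SIGRTMIN,
                  lambda signum, frame: print("lease broken: flush state, then downgrade"))

    fcntl.fcntl(fd, fcntl.F_SETLEASE, fcntl.F_RDLCK)   # take a read lease
    print("holding a read lease; open the file for writing to break it")
    signal.pause()                                      # wait for the break signal
    fcntl.fcntl(fd, fcntl.F_SETLEASE, fcntl.F_UNLCK)   # release before the kernel timeout

If the lease holder does not respond in time, the kernel breaks the lease for it after /proc/sys/fs/lease-break-time seconds, which is exactly the kind of default you want to have thought through beforehand.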



Kernel modifications and patches.
A lot of the subsystems used by a high-end NAS application exercise parts of the kernel that have previously only seen light load and light testing. There are a number of kernel modifications that are required but are not yet in the distros. For example, a stock Linux distro and kernel probably has no hope at all of integrating with HSM in any meaningful way. It may look like it works, but sooner or later you will discover the parts that break.


Don't assume that just throwing some components and applications together will create a "solution".
It won't. Trust me, it will not work. Unless you know exactly how to configure all the components so that they are fully compatible with each other's usage patterns, I can guarantee that a stock Linux distro using stock default configs has nasty surprises waiting for you.


Don't play games with your data.
Make sure that ALL components in your solution are qualified to work together.