Tuesday, June 24, 2008

WAN-accelerator for NFS

I've been working a bit on WAN-acceleration for CTDB, actually with two different approaches for two different purposes.

WAN-accelerator #1 (general purpose)

The first approach was to add new "capabilities" to the CTDB daemon so that you can have a cluster of CTDB nodes where some nodes are located at a very remote site, across a high-latency WAN-link. This was tricky to solve: even though the remote nodes participate synchronously in the cluster, you do not want the high WAN-link latency to affect performance on the nodes in the main datacentre.

Initial tests seem to indicate that it works quite well. Surprisingly well.

But this is not really a WAN-accelerator. A classic WAN-accelerator is more a device that performs a man-in-the-middle attack on the CIFS/NFS protocols and performs some (sometimes unsafe) caching.

In the CTDB approach above there is no man-in-the-middle attack, nor is it really a WAN-accelerator.
It is conceptually more like one single multihomed CIFS server where one of the NICs (the remote site) happens to be a few hundred ms away. Thus we don't have to play any tricks or do any questionable caching. We are still a single CIFS server with 100% correct CIFS semantics; it's just that this CIFS server is spread out across multiple sites.

I.e. clients on the remote site talk to the genuine, real CIFS server, not a man-in-the-middle impostor that may or may not provide correct semantics.

WAN-accelerator #2 (NFS)

A different solution was based on FUSE and providing very aggressive caching of data and metadata for NFS. This one also seems to perform really well but is obviously less cool than "a single multihomed cifs server spanning multiple sites".
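
To make "aggressive caching of metadata" a bit more concrete, here is a minimal sketch of a time-based attribute cache in Python. This is not the actual FUSE code: the class name `AttrCache`, the `ttl` parameter and the `fetch` callback are all made up for illustration. The idea is only that repeated lookups within the TTL are answered locally and never cross the WAN link, at the price of possibly serving stale attributes.

```python
import time


class AttrCache:
    """Toy time-based metadata cache (hypothetical, not the real FUSE module)."""

    def __init__(self, fetch, ttl=30.0, clock=time.monotonic):
        self._fetch = fetch      # slow call that goes across the WAN
        self._ttl = ttl          # seconds an entry may be served from cache
        self._clock = clock      # injectable so the sketch can be tested
        self._entries = {}       # path -> (expiry time, attributes)
        self.hits = 0
        self.misses = 0

    def getattr(self, path):
        now = self._clock()
        entry = self._entries.get(path)
        if entry is not None and entry[0] > now:
            self.hits += 1
            return entry[1]
        # Expired or never seen: pay the WAN round trip once and remember it.
        self.misses += 1
        attrs = self._fetch(path)
        self._entries[path] = (now + self._ttl, attrs)
        return attrs

    def invalidate(self, path):
        # Drop an entry, e.g. after a local write to that path.
        self._entries.pop(path, None)
```

The longer the TTL, the fewer round trips and the more "unsafe" the cache becomes, which is exactly the trade-off that makes this approach less clean than a single multihomed server.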

Tuesday, May 6, 2008


CTDB has had event scripts to support managing iSCSI target service for a while.
These event scripts are designed for use with the STGT iSCSI target.

Why pick STGT when there are so many different iSCSI targets available for Linux?
Well, STGT is the one that comes by default with RHEL5 and is also what I use with Ubuntu.
It also comes with decent emulation of SBC (SCSI block command set, to emulate a hard disk), MMC (multimedia command set, to emulate a DVD drive) and SMC (media changer, to emulate a robot/jukebox).

While which iSCSI solution is "best" is a never-ending source of controversy on lkml, I picked STGT.

To me STGT is attractive since it does all the SCSI processing in userspace and is very simple and easy to enhance. I personally don't like it when network services run inside kernel space, and I have many times voiced loud opinions along the lines of "why does the $%**@! NFS lock manager run inside the kernel, making it so difficult to fix NLM bugs?" when it could (and should) run much better in userspace, as it does on all other platforms, and would be so much more serviceable to me as a user.

Why would someone want iSCSI and hard disk emulation with CTDB? Isn't CTDB just something to build a (VERY fast and VERY resilient) NAS server using Samba?
iSCSI is block I/O, so why use block I/O services on a NAS device?

Many people that use a CIFS NAS service, and in particular an expensive high-end CIFS NAS server (such as CTDB/Samba), often have a large number of Windows clients that they want to connect to CTDB/Samba.
But if you have a large number of Windows clients, you probably also use Exchange, and while you can't really put Exchange databases on a NAS share, you can put these databases on an iSCSI LUN.
Since the CTDB/Samba NAS server is likely to be the fastest and most expensive storage device you have inside this hypothetical datacentre (and also one of the more fault-resilient ones), it would be very attractive to also store a critical application like Exchange's databases on this device.
It is a great value-add to the very expensive NAS box you just installed if you can also store the data for the critical Exchange application on it.

Do try STGT out. It is quite cool and works really well.

Thursday, March 6, 2008

Reliable NAS on linux is HARD!

Setting up a high-end cluster, that should be easy.

When we first started developing CTDB and clustered Samba we thought that,
well, if we just get CTDB and Samba to work then everything else should just be a breeze.
Boy, were we wrong.
Getting all components of Linux to work reliably, and figuring out HOW to configure Linux and its subsystems so that they work reliably, is one of the most difficult tasks and one we spend a lot of time on in the SOFS (IBM Scale Out File Services) team.
This is important since our customers want to know that they use a configuration that works and that is qualified. It is even more important since a naive implementation using stock default Linux configurations will likely have "issues".

NEVER assume that anything is mature or works. In particular, DON'T if your data depends on it.

You must TEST TEST TEST TEST and finally TEST some more that everything works and that all components can handle a high load on your highly performing system.

Don't even try just slapping something together if you intend to store any business-critical data on it.
Make sure that ALL components and ALL configurations you use are tested and qualified for your use pattern!
(If you use SOFS you can sleep better at night because we have already done all these tests and qualifications for you.)

What have we experienced?

HBA drivers. While an HBA driver may look mature and solid, do you know whether the HBA driver developer is testing and qualifying the driver for use with YOUR cluster filesystem?
In our case we found that you must be VERY careful and change the default config for the HBA and SCSI subsystem to match the use patterns of your cluster filesystem. Bad things happen if you use the defaults.
(You don't want to learn about these problems when you are in production. At that stage it is too late.)

Linux kernel and real-time signals/async I/O. I don't really know how well tested the stock kernels are with respect to high-stress testing of these features. I DO know that, using the default config settings in stock kernels, it is reasonably easy to bring the entire real-time-signal layer in the distro kernels down in such a manner that you need a full-blown system reboot to recover.

Not fun.

Cluster filesystems and coherent filelocking.

Most cluster filesystems out there seem to never really have been tested for high lock contention, where a great many processes do byte-range locking of the same file at the same time.
We use GPFS, and we use a customized configuration for GPFS that is qualified for SOFS.
Don't use the defaults! Bad things will happen.
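
To illustrate the kind of test I mean, here is a small sketch in Python of a lock-contention stress test. It is not our qualification suite, and the names `hammer_counter` and `demo` are invented for this example. Several processes repeatedly take an exclusive POSIX byte-range lock on the same 8 bytes of the same file, increment a counter stored there, and unlock. If locking is coherent, no increments are lost.

```python
import fcntl
import os
import struct
import tempfile


def hammer_counter(path, workers=8, iterations=200):
    """Fork `workers` processes that all lock, increment and unlock the same
    8-byte counter at offset 0 of `path`. Returns the final counter value."""
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", 0))

    pids = []
    for _ in range(workers):
        pid = os.fork()
        if pid == 0:
            # Child: contend for an exclusive lock on bytes 0..7.
            status = 1
            try:
                fd = os.open(path, os.O_RDWR)
                for _ in range(iterations):
                    fcntl.lockf(fd, fcntl.LOCK_EX, 8, 0, os.SEEK_SET)
                    value = struct.unpack("<Q", os.pread(fd, 8, 0))[0]
                    os.pwrite(fd, struct.pack("<Q", value + 1), 0)
                    fcntl.lockf(fd, fcntl.LOCK_UN, 8, 0, os.SEEK_SET)
                status = 0
            finally:
                os._exit(status)  # never fall back into the parent's code
        pids.append(pid)

    for pid in pids:
        _, status = os.waitpid(pid, 0)
        if status != 0:
            raise RuntimeError("worker %d failed" % pid)

    with open(path, "rb") as f:
        return struct.unpack("<Q", f.read(8))[0]


def demo():
    # Runs against a scratch file on local disk; pass a path on the cluster
    # filesystem to hammer_counter() to test that instead.
    with tempfile.TemporaryDirectory() as d:
        return hammer_counter(os.path.join(d, "counter"), workers=4, iterations=100)
```

The result should equal `workers * iterations`. Run on a single host this only exercises local locking; to say anything about a cluster filesystem you would run it from several nodes against the same file at once, with far more processes and for far longer.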

Kernel oplocks and leases.
Another area where one needs to be very careful and configure things exactly right.

Kernel modifications and patches.
A lot of the subsystems used by a high-end NAS application will exercise parts of the kernel that have only had light load and testing applied before. A number of kernel modifications are required that are not yet in the distros. For example, a stock Linux distro and kernel probably has no hope at all of integrating with HSM in meaningful ways. It may look like it works, but sooner or later you will discover the parts that break.

Don't assume that just throwing some components and applications together will create a "solution".
It won't. Trust me, it will not work. Unless you know exactly how to configure all the components so that they are fully compatible with each other's use patterns, I can guarantee that a stock Linux distro using stock default configs will have nasty surprises waiting for you.

Don't play games with your data.
Make sure that ALL components in your solution are qualified to work together.

Saturday, February 9, 2008

GuitarHero : example of an active/passive failure mode

Last week was linux.conf. I did a short cameo in tridge's talk about CTDB and clustered Samba and improvised a "Tale of two bands" story about failure modes that are possible in Active/Passive setups but not in an All-Active setup such as CTDB+Samba.

In the previous post I touched on the subject of "what if the passive node fails to start?", while in this post I will theorize about a much worse failure mode: "what if both nodes try to be active at the same time?"

I have a PS2 and the game GuitarHero and one guitar controller. So far so good.

I also have two daughters, Ellie and Hanna aged 9 and 6. See the problem?

My kids are not like the kids in the Cosby family on TV, always happy, polite to each other and sharing their toys. Oh, no. My kids are more of the uncompromising "If I can't play with the toy, no one can!" kind. And they both LOVE GuitarHero. Not good.

Most of the time, there is no problem. We keep the PS2 in Ellie's room and all is well as long as me and my wife can arbitrate access to the PS2 properly. I.e. Ellie can only play when Hanna is at a friend's place or asleep. Hanna can only play when Ellie is either not yet home from school or over at a friend's place. No one can play when both are home at the same time, since then there will be "trouble".

So, a semi-hypothetical failure mode of GuitarHero (I like to try to pretend it is just hypothetical and that it never happened):

Ellie was over at a friend's. I let Hanna go into Ellie's room and play GH. A split-brain problem occurred: I forgot to tell my wife about the delicate and problem-prone situation.

Hanna is playing the game and having a great time.

Suddenly, Ellie comes home and my wife lets her in. I had forgotten to tell my wife that Hanna was in Ellie's room playing GH, so my wife was blissfully unaware of the danger and just told Ellie to go to her room and start doing her homework.

Ouch. Ellie walks into her room and sees Hanna, using Ellie's console, playing Ellie's favorite game. This will end in tears.

Immediately there is a huge fight: hair pulling, trying to rip the guitar controller from each other, until both are crying and one of them hits the other and then jumps on the console, resulting in a completely crushed console and a completely destroyed game disk. Total Loss of gameDisk. Now daddy is upset too, since he also wants to play the game, but as it is no one is going to play that game at all for a very long time, until it has been completely replaced from scratch.

I learnt an important lesson and have rectified the issue. It was really my own fault for not making sure that arbitration of access to the game was done properly, and the price I paid was dear: buying a completely new game disk and also being GuitarHero Unavailable for a long time. (Thankfully one of the stores still had replacement copies I could buy.)

Nowadays I have taken precautions to try to avoid this failure mode from ever happening again.
I have upgraded to GH2 and I have two guitar controllers. GH2 has a multiplayer mode where TWO players can use the game and play the same game at the same time.
This version of this most brill game is "multiplayer aware" and is designed to allow multiple people to access and play the game at the same time.
Everyone is happy in my household.

This is an analogy for Active/Passive. (Hint: it tries to illustrate the failure mode where BOTH the active node and the passive node in a failover Samba setup become active at the same time and start fighting over the shared filesystem, resulting in total destruction of the ext3 filesystem.)

Can you spot:
* what the GH game disk represents?
* what Ellie and Hanna represent?
* who was arbitrating access to the shared resource (the single GH controller)?
* where the split brain that allowed both nodes to become active occurred?
* who the active and passive nodes are?
* the consequence when both nodes became active?
* how GH + 1 controller versus GH2 + 2 controllers compares to Active/Passive versus All-Active?

To me and my family, the incident above was a pretty catastrophic failure mode we never want to experience again.

(I'll never go back to using just one GH controller ever again)

In an all-active solution, this failure mode should not happen since all the nodes are designed for shared access, and not as in active/passive where the nodes are designed for exclusive access and when exclusive access is no longer guaranteed ... there will be trouble ...
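
For those who prefer code to family drama, here is a toy Python model of the same failure mode. None of this is CTDB or Samba code; `SharedFilesystem`, `Arbiter` and `go_active` are names I made up for the sketch. The "filesystem" is a non-cluster one that gets corrupted as soon as two nodes have it mounted at once, and the arbiter hands out a single "you may be active" token.

```python
class SharedFilesystem:
    """Stands in for a non-cluster filesystem (e.g. ext3) on shared storage."""

    def __init__(self):
        self.mounted_by = set()
        self.corrupted = False

    def mount(self, node):
        self.mounted_by.add(node)
        if len(self.mounted_by) > 1:
            # Two writers that do not know about each other.
            self.corrupted = True

    def unmount(self, node):
        self.mounted_by.discard(node)


class Arbiter:
    """Hands out a single 'you may be active' token."""

    def __init__(self):
        self.holder = None

    def acquire(self, node):
        if self.holder is None or self.holder == node:
            self.holder = node
            return True
        return False

    def release(self, node):
        if self.holder == node:
            self.holder = None


def go_active(node, fs, arbiter=None):
    """Mount the filesystem, asking the arbiter first if there is one."""
    if arbiter is not None and not arbiter.acquire(node):
        return False
    fs.mount(node)
    return True
```

If two nodes call `go_active` without an arbiter, both succeed and `fs.corrupted` becomes true. With a shared `Arbiter`, the second node is refused until the first one unmounts and releases the token. The whole scheme is only as good as the arbitration, which is exactly the weak spot in the story above.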

Monday, January 28, 2008

CTDB doesn't do Failover and this is a good thing

The postings on this site are my own and don’t necessarily represent IBM’s positions, strategies or opinions.

These are my personal thoughts on CTDB and failover.

CTDB, while originally developed only as a component to clusterize Samba and to allow building a highly scalable all-active cluster, has evolved into a full-blown HA solution.

Just today I was thinking about what CTDB actually does and how. When thinking about it I realized that, technically speaking, clustered Samba doesn't really do failover at all. And that this is, in my opinion, a good thing.

Hey, what am I saying? I don't do failover for this cluster HA solution and I claim this is a good thing? That sounds absolutely insane, doesn't it? Well, not really. Let me try to explain...

CIFS is a very tricky protocol to clusterize since CIFS is so very stateful. In many existing HA solutions, what people often do to make a Samba-based NAS server robust against failures is build an Active/Passive solution: two nodes that are almost identical, both with a CIFS server and both connected to the same storage backend, but with only one of the two nodes active at any given time.
You then have a 2-node cluster with one active node and one passive node, and the passive node is in some semi-dormant state, waiting for when it needs to be brought online.

The idea is then that when the active node fails for some reason (a HW failure, say), you do a failover: you shut down the active node completely and boot up the passive node.
So far so good, but assuming you have the simplest HA solution and your passive Linux box with Samba is in a powered-off state, how can you be sure that the passive node will actually boot correctly and start up?
In general, I don't think you can be 100% sure that the passive node will start up correctly, since some kind of fault could have developed on the passive node while it was dormant, and maybe you won't detect that this fault exists until you actually try to bring the passive node up. At that stage it is too late, since the active node has already failed and the passive node just has to come up without any faults.

This is completely different from CTDB-Samba. CTDB-Samba does not use an Active/Passive concept. Instead it is an "All-Active" cluster where all nodes at all times are running and actively serving (the same) data. This could be a two-node cluster or a multinode cluster.

If we compare a two-node CTDB-Samba cluster with an active/passive Samba failover pair, the difference is that in the CTDB-Samba cluster BOTH nodes are normally active at the same time, and the workload from all attached clients is distributed across the two active nodes.

This means that if one of the nodes fails, we are not really doing a failover; instead we just redistribute/migrate all the clients from the failed node so that they are taken over by the node that remains. When one node in our two-node cluster fails, only 50% of the attached clients will experience a disruption (since only half of the active clients were attached to the failed node), instead of 100% of the clients as in the active/passive Samba setup.

Fewer clients are disrupted.
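
The arithmetic is easy to check with a few lines of Python. This is only a toy model of the bookkeeping, not how CTDB implements it; `redistribute` and `disrupted_fraction` are hypothetical helpers, and it assumes clients are spread evenly across the nodes.

```python
def redistribute(assignments, failed, survivors):
    """Move the clients of `failed` onto `survivors`, round-robin.

    `assignments` maps client -> node. Returns the new mapping and the
    list of clients that had to move (i.e. that saw a disruption).
    """
    if not survivors:
        raise ValueError("no surviving nodes to take over clients")
    new = dict(assignments)
    disrupted = sorted(c for c, n in assignments.items() if n == failed)
    for i, client in enumerate(disrupted):
        new[client] = survivors[i % len(survivors)]
    return new, disrupted


def disrupted_fraction(assignments, failed):
    """Fraction of all clients that were attached to the failed node."""
    if not assignments:
        return 0.0
    hit = sum(1 for n in assignments.values() if n == failed)
    return hit / len(assignments)
```

With ten clients split evenly over two nodes, `disrupted_fraction` gives 0.5 when either node fails. Put all ten on one "active" node, as in an active/passive pair, and it gives 1.0. With four evenly loaded nodes it drops to 0.25.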

Second, when the node fails and I migrate the clients over to the other node, that other node is already active and serving data. It is already running and may have been running for quite a while! For me this means much less uncertainty. There is no longer a question of "will the passive node start up?" since we never start the other node up! It is already running and it is (hopefully) running fine. It is like having an automatic verification that all systems are go PRIOR to actually carrying out the failover procedure.
I.e. I can be pretty confident that the "failover" will work, since the node that will take over the workload is already active and known to be healthy.

I think this is a pretty cool side effect of having an All-Active cluster!

Sorry for using the "failover" word several times above since I don't think "failover" is a very accurate word to describe what CTDB does, but I don't have any better word available. :-)


Oh, by the way, feel free to try out CTDB/Samba. It is very cool software and can be downloaded from ctdb.samba.org.


Hello world.

This is my first blog entry.
My name is Ronnie Sahlberg. I work as an open source developer for the Linux Technology Centre at IBM. My main focus is to work with Samba and the CTDB component of Samba, but I also hack on Wireshark from time to time.
This is my blog.
The postings on this site are my own and don’t necessarily represent
IBM’s positions, strategies or opinions.

So what is CTDB and what is it that I do?
CTDB (Clustered Trivial DataBase) is a very thin and VERY fast database that is developed for Samba, to clusterize Samba. What CTDB does is make it possible for Samba to run and serve the same data from several different hosts in your network AT THE SAME TIME!

This means that with CTDB, Samba suddenly becomes a clustered service where all nodes in the cluster are active and export the same Samba shares read-write at the same time. This allows for both very high performance (if we manage to scale well, which we do) and some pretty interesting reliability features.

On this blog I will talk about various issues and topics I work with related to CTDB and what I do with CTDB to make it "tick".