Saturday, February 9, 2008

GuitarHero : example of an active/passive failure mode

Last week was linux.conf. I did a short cameo in tridge's talk about CTDB and clustered Samba and improvised a "Tale of two bands" story about different failure modes that are possible in Active/Passive setups, but are not possible in an All-Active setup such as CTDB+Samba.

In the previous post I touched on the subject of "what if the passive node fails to start", while in this post I will theorize about a much worse failure mode: "what if both nodes try to be active at the same time"?

I have a PS2 and the game GuitarHero and one guitar controller. So far so good.

I also have two daughters, Ellie and Hanna aged 9 and 6. See the problem?

My kids are not like the kids in the Cosby family on TV, always happy, polite to each other and sharing their toys. Oh, no. My kids are more of the uncompromising "If I can't play with the toy, no one can!" type. And they both LOVE GuitarHero. Not good.



Most of the time, there is no problem. We keep the PS2 in Ellie's room and all is well as long as my wife and I can arbitrate access to the PS2 properly. I.e. Ellie can only play when Hanna is at a friend's place or when Hanna is asleep. Hanna can only play when Ellie is either not yet home from school or over at a friend's place. No one can play when both are home at the same time, since then there will be "trouble".


So, a semi-hypothetical failure mode of GuitarHero (I like to try to pretend it is just hypothetical and that it never happened):

Ellie was over at a friend's. I let Hanna go into Ellie's room and play GH. A split-brain problem occurred: I forgot to tell my wife about the delicate and problem-prone situation.

Hanna is playing the game and having a great time.

Suddenly, Ellie comes home and my wife lets her in. I had forgotten to tell my wife that Hanna was in Ellie's room playing GH, so my wife was blissfully unaware of the danger and just told Ellie to go to her room and start doing her homework.

Ouch. Ellie walks into her room and sees Hanna, using Ellie's console, playing Ellie's favorite game. This will end in tears.

Immediately there is a huge fight: hair pulling, each trying to rip the guitar controller from the other, until both are crying. Then one of them hits the other and jumps on the console, resulting in a completely crushed console and a completely destroyed game disk. Total Loss of gameDisk. Now daddy is upset too, since he also wants to play the game, but as it is no one is going to play that game at all for a very long time, until it has been completely replaced from scratch.


I learnt an important lesson and have rectified the issue. It was really my own fault for not making sure that arbitration of access to the game was done properly, and the price I paid was dear: buying a completely new game disk and also being GuitarHero Unavailable for a long time. (Thankfully one of the stores still had replacement copies I could buy.)


Nowadays I have taken precautions to try to avoid this failure mode from ever happening again.
I have upgraded to GH2 and I have two guitar controllers. GH2 has a multiplayer mode where TWO players can use the game and play the same game at the same time.
This version of this most brill game is "multiplayer aware" and is designed to allow multiple people to access and play the game at the same time.
Everyone is happy in my household.


This is an analogy for Active/Passive (hint: it tries to illustrate the failure mode where BOTH the active node and the passive node in a failover Samba setup become Active at the same time and start fighting over the shared filesystem, resulting in total destruction of the ext3 filesystem).


Can you spot:
* what the GH game disk represents?
* what Ellie and Hanna represent?
* who was arbitrating access to the shared resource (the single GH controller)?
* where the split brain that allowed both nodes to become active occurred?
* who the active and passive nodes are?
* the consequence when both nodes became active?
* how GH + 1 controller vs GH2 + 2 controllers compares to Active/Passive vs All-Active?


To me and my family, the incident above was a pretty catastrophic failure mode we never want to experience again.
:-)


(I'll never go back to using just one GH controller ever again)

In an all-active solution, this failure mode should not happen, since all the nodes are designed for shared access. In active/passive the nodes are designed for exclusive access, and when exclusive access is no longer guaranteed ... there will be trouble ...
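
For those who prefer code to family drama, here is a rough Python sketch of the same idea. It is only a toy model: the names ExclusiveDisk, SharedDisk, Arbiter and run are made up for this illustration and say nothing about how Samba, CTDB or ext3 are actually implemented. It just assumes that an exclusive-access resource is ruined the moment a second node attaches to it, and that a split brain means the arbiter is never consulted.

```python
class ExclusiveDisk:
    """Hypothetical resource that is only safe with ONE user at a time."""

    def __init__(self):
        self.users = set()
        self.destroyed = False

    def attach(self, node):
        self.users.add(node)
        # Assumption of this toy model: a second uncoordinated user
        # immediately wrecks the resource.
        if len(self.users) > 1:
            self.destroyed = True


class SharedDisk:
    """Hypothetical resource designed for several users at once (all-active)."""

    def __init__(self):
        self.users = set()
        self.destroyed = False

    def attach(self, node):
        # Shared access is part of the design, so nothing breaks.
        self.users.add(node)


class Arbiter:
    """Hands the active role to at most one node, but only if it is asked."""

    def __init__(self):
        self.active = None

    def request(self, node):
        if self.active is None:
            self.active = node
            return True
        return self.active == node


def run(disk, use_arbiter):
    """Let two nodes try to become active; return True if the disk was destroyed."""
    arbiter = Arbiter()
    for node in ("node_a", "node_b"):
        # use_arbiter=False models the split brain: nobody checks who is active.
        if not use_arbiter or arbiter.request(node):
            disk.attach(node)
    return disk.destroyed


if __name__ == "__main__":
    print("active/passive, arbitration works:", run(ExclusiveDisk(), use_arbiter=True))
    print("active/passive, split brain:      ", run(ExclusiveDisk(), use_arbiter=False))
    print("all-active, no arbitration needed:", run(SharedDisk(), use_arbiter=False))
```

Only the middle case reports a destroyed disk. The exclusive resource survives exactly as long as the arbitration does, while the shared one never depended on arbitration in the first place.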