Best practices for adding a second FC switch to the fabric in a production environment


I have a single Brocade SilkWorm 200E switch in production right now. The corporate Exchange server and 3 ESX 3.5 hosts are connected to the CLARiiON CX3 array through it. Ports 0 and 1 are SPA0 and SPA1, and ports 4 and 5 are SPB0 and SPB1.

My plan is to add a Brocade Silkworm 300 switch next to the 200 (it's already racked and powered on), go to the datacenter and pull SPA1 and SPB0 out of the 200 and insert them into ports in the 300 switch.

I'm a little paranoid about pulling out FC paths that are in production. My assumption is that things will just fail over to SPA0 and SPB1, and A1 and B0 won't be missed. However, I'd like a firm understanding of what I could do to further minimize risk, if possible.

If a LUN is currently owned by SPA, does it automatically utilize both SPA0 and SPA1 in round robin, or does the switch prefer a particular path exclusively unless failed off of it? For example, is the Exchange server using SPA0 or SPA1, or does it use both 0 and 1 active/active?

I'm guessing that if it's using both paths to a SP active/active that disrupting one of them is less risk because I'm assured it is using the other path already without trouble. I'm scared of forcing failover to an alternate path it hasn't used before and then finding out that cable was wonky or something.

Should I be totally disruptive to the company and shut down all virtual machines and the exchange server just to be sure no data corruption happens in the event of a bad failover? Or is this excessive? Either way, I'm going to do the operation immediately following a full backup cycle.

How would you monitor the failover as it happens? Is the Brocade 200E going to log it in detail? I want maximum assurance that everything is still working when I pull those plugs. I can rescan storage on the ESX hosts and watch the PowerPath monitor on the Exchange server. Is there anything better I could be doing?

I'd rather be far more cautious than the situation merits than make overconfident assumptions about doing something like this for the first time, when all our eggs are in this one basket.

Best Answer

I'm hoping that your plan is to set up a second, independent fabric; that is generally considered a good idea.

You don't say whether your servers have multiple HBAs. I'd hope so, as that will allow you to properly reconfigure for redundant fabrics, but if not it won't significantly affect your immediate plan.

PowerPath will handle failover for the Exchange server. It should choose a path via A1 when A0 is disconnected, and not B0 or B1 unless both SPA ports have failed. If any paths are not operational it will tell you, or at the very least you won't see the paths that you expect. Depending on which version of PowerPath you have (i.e. the SE version or the fully licensed version), you may have load-balancing multipath policies active, but in any case path failover should be reliable for the setup you describe. If you happen to disconnect an active path, PowerPath will re-route the failed I/Os through an alternate path, provided that path is healthy. You can check the path status within the PowerPath GUI, or from the command line use powermt check to check for failed/new paths and powermt restore to check and remove/add dead/new paths. If the path policy is already set for load balancing and there are healthy paths visible via both SPA0 and SPA1 (for example), then you can have a pretty high level of confidence that everything is OK.
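To make the failover reasoning above concrete, here is a minimal Python sketch of a pre-check you could run on paper before pulling cables. It does not call PowerPath or ESX at all: the Path record, the LUN labels, the ownership map and the classify_after_pull function are hypothetical names invented for illustration, and you would fill in the path states by hand from whatever your multipath tools currently report. It only encodes the rule described above: a LUN stays on its owning SP if a healthy path to that SP survives, moves to the peer SP if only peer paths survive, and is unreachable if nothing survives.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Path:
    lun: str       # hypothetical LUN label, not a real device name
    sp_port: str   # array port the path ends on: "A0", "A1", "B0" or "B1"
    alive: bool    # path state as currently reported by the host


def classify_after_pull(paths, owners, pulled_ports):
    """Predict, per LUN, what is left once the given SP ports are unplugged.

    Returns a dict mapping LUN -> one of:
      "same-sp"  : a healthy path to the owning SP remains
      "trespass" : only paths to the peer SP remain
      "outage"   : no healthy path remains at all
    """
    result = {}
    for lun, owner in owners.items():
        survivors = [
            p for p in paths
            if p.lun == lun and p.alive and p.sp_port not in pulled_ports
        ]
        if any(p.sp_port.startswith(owner) for p in survivors):
            result[lun] = "same-sp"
        elif survivors:
            result[lun] = "trespass"
        else:
            result[lun] = "outage"
    return result


# Made-up example data (assumption): three LUNs with different path health.
PATHS = [
    Path("exchange", "A0", True), Path("exchange", "A1", True),
    Path("exchange", "B0", True), Path("exchange", "B1", True),
    Path("vmfs1", "A0", True), Path("vmfs1", "B0", True),
    Path("vmfs1", "B1", False),   # e.g. a bad cable nobody noticed
    Path("vmfs2", "A1", True),    # host zoned to a single port only
]
OWNERS = {"exchange": "A", "vmfs1": "B", "vmfs2": "A"}

if __name__ == "__main__":
    # The plan in the question: move SPA1 and SPB0 to the new switch.
    for lun, outcome in classify_after_pull(PATHS, OWNERS, {"A1", "B0"}).items():
        print(lun, outcome)
```

With this made-up data, the fully pathed LUN stays on its owning SP, the LUN with the dead B1 path would have to move to SPA, and the single-path LUN would lose access entirely. Anything other than "same-sp" is worth fixing before the maintenance window.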

On the ESX servers you should be able to check the paths available for each LUN from within the VI Client->Configuration->Storage tab. In the properties you can see the available paths and which are active and which are standby, and in the Manage Paths dialog you can change the policy (Fixed/MRU/Round Robin). You shouldn't need to change anything, but again you will want to make sure that the failover path you want it to use is available. As with PowerPath, ESX's multipath stack will handle the failover: if I/Os are in flight on an active path and it detects that the path has failed, it will resend them on another path. ESX 3.5 only supports round robin multipathing experimentally, so you don't want to be messing with it in this case. If you want to be proactive, you could set a fixed path policy temporarily and force the LUNs over to the path you want, but the standard setting for the CX3 is to leave it at MRU, and that should be fine.

In both cases there may be some lag before the failover happens and I/Os may stall briefly, but nothing should fail provided the redundant paths are actually healthy.