Anyone who deals with Sun/StorageTek SAN hardware will know all about firmware upgrades and firmware versioning. It’s tricky to get the right level for you at the right time, and of course, just like any firmware, there’s always a newer one to fix bugs you’ll “probably” never have.
At the same time, there are also updates which you do need; you just don’t know it until you somehow trigger the bug they fix.
Now, as I explained to my boss and colleagues, even if I spent days reading Sun bug reports I’d be very little wiser, and the truth is they just aren’t published, as a lot of what they hold is commercially sensitive or damaging.
We have 7 StorageTek (Sun->Oracle) arrays: 1 is retired, and 2 are now off maintenance and used as scratch area (trade-in just doesn’t get you much, so they’re more useful working for us). The other 4 are very much live: 3 x 6140 and 1 x 6540.
Across our 3 6140 arrays we run pretty old firmware, 6.19.xx.xx. Why? Well, partly the old adage of “if it ain’t broke...”, and also we simply don’t require a lot of the newer feature sets.
Going back nearly 3 years, Sun introduced the ‘crystal’ firmware, the 7.YY.xx.xx range of firmware. It brought many bug fixes, and also removed the 2TB LUN limit which the 6 series firmware had. The upgrade wasn’t (isn’t) trivial, and as we had little reason to go past 2TB, we elected to stay put. This is a fully supported thing to do.
All was fine until we hit 4 conditions at once.
1. We use RVM (remote volume mirroring). 2 of our SANs replicate certain LUNs to each other. I’d always been dubious about this setup as various things had been done badly (mirror db on same spindles etc). It was an inherited config.
2. We have firmware 6.19.xx.xx.
3. We use VMware 3.5U4.
4. A disk failed in a RAID10 group, where the RVM and VMware storage was held.
This triggered a lesser-known fault where VMware fails to receive the correct heartbeat SCSI bus response from the array via the vmfs driver, and it corrupts the MFT (master file table).
How? Well, that’s partly a mystery, as neither Sun nor VMware will give us detail. Why? It’s commercially sensitive and embarrassing to both the chip manufacturer and VMware.
So, I’m faced with a fault where, though VMware can see its volumes, browsing them shows no files. The data is there; it’s just that the failed heartbeat commands have caused the MFT to be overwritten. The result is guest OSes on VMware just stop, or report they can’t read their disk, BSOD etc. Imagine it: a 7 host HA farm which loses 48 of its 138 guest OSes... finance systems, email, SQL, you name it, it died.
Now without backups and a neat tool that replicates the vmdk files off site (and on site) we would have been f*cked. Even so, it’s a mountain of work to start recovering that many systems overnight and get them back online. On top of this, you are restoring them to the very setup which just caused the problem, but where else can they go? This isn’t a trivial amount of storage.
We did it, but that’s not my point here…
So, we discover more detail. Sun engaged VMware and LSI to look into the issue. VMware’s analysis was that the corruption occurs in the metadata and heartbeat records: the vmfs driver has a pending heartbeat update but fails to find the heartbeat slot, which indicates corruption. You can find indications of the heartbeat corruption in the logs with an error along the lines of “Waiting for timed-out heartbeat”. LSI have also looked into why this happens and identified that, at certain levels of code, VMware is not following the SCSI specification regarding the handling of aborted commands.
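If you want to check your own hosts for that message, a minimal sketch along these lines would do it. The search phrase is the one quoted above; the exact wording in your logs may differ, and the function names and the idea of feeding it a log file path are my own assumptions, not anything Sun or VMware supplied.

```python
import re

# Phrase quoted in the post; the real log wording may differ slightly,
# so the match is case-insensitive and not anchored.
HEARTBEAT_PATTERN = re.compile(r"waiting for timed-out heartbeat", re.IGNORECASE)


def find_heartbeat_errors(lines):
    """Return (line_number, text) for every line mentioning the heartbeat timeout."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if HEARTBEAT_PATTERN.search(line):
            hits.append((number, line.rstrip("\n")))
    return hits


def scan_log_file(path):
    """Scan one log file; 'path' is whatever log you have copied off the host."""
    with open(path, "r", errors="replace") as handle:
        return find_heartbeat_errors(handle)
```

Point scan_log_file at a copy of the host log rather than the live file, and treat a hit as a prompt to raise a support call, not as proof of corruption.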
We gathered logs for VMware and logs for Sun, and in essence the answer back from both parties (who in one way or another blame each other... or in reality LSI, the chip manufacturer) is to upgrade to the crystal firmware.
The testing done by Sun/LSI/VMware reproduced the bug, but showed the vulnerable window only opened every 4 hours (a cycle-based thing) and only for 1-2 seconds. That is 6 cycles a day at up to 2 seconds each, so you only trigger this bug if you blow out a disk in those 12 seconds across 24 hrs... LUCK.
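To put a rough number on that luck, here is a back-of-envelope sketch. It assumes a disk failure is equally likely at any moment of the day, which is my simplification and not something Sun, LSI or VMware stated; the function names are made up for the illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def vulnerable_seconds_per_day(cycle_hours, window_seconds):
    """Total seconds per day in which a disk failure could trigger the bug."""
    cycles_per_day = 24 / cycle_hours
    return cycles_per_day * window_seconds


def chance_failure_hits_window(cycle_hours, window_seconds):
    """Chance that one disk failure lands in a window, assuming failures
    are equally likely at any moment (a simplifying assumption)."""
    return vulnerable_seconds_per_day(cycle_hours, window_seconds) / SECONDS_PER_DAY


# Figures from the testing: a window every 4 hours, lasting 1-2 seconds.
best_case = vulnerable_seconds_per_day(4, 1)     # 6 seconds per day
worst_case = vulnerable_seconds_per_day(4, 2)    # 12 seconds per day
odds = chance_failure_hits_window(4, 2)          # 12 / 86400, i.e. 1 in 7200
```

So even taking the 2 second figure, any single disk failure has about a 1 in 7200 chance of landing in the window.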
The next twist is that at code 6.19.xx.xx you run Sun’s own multipathing software, rdac, which makes sure you only ‘see’ your LUN once, as opposed to once per path. How many paths you have depends on the cabling/path redundancy you put in; we always see the LUN 4 times without rdac (2 HBAs in the host, and 1 link to each of the 2 controllers in the array).
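The 4 comes straight from multiplying it out: every HBA can reach every controller, so the host sees one copy of the LUN per HBA/controller pair. A small sketch of that count follows; the HBA and controller names are hypothetical labels, and this only models the counting, not what rdac or MPIO actually do internally.

```python
from itertools import product


def enumerate_paths(hbas, controllers):
    """Every (HBA, controller) pair is one path from the host to the LUN."""
    return list(product(hbas, controllers))


def visible_lun_copies(hbas, controllers, multipathing):
    """Without multipathing the host sees the LUN once per path;
    with it, the paths are collapsed into a single device."""
    if multipathing:
        return 1
    return len(enumerate_paths(hbas, controllers))


# Hypothetical labels for our layout: 2 HBAs in the host, 2 controllers in the array.
paths = enumerate_paths(["hba0", "hba1"], ["ctrl_a", "ctrl_b"])
without_rdac = visible_lun_copies(["hba0", "hba1"], ["ctrl_a", "ctrl_b"], False)
with_rdac = visible_lun_copies(["hba0", "hba1"], ["ctrl_a", "ctrl_b"], True)
```

With our layout that gives 4 paths, so 4 copies of the LUN without multipathing and 1 with it.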
Code 6.60.xx.xx and above allows you to run both rdac and Microsoft’s MPIO (for Windows hosts); however, version 7 code onwards only supports MPIO.
I’m not saying that rdac won’t work on the high code, or MPIO won’t work on the low code, but they aren’t SUPPORTED.. magic words in a support agreement.
To get from 6.19.xx.xx to 7.60.xx.xx we have to go to 6.60.xx.xx in between, otherwise we have no means of converting our Windows hosts to MPIO (and testing!! This is production kit, remember, in a 24/7/365 operation). Yes, I have to arrange to take down 138 VM guests and about 16 Windows hosts... twice.
Another important detail is that this hasn’t been ‘fixed’ by VMware, even in vSphere 4; they don’t regard it as their fault (despite it being a matter of SCSI bus standards, or their lack of adherence to them). LSI had to write in a separate VMware host region to take it away from ‘Linux’, so there is a specific region just for VMware with what are clearly non-standard responses to SCSI commands requested from the VM hosts... I find that.. ODD.
The moral of the story? There isn’t one. There is no right or wrong way here in firmware terms; it’s horses for courses. It’s pure chance we hit a disk failure in that situation. We get a fair number of routine disk failures, on this array included; it’s the nature of the beast. This one time, we triggered the event.