Sunday, September 29, 2013

Hyper-V 2012 CSV IO issues when deleting or moving Virtual Disks

Recently I was involved in a Hyper-V 2012 implementation. The environment consisted of Dell M620 Blades and Dell Compellent Shared Storage over 10Gbps iSCSI connectivity.

After the implementation, we began migrating VM's from the previous VMware environment to the new Hyper-V 2012 environment. As part of this process we were performing V2V conversions in Virtual Machine Manager 2012 SP1. As you may be aware during a V2V conversion process the VM is converted to VHD, to leverage the new VHDX capacity and performance features we then converted each VHD to VHDX format. This process then involved removing the left over VHD....and this is where things began to appear to be a problem.

During the process of deleting the VHD from a Clustered Shared Volume (CSV) we began to see all the VM's on the hosts to pause or restart. After further investigation we came across the following event entries:

A.After we deleted the vhd files, there is “IO_TIMEOUT” issue reported on the host machine.

00002e64.00001a4c::2013/07/23-13:46:37.844 INFO  [DCM] CsvFs event CsvFsVolumeStateChangeFromIO for volume NSW_NSWHYPERV-01_06_T23_DR:bcd163b1-00bd-4aa0-b524-9e3758f106ec, status STATUS_IO_TIMEOUT(c00000b5)
00002e64.00001a4c::2013/07/23-13:46:37.844 INFO  [DCM] UnmapPolicy::enter_CountingToBad(bcd163b1-00bd-4aa0-b524-9e3758f106ec): goodTimer P0...75, badTimer R0...150, badCounter 1 state CountingToBad

B. To handle the CSV c00000b5 event, the volume state was changed into ‘Draining State’, then ‘Pause’;

00002e64.00001a4c::2013/07/23-13:46:37.849 INFO  [DCM] CsvFs Listener: state [volume bcd163b1-00bd-4aa0-b524-9e3758f106ec, sequence <4:9><23>, state CsvFsVolumeStateChangeFromDrain->CsvFsVolumeStateDraining, status 0x0]
00002e64.00000ac0::2013/07/23-13:46:37.849 INFO  [DCM] FilterAgent: MarkBad: volume NSW_NSWHYPERV-01_06_T23_DR, state ActiveDirectIO, csvfs state CsvFsVolumeStateDraining, status CsvFsVolumeStateDraining
00002e64.00000ac0::2013/07/23-13:46:37.849 INFO  [DCM] FilterAgent: MarkBad() Volume \\?\GLOBALROOT\Device\Harddisk3\ClusterPartition2\ has gone into draining state
00002e64.00000ac0::2013/07/23-13:46:37.851 INFO  [DCM] MappingManager::PauseVolume 'NSW_NSWHYPERV-01_06_T23_DR'
00002e64.00000ac0::2013/07/23-13:46:37.851 INFO  [DCM] FilterAgent: pausing volume bcd163b1-00bd-4aa0-b524-9e3758f106ec, target \\?\GLOBALROOT\Device\Harddisk3\ClusterPartition2\
00002e64.00000ac0::2013/07/23-13:46:37.851 INFO  [DCM] FilterAgent: ChangeCsvFsState: uniqueId bcd163b1-00bd-4aa0-b524-9e3758f106ec, state CsvFsVolumeStatePaused, sequence <4:10><24>
00002e64.00001a4c::2013/07/23-13:46:37.881 INFO  [DCM] CsvFs Listener: state [volume bcd163b1-00bd-4aa0-b524-9e3758f106ec, sequence <4:10><24>, state CsvFsVolumeStateChangeFromPause->CsvFsVolumeStatePaused, status 0x0]
00002e64.00000ac0::2013/07/23-13:46:37.881 INFO  [DCM] short name is C:\CLUSTE~1\NSW_NS~4
00002e64.00000ac0::2013/07/23-13:46:37.881 INFO  [DCM] PauseNFilter bcd163b1-00bd-4aa0-b524-9e3758f106ec, NTFS: 00000000-0000-0000-0000-000000000000
00002e64.00000e24::2013/07/23-13:46:37.881 INFO  [DCM] Set NfltPauseComplete event for  bcd163b1-00bd-4aa0-b524-9e3758f106ec
00002e64.00000ac0::2013/07/23-13:46:37.881 INFO  [DCM] PauseNFilter completed for bcd163b1-00bd-4aa0-b524-9e3758f106ec
00002e64.00000ac0::2013/07/23-13:46:37.881 INFO  [DCM] FilterAgent: PauseSnapshots completed for NSW_NSWHYPERV-01_06_T23_DR:bcd163b1-00bd-4aa0-b524-9e3758f106ec
00002e64.00000ac0::2013/07/23-13:46:37.881 INFO  [DCM] volume paused 'NSW_NSWHYPERV-01_06_T23_DR'
00002e64.000031a4::2013/07/23-13:46:37.881 INFO  [DCM] PauseDisk completed for resource 'NSW_NSWHYPERV-01_06_T23_DR'

C. After several minutes, the CSV comes online.
00002e64.00000ac0::2013/07/23-13:49:15.602 INFO  [DCM] ActivateNetworkPath bcd163b1-00bd-4aa0-b524-9e3758f106ec completed
00002e64.00000ac0::2013/07/23-13:49:15.602 INFO  [DCM] ActivateVolume: NSW_NSWHYPERV-01_06_T23_DR:bcd163b1-00bd-4aa0-b524-9e3758f106ec
00002e64.00000ac0::2013/07/23-13:49:15.602 INFO  [DCM] Bitlocker not installed or loaded
00002e64.00000ac0::2013/07/23-13:49:15.602 INFO  [DCM] FilterAgent: ChangeCsvFsState: uniqueId bcd163b1-00bd-4aa0-b524-9e3758f106ec, state CsvFsVolumeStateActive, sequence <><31>
00002e64.00001a4c::2013/07/23-13:49:15.602 INFO  [DCM] CsvFs Listener: state [volume bcd163b1-00bd-4aa0-b524-9e3758f106ec, sequence <><31>, state CsvFsVolumeStateChangeFromActivate->CsvFsVolumeStateActive, status 0x0]

During a period of up to 15 minutes, VM's would be hung or restarted. The recovery of the VM's would vary and in some extreme cases we actually had data corruption.

Needless to say we got Microsoft Premier Support involved.

We were quickly advised to apply: http://support.microsoft.com/kb/2813630/en-us
Disable ODX in our environment:
Set-ItemProperty hklm:\system\currentcontrolset\control\filesystem -Name "FilterSupportedFeaturesMode" -Value 1

However we were reluctant to do this as we in fact had a nice new storage array that supports this super cool functionality. We applied the hotfix and didn't disable ODX (as mentioned above).

Unfortunately this didn't resolve our issues...DAMN!

We then came across the following hotfix's and also applied these to all hosts:
KB2796995
KB2838043 


Unfortunately this didn't resolve our issues either...DAMN!

We then went and tried to update the Broadcom Network Drivers on all hosts
No luck either!

Finally we applied the following hotfix (KB2870270):
http://support.microsoft.com/kb/2870270

Horaaa!!!! This was the lucky hotfix. This took over a month to resolve and was a very time consuming and frustrating process. I hope this helps anyone else out there who is currently going through similar pain...







1 comment:

  1. hola amigos que bien este blog gracias por su trabajo y si quiere comprar replica reloj puede buscar aca replika saatler

    ReplyDelete