Windows – Server 2012 Data Deduplication Skipping Replica VHDs

deduplication, powershell, replication, windows

I'm currently trying to use data deduplication on two separate Windows Server 2012 Datacenter edition Hyper-V hosts. On one, I am trying to dedupe replicas that are still being resynced every 5 minutes or so. On the other, I have stopped the resync with a PowerShell script on about 15 servers (4 terabytes of data) and moved them to the root of the volume that has deduplication enabled.

Now, for some reason, deduplication works with anything I put in there except replica VHD images. It just skips them.

I put 50 GB of templates and ISOs in there and it worked great. I initiate the deduplication like so:

Start-DedupJob -Volume R: -Type Optimization -Full

It works great normally, but the actual reason I'm using it in the first place is to reduce the space required to store a snapshot of the replica VHD. I would prefer to have the Hyper-V host keep resyncing the VHDs while deduplication runs, but if I have to stop the sync, dedupe, and then unoptimize before resyncing, that is fine with me; I can just script it out. Right now, though, under no circumstances can I get the replica VHDs to dedupe! It's driving me crazy.
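If I do end up having to quiesce the replicas first, the scripted workflow I have in mind is roughly the sketch below (untested; it assumes the replicas live on the R: dedup volume and that suspending replication releases the files, and it uses Suspend-VMReplication, Start-DedupJob -Type Unoptimization and Resume-VMReplication for the pause/undo/resume steps):

# Rough sketch only: pause replication, dedupe, later unoptimize and resume.
# Assumes the replica VHDs sit on the R: dedup volume.
$replicas = Get-VM | Where-Object { $_.ReplicationMode -eq "Replica" }

foreach ($vm in $replicas) {
    Suspend-VMReplication -VMName $vm.Name
}

# Full optimization pass over the volume; -Wait blocks until the job finishes
Start-DedupJob -Volume R: -Type Optimization -Full -Wait

# ...later, before allowing a resync, undo the optimization and resume
Start-DedupJob -Volume R: -Type Unoptimization -Wait

foreach ($vm in $replicas) {
    Resume-VMReplication -VMName $vm.Name
}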

Any advice, suggestions, would be greatly appreciated.

UPDATE:

I have two VHDs: one is from a template, and the other is a replica image of a 1.6 terabyte data drive on a VM on another Hyper-V host.

I've matched all the file properties and permissions so they are identical, including ownership. The only difference is that the file that does dedupe is flagged with attributes APL, while the one that is skipped is just attribute A. I am not sure what P and L are, and I don't believe I can set them with attrib.exe.
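For reference, I've been comparing the raw attribute flags from PowerShell with something like this (the paths are just placeholders for my template VHD and the replica VHD):

# Compare the attribute flags on the two files (paths are placeholders)
$good = Get-Item "R:\Templates\template.vhd"
$bad  = Get-Item "R:\Replicas\replica-data.vhd"

$good.Attributes   # the file that does dedupe shows the extra flags
$bad.Attributes    # the replica VHD only shows Archive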

It's so crazy; no replica VHDs will dedupe whatsoever!

UPDATE:

The script I am using to optimize the VHDs is:

$vhds = Get-ChildItem -Recurse | Where-Object { $_.Extension -match "vhd" }

foreach ($vhd in $vhds) {
    # Mount read-only, retrim, then dismount
    Mount-VHD -Path $vhd.FullName -Verbose -ReadOnly
    Optimize-VHD -Path $vhd.FullName -Verbose -Mode Retrim
    Dismount-VHD -Path $vhd.FullName -Verbose
}

I have run that and noticed it takes a little longer for the dedupe process to finish, but there is still no deduplication going on with the replica VHDs. This is very strange to me; I was hoping that if something was flagging the file as 'open', it would no longer do so after Optimize-VHD runs. The VHDs in question have not been written to for a while now. I used this script to turn off resync on the host to stop the writes:

$vmlist = Get-VM * | Where-Object { $_.ReplicationState -eq "Replicating" -and $_.State -eq "Running" }

foreach ($vm in $vmlist) {
    # Disable automatic resynchronization so the replica VHDs stop changing
    Set-VMReplication -VMName $vm.Name -AutoResynchronizeEnabled $false
}
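After running that, I sanity-check the result with something along these lines (Get-VMReplication and Get-DedupStatus are what I believe report the relevant state; exact property names may vary):

# Confirm automatic resync is actually off for the replicated VMs
Get-VMReplication | Select-Object VMName, AutoResynchronizeEnabled

# See what the dedup engine thinks it has processed on the volume
Get-DedupStatus -Volume R: | Select-Object OptimizedFilesCount, InPolicyFilesCount, SavedSpace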

Best Answer

I suspect your replica VHDs are either constantly open with a write lock, or written to too frequently to pass the MinimumFileAgeDays setting (5 days by default; it can be set as low as 0 with Set-DedupVolume <Drive>: -MinimumFileAgeDays 0).
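If it is the file-age policy, something along these lines should rule it out (a minimal sketch, not tested against replica VHDs):

# Let dedup consider files of any age, then re-run a full optimization
Set-DedupVolume -Volume R: -MinimumFileAgeDays 0
Start-DedupJob -Volume R: -Type Optimization -Full -Wait

# Verify the setting took effect
Get-DedupVolume -Volume R: | Select-Object Volume, MinimumFileAgeDays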

By the way, the documentation clearly declares such a configuration "unsupported":

Unsupported configurations

Constantly open or changing files

Deduplication is not supported for files that are open and constantly changing for extended periods of time or that have high I/O requirements, for example, running virtual machines on a Hyper-V host, live SQL Server databases, or active VDI sessions.

Deduplication can be set to process files that are 0 days old and the system will continue to function as expected, but it will not process files that are exclusively open. It is not a good use of server resources to deduplicate a file that is constantly being written to, or will be written to in the near future. If you adjust the default minimum file age setting to 0, test that deduplication is not constantly being undone by changes to the data.

Deduplication will not process files that are constantly and exclusively open for write operations. This means that you will not get any deduplication savings unless the file is closed when an optimization job attempts to process a file that meets your selected deduplication policy settings.

The same documentation also contains the following recommendation:

Not good candidates for deduplication:

  • Hyper-V hosts
  • VDI VHDs
  • WSUS
  • Servers running SQL Server or Exchange Server
  • Files approaching or larger than 1 TB in size
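Since the most likely culprit in your case is the "constantly and exclusively open" caveat, a quick way to test whether a given replica VHD is currently locked is to try opening it yourself with no sharing allowed (an illustrative check only; the path is a placeholder):

# If this throws a sharing violation, another process has the file open
# and the optimization job will skip it.
$path = "R:\Replicas\replica-data.vhd"
try {
    $fs = [System.IO.File]::Open($path, 'Open', 'Read', 'None')
    $fs.Close()
    "Not locked - the optimization job should be able to process it."
}
catch {
    "Locked: $($_.Exception.Message)"
}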

It looks a bit like what you are looking for is online deduplication, which dedupes data as it is being written to disk. This is a feature of some more sophisticated SAN solutions (including Nexenta's SMB-targeted offerings), but it comes at a rather high hardware cost: you would need a powerful machine with a lot of RAM to have online dedup run smoothly.
