Exchange 2013 database corruption with event 476 in ESE

Tags: corruption, dell-powervault, exchange, exchange-2013, vmware-vsphere

I'm seeing random database corruption on our Exchange Server 2013, logged as ESE event 476. This is the fifth time it has happened, and the situation is already unacceptable. Here's a screenshot of Event Viewer showing the incident.

[Screenshot: ESE Database Corruption]

Recovery has to be done either from backups or with eseutil /p, which is a lossy procedure, because the logs got corrupted too.

At this point I really want to isolate the problem and find out which device is to blame. This Exchange Server runs inside a VM on vSphere 6.0. The VMDK is stored on a LUN exported over iSCSI from a Dell PowerVault MD3820i.

Given the nature of the error, it appears to be a problem with the storage subsystem, but how can we investigate this? On the previous occasions the folks at Dell said that everything was fine with the storage, but I don't know whether the diagnostics they ran are trustworthy enough.

Thanks in advance,

EDIT: There is no antivirus software installed on the server. The host running VMware vSphere 6.0 is a Dell PowerEdge R730, certified by Dell to run vSphere. There are no errors in the VMware logs, or at least I wasn't able to find any.

Storage communication is done over iSCSI using two Cat6 cables in multipath mode with dual controllers on the PowerVault MD3820i, so it's a pretty standard configuration that is known to work, and again, it was certified by Dell.

I know that being certified by Dell doesn't mean it's good. But they sold the hardware and recommended their best practices, and we followed all of them.

EDIT II: The PowerVault storage appliance is running the latest firmware from Dell. Version 08_20_09_60, which is one release older than the latest, addressed one particular issue that can lead to data corruption: "Addressed a rare condition which has the potential of causing a processor fault that could result in a data integrity issue."

As for the network cards, we're using a dual Broadcom NetXtreme II BCM57810 10GbE. The card does not support TCP offload engine (TOE) or iSCSI offloading, so this should not be the issue.

VMware is running with the recommended drivers for the local SAS controllers too: the megaraid_sas driver instead of the default tg3 bundled with VMware. I don't think this could be the issue, since the VMs are on iSCSI storage and not on local storage.

Best Answer

As it says in the event log error description, this will almost certainly be a fault with the system hardware, which can be a rather nebulous concept when talking about virtual guests.

I would be looking very hard at the storage subsystem. Given my recent experiences with virtual clusters built on Dell servers, I would suspect an issue with either network card firmware or storage system firmware, in that order.

Having had a cup of tea and a think, I've looked again at your error: you're getting a -1019 error (JET_errPageNotInitialized). This specifically says that the Exchange server went to read some data in the database that it 'knew' had been written but was unable to find it. Have you read https://support.microsoft.com/en-gb/kb/314917? The errors are discussed there in some detail.
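To make that failure mode concrete, here is a toy Python sketch. It is not a parser for real .edb files, and it is not how ESE detects the problem internally. The `expected_pages` argument is a hypothetical stand-in for the pages the database engine believes it has written, and the 32 KB page size is an assumption based on what Exchange 2010 and later use. The function simply reports which of those pages come back from disk as all zeros (or not at all).

```python
PAGE_SIZE = 32 * 1024  # assumption: 32 KB pages, as in Exchange 2010 and later


def find_blank_pages(path, expected_pages, page_size=PAGE_SIZE):
    """Return the page numbers in expected_pages that read back blank.

    A page counts as blank if it is all zero bytes or lies past the end
    of the file. expected_pages is a hypothetical list of pages that the
    caller believes were written earlier.
    """
    blank = []
    with open(path, "rb") as f:
        for page_no in sorted(expected_pages):
            f.seek(page_no * page_size)
            data = f.read(page_size)
            # A short read means the page was never persisted at all.
            if len(data) < page_size or not any(data):
                blank.append(page_no)
    return blank
```

The point of the sketch is that a blank page is perfectly valid on its own; it only becomes an error once something else says the page should hold data, which is why this kind of damage can sit unnoticed until the page is actually read.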

This can only be disk corruption of some kind and the root cause for that is very likely to be an issue with the storage system, especially considering that you mention this has happened before.
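If you want some independent evidence about the storage path before going back to Dell, one option is a simple write-then-verify probe run from a test VM on the same datastore. The following is a minimal Python sketch of that idea, not a vendor diagnostic: the function names, the seed, and the block size are all my own choices for illustration. It writes deterministic pseudo-random blocks, records a SHA-256 digest for each, and later re-reads the file to list any blocks that no longer match.

```python
import hashlib
import os


def make_block(seed, index, block_size):
    """Deterministic pseudo-random block derived from seed and block index."""
    return hashlib.shake_256(seed + index.to_bytes(8, "big")).digest(block_size)


def write_pattern(path, blocks, block_size=1024 * 1024, seed=b"probe"):
    """Write `blocks` blocks to path and return their SHA-256 digests."""
    digests = []
    with open(path, "wb") as f:
        for i in range(blocks):
            block = make_block(seed, i, block_size)
            f.write(block)
            digests.append(hashlib.sha256(block).hexdigest())
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the data down to the device
    return digests


def verify_pattern(path, digests, block_size=1024 * 1024):
    """Re-read path and return the indexes of blocks whose digest changed."""
    bad = []
    with open(path, "rb") as f:
        for i, expected in enumerate(digests):
            block = f.read(block_size)
            if hashlib.sha256(block).hexdigest() != expected:
                bad.append(i)
    return bad
```

Bear in mind that verifying immediately after writing will most likely be served from the guest's cache rather than from the array, so the read-back only tells you something if it happens later (for example after a reboot) or on a file much larger than the VM's memory. A clean result does not prove the storage is healthy, but a single mismatched block is hard for a vendor to argue with.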

My other worry at this point is that -1019 errors can be rather insidious: the error could be the end result of a write that went wrong some time ago and went undetected because the data wasn't read until now. Restoring yesterday's backup won't help if the corruption occurred last week, for example.

At this point, I'd certainly be contacting Dell and also, maybe, Microsoft.