Domain controller offline over 2 months, now can’t sync

active-directory

Short Version

A domain controller was set up, then taken offline for longer than the tombstone lifetime. Now I can't get it to replicate again.

Relevant Error Messages

On dc2 (identical error messages exist about both exchange and dc1):

The kerberos client received a KRB_AP_ERR_MODIFIED error from the server host/exchange.mydomain.local. The target name used was exchange$@MDOMAIN.LOCAL. This indicates that the password used to encrypt the kerberos service ticket is different than that on the target server. Commonly, this is due to identically named machine accounts in the target realm (MYDOMAIN.LOCAL), and the client realm. Please contact your system administrator.

Another relevant error (Event ID 2042):

The Knowledge Consistency Checker (KCC) has detected that successive attempts to replicate with the following domain controller has consistently failed.
Attempts:
12
Domain controller:
CN=NTDS Settings,CN=DC1,CN=Servers,CN=MainSite,CN=Sites,CN=Configuration,DC=mydomain,DC=local
Period of time (minutes):
105103
The Connection object for this domain controller will be ignored, and a new temporary connection will be established to ensure that replication continues. Once replication with this domain controller resumes, the temporary connection will be removed.
Additional Data
Error value:
2148074274 The target principal name is incorrect.

And Event ID 1925:
The attempt to establish a replication link for the following writable directory partition failed.

Other Details

Both sites are connected through a VPN. At the main site, I have two domain controllers (which we shall call exchange and dc1). Both are Server 2003. If it matters, dc1 holds all the FSMO roles.

In preparation for setting up a remote site, I set up a domain controller called dc2, running Server 2003 R2, configured separate sites in AD Sites and Services, and configured replication from dc1 to dc2. I even had it on the correct subnet for the remote site by connecting it through a router (this was before the site was connected to the VPN, so there were no IP conflicts).

Everything was working great, so I shut it down and got it ready to take to the remote site. But the move kept getting delayed for over 2 months, and now dc2 won't replicate properly.
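
As a sanity check on the timeline, the "Period of time (minutes)" value from Event ID 2042 above can be converted to days and compared against the tombstone lifetime. This is only a sketch: the 60-day figure is an assumption (a common default for forests of this era), not something confirmed for this domain. The real value lives in the tombstoneLifetime attribute of the Directory Service object, and the function names here are made up for illustration.

```python
MINUTES_PER_DAY = 24 * 60


def minutes_to_days(minutes: int) -> float:
    """Convert a duration in minutes to (fractional) days."""
    return minutes / MINUTES_PER_DAY


def exceeds_tombstone(minutes: int, tombstone_days: int) -> bool:
    """True if the replication gap is longer than the tombstone lifetime."""
    return minutes_to_days(minutes) > tombstone_days


# Value copied from the Event ID 2042 text quoted in the question.
failure_minutes = 105103

# ASSUMPTION: 60 days. Check tombstoneLifetime in your own forest.
assumed_tombstone_days = 60

days = minutes_to_days(failure_minutes)
print(f"{days:.1f} days without replication")
print("Past assumed tombstone lifetime:",
      exceeds_tombstone(failure_minutes, assumed_tombstone_days))
```

That works out to about 73 days, which matches "over 2 months" and is past a 60-day lifetime, though it would still be inside a 180-day one.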

What I've tried

Removing the domain controller role – fails with:
Managing the network session with DC1.mydomain.com failed "Logon Failure: The target account name is incorrect."


Resetting the machine password with these steps, in order:

  • Disable and stop the KDC service
  • klist /purge
  • netdom resetpwd /s:dc1 /ud:domainadmin /pd:domainadminpassword
  • Reboot
  • Re-enable the KDC service


Most of the KB articles I went through about fixing replication after exceeding the tombstone lifetime dead-ended at the "The target principal name is incorrect" error.

Best Answer

It seems the easiest way is indeed to remove Active Directory and reinstall it, and this can be done without wiping the entire server, so anything else on the server is left untouched. However, since you can't remove Active Directory properly, you have to force its removal from the server and then clean up manually on a good domain controller.

  • Disconnect the problem server from the network to prevent any of this from potentially breaking Active Directory on the good servers.

  • On the problem server, run dcpromo /forceremoval. This removes Active Directory from that system without removing its records from the other domain controllers.

  • Use ntdsutil from a good domain controller to remove the problem server from Active Directory. Instructions are in the help link shown when you run dcpromo /forceremoval, or here: http://technet.microsoft.com/en-us/library/cc736378%28WS.10%29.aspx

  • Delete the server object in AD Sites and Services

  • Delete the server in AD Users and Computers if it still exists

  • Delete the server from DNS:

    • Remove the NS entry in reverse lookup zones
    • Remove the A entry in forward lookup zones
    • Remove the CNAME entry under _msdcs in the domain's forward lookup zone
    • Remove the numerous SRV records under _msdcs, _sites, _tcp and _udp referring to the problem server
  • Repromote the problem server and configure site settings like you would a brand new DC.
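
The DNS cleanup above is the step where stale entries are easiest to miss, so before repromoting it may help to double-check it. As a rough aid, the sketch below scans a plain-text export of a zone for lines that still mention the removed server. It assumes you already have such an export as a list of text lines. The sample records, the IP addresses, the name dc2.mydomain.local, and the helper find_stale_records are all illustrative, not taken from a real zone.

```python
def find_stale_records(zone_lines, server_fqdn):
    """Return (line_number, text) pairs for lines that mention the removed server.

    A line matches when one of its whitespace-separated fields equals the
    server's FQDN or its short host name (case-insensitive, trailing dot ignored).
    """
    fqdn = server_fqdn.rstrip(".").lower()
    short_name = fqdn.split(".")[0]
    hits = []
    for number, line in enumerate(zone_lines, start=1):
        text = line.strip()
        # Skip blank lines and zone-file comments.
        if not text or text.startswith(";"):
            continue
        fields = [field.rstrip(".").lower() for field in text.split()]
        if fqdn in fields or short_name in fields:
            hits.append((number, text))
    return hits


# HYPOTHETICAL excerpt for illustration only; not a real export.
sample_zone = [
    "; hypothetical excerpt, not a real export",
    "@                 NS    dc1.mydomain.local.",
    "@                 NS    dc2.mydomain.local.",
    "dc1               A     10.0.0.10",
    "dc2               A     10.0.1.10",
    "_ldap._tcp        SRV   0 100 389 dc1.mydomain.local.",
    "_ldap._tcp        SRV   0 100 389 dc2.mydomain.local.",
    "_kerberos._udp    SRV   0 100 88 dc2.mydomain.local.",
]

for number, text in find_stale_records(sample_zone, "dc2.mydomain.local"):
    print(f"line {number}: {text}")
```

Anything this reports is only a candidate for manual review in the DNS console. The script deletes nothing, and it will not catch records that identify the server by IP address alone.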