Domain controller failover fails unidirectionally

active-directory domain-controller windows-server-2012-r2 windows-server-2019

The problem I am having is that if I take one of my two writable domain controllers offline, nobody seems to "fail over" to the other domain controller like they're supposed to. Applications we run within our network that use AD for authentication just keep asking for a username and password and never actually authenticate you, and external users who rely on a read-only DC on another network segment can't authenticate to our remote access website either.

I currently have three domain controllers in my domain: DC1, DC2, and RO1. DC1 and RO1 are Server 2019; DC2 is Server 2012 R2. Both of the writable DCs are AD-integrated DNS servers, with their network adapters configured to point at each other for DNS.

DC1 and DC2 are on the same subnet. RO1 is a read-only domain controller in a different network segment, there to support a remote access solution managed by the organization above me (which also manages the general network I connect to).

In the past, if I took either of the local DCs offline, local users would fail over to whichever one was still running (as expected), and so would remote users, since the RODC would contact whichever writable DC was active in order to authenticate them.

The current DC1 is a relatively new addition, replacing one called DC. DC1 was brought online and joined with DC and DC2, and everything seemed fine. I transferred all the FSMO roles that DC had over to its replacement, DC1; netdom query fsmo shows all roles as being on the new DC1. We then demoted DC and took it offline to retire it, since it was a Server 2012 machine and we're migrating away from those. We cleaned up a few errant DNS records that claimed the old DC was still around, but other than that everything chugged along as it had. During the last patch cycle, though, we had DC2 offline while DC1 and RO1 remained active, and discovered the authentication issues described above. External users could not authenticate at all, and users who were already logged in found our AD-authenticating applications suddenly asking them to log in again (to no avail).

Unfortunately, I'm not sure why this is. DC1, the new controller, is definitely recognized by the domain. Replication works fine: repadmin /showrepl succeeds, and /replsum reports no errors. All involved internal machines can resolve each other's host names and ping each other. If I ping the domain, I can get either writable DC, and the same goes if I tracert to the domain. I can make edits on DC1 and see them on DC2, and vice versa (and changes like group policy made on DC1 specifically definitely exist out in the greater network). I can take the RODC and tell it to load records from DC1 and DC2 without issue.
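The "ping the domain" check above can be made a little more explicit by listing every address the domain name resolves to and comparing that against the DCs that should be there. The following is only a minimal Python sketch; the domain name and IP addresses in the usage comment are hypothetical placeholders, not values from this environment, and resolve_ipv4 and compare_records are illustrative helper names.

```python
import socket


def resolve_ipv4(name):
    """Return the sorted IPv4 addresses that `name` resolves to ([] on failure)."""
    try:
        infos = socket.getaddrinfo(name, None, socket.AF_INET, socket.SOCK_STREAM)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, (address, port)).
    return sorted({info[4][0] for info in infos})


def compare_records(resolved, expected):
    """Compare resolved addresses against the addresses of the DCs we expect.

    "missing": expected DC addresses that the name did not resolve to.
    "stale":   resolved addresses that belong to no expected DC
               (for example, a leftover record for a retired DC).
    """
    resolved_set, expected_set = set(resolved), set(expected)
    return {
        "missing": sorted(expected_set - resolved_set),
        "stale": sorted(resolved_set - expected_set),
    }


# Hypothetical usage (placeholder domain name and addresses):
#   found = resolve_ipv4("corp.example.internal")
#   print(compare_records(found, ["10.0.0.11", "10.0.0.12"]))
```

This only looks at the A records for the domain name as seen from the machine it runs on. It says nothing about SRV records or about whether the DCs behind those addresses are actually reachable.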

If I take DC2 offline, however, that's when things go sideways. Ping or Tracert to our domain fails, external users get denied access, and internal users see our AD-authenticated applications fail and constantly call for a username and password. The opposite does not happen, however – if I take the new DC1 offline, local users sometimes have a slight chugging delay as if their machine was trying to contact DC1 before failing over to DC2 and authenticating successfully, and external users come in just fine.

There's nothing super obvious in the Event Logs, and everything I can think of appears correctly configured. I'm not sure where to progress from here – has anyone had similar symptoms that they've been able to correct?

Best Answer

The problem ended up being related to firewall settings managed exclusively by the organization that manages the network we connect to. Some inbound/outbound rules had not been applied correctly, so hosts could not fail over to the new domain controller when the older one went offline.
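A quick way to spot this kind of problem from a client or from the RODC's network segment is to test TCP reachability of each DC on the ports that AD clients commonly use. The sketch below is an illustration only: the port list is the usual well-known set (the answer does not say which rules were actually missing), the host names in the usage comment are hypothetical, and port_open and blocked_ports are illustrative helper names.

```python
import socket

# Well-known TCP ports commonly used by Active Directory clients.
# Assumption: this is a general list, not the specific rules from this case.
AD_TCP_PORTS = {
    53: "DNS",
    88: "Kerberos",
    135: "RPC endpoint mapper",
    389: "LDAP",
    445: "SMB",
    464: "Kerberos password change",
    636: "LDAPS",
    3268: "Global Catalog",
}


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def blocked_ports(host, ports=AD_TCP_PORTS, timeout=2.0):
    """Return the ports (ascending) on `host` that could not be reached."""
    return [port for port in sorted(ports) if not port_open(host, port, timeout)]


# Hypothetical usage (placeholder host names), run from the affected segment:
#   for dc in ("dc1.corp.example.internal", "dc2.corp.example.internal"):
#       print(dc, blocked_ports(dc))
```

If the new DC shows blocked ports that the older DC does not, that points at firewall rules rather than at AD itself. Note that this checks TCP only and only in one direction; UDP traffic and the dynamic RPC port range would need to be checked separately.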