Netlogon – Domain Trust Secure Channel issues – Only on some DCs

domain-controller netlogon trust-relationship

We have a two-domain environment. We were having issues with slow connections, authentication failures, and hung resources, but only during OFF-PEAK hours when very few users were logged on.

The issue occurred when a user from DOMAIN A accessed a resource located in DOMAIN B using NTLM authentication. There were no issues with users from DOMAIN A accessing resources in DOMAIN A, or with users from DOMAIN B accessing resources in DOMAIN B.

We were able to track the problem down to the secure channels that are used for Netlogon traffic. When a resource server in DOMAIN B had a secure channel with one particular DC (I'll call it DC-B1), everything worked fine. We could follow the traffic chain from client(A)->resource(B)->DC-B1(B)->DC-A1(A) (for authentication) and then back again. However, if the resource server in B had a secure channel with any of the other DCs in DOMAIN B, the authentication would hang and never complete.

So it looks like, with the exception of DC-B1, every DC in DOMAIN B is having trouble creating a domain trust secure channel with DOMAIN A. To test, we ran nltest /sc_verify:DOMAINA from each DC in DOMAIN B.

When run from DC-B1, the response was instantaneous. When run from any other DC in DOMAIN B, it hung for about 40 seconds before reporting success (it never showed an error, it just took a long time).

Any ideas on why some DCs would struggle to establish and use the domain trust secure channel while another DC in the same domain never has an issue?

For what it's worth, the DC that works is Server 2008 and the ones that don't work are Server 2012 R2. However, the problem already existed on some domain controllers before they were migrated to 2012 R2; we just didn't pinpoint the issue until after we had finished migrating them.

Thanks for the help.

Edit: Additional Information…

Compared a weekend's worth of NetLogon.log files for each of the Domain Controllers…

Every

[LOGON] SamLogon: Transitive Network logon of DOMAINA\testuser Entered

record in the DC-B1 log (this is the good DC) had a corresponding

[LOGON] SamLogon: Transitive Network logon of DOMAINA\testuser Returns 0x0

However, on the other DCs in DOMAIN B, each return had one of the following three errors:

[LOGON] ... DOMAINA\testuser ... Returns 0xC0020017
[LOGON] ... DOMAINA\testuser ... Returns 0xC0020050
[LOGON] ... DOMAINA\testuser ... Returns 0xC000005E

And here is how often each of the different errors occurred:

77% of errors were: 0xC0020017 RPC SERVER UNAVAILABLE
21% of errors were: 0xC0020050 RPC CALL CANCELED
 1% of errors were: 0xC000005E NO LOGON SERVERS AVAILABLE
 0% of returns were: 0x0 (no error)
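
A tally like the one above can be produced with a short script. The following is only a sketch: it assumes that each line of interest contains "[LOGON]" and ends its status with "Returns 0x...", as in the excerpts shown here, and the function names are hypothetical. The status names in the table are the ones listed in this post.

```python
import re
from collections import Counter

# Matches the return code on [LOGON] lines such as
# "[LOGON] ... DOMAINA\testuser ... Returns 0xC0020017"
RETURN_CODE = re.compile(r"\[LOGON\].*\bReturns\s+0x([0-9A-Fa-f]+)")

# Status names as reported in this post.
STATUS_NAMES = {
    "0x0": "no error",
    "0xC0020017": "RPC SERVER UNAVAILABLE",
    "0xC0020050": "RPC CALL CANCELED",
    "0xC000005E": "NO LOGON SERVERS AVAILABLE",
}


def tally_return_codes(lines):
    """Count how often each return code appears on [LOGON] lines."""
    counts = Counter()
    for line in lines:
        match = RETURN_CODE.search(line)
        if match:
            # Normalize the hex digits to upper case so 0xc000005e and
            # 0xC000005E are counted together.
            counts["0x" + match.group(1).upper()] += 1
    return counts


def percentages(counts):
    """Convert raw counts into percentages of all returns."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {code: 100.0 * n / total for code, n in counts.items()}
```

To use it, pass an open log file (or any iterable of lines) to tally_return_codes and print each code alongside STATUS_NAMES.get(code, "unknown").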

We compared all the security settings between the DCs that do not work and the one that does, but couldn't find anything that would cause the RPC issues. Any suggestions on where we could look next? We are confused as to why the 2008 domain controller in "B" would have no trouble talking to 2012 DCs in "A", but the 2012 DCs in "B" cannot use pass-through authentication to "A".

Edit: Additional Requested Information…

Test run from DC-B2 & DC-B3 (same results)
(pass through authentication originating here does not work)

C:\>nltest /dsgetdc:DOMAINA.local
           DC: \\DC-A3.DOMAINA.local
      Address: \\555.555.555.127
     Dom Guid: 9f3a0668-c245-4493-be03-0f7edf534d27
     Dom Name: DOMAINA.local
  Forest Name: DOMAINA.local
 Dc Site Name: Company
Our Site Name: Company
        Flags: GC DS LDAP KDC TIMESERV WRITABLE DNS_DC DNS_DOMAIN DNS_FOREST CLOSE_SITE FULL_SECRET WS DS_8 DS_9
The command completed successfully

Edit: Additional Information…

Results from PortQry from Domain B -> Domain A (GC DC)

TCP port 135  (epmap service):      LISTENING
TCP port 389  (ldap service):       LISTENING
UDP port 389  (unknown service):    LISTENING or FILTERED
TCP port 636  (ldaps service):      LISTENING
TCP port 3268 (msft-gc service):    FILTERED
TCP port 3269 (msft-gc-ssl service):    FILTERED
TCP port 53   (domain service):     NOT LISTENING
UDP port 53   (domain service):     NOT LISTENING
TCP port 88   (kerberos service):   LISTENING
UDP port 88   (kerberos service):   LISTENING or FILTERED
TCP port 445  (microsoft-ds service):   LISTENING
UDP port 137  (netbios-ns service):     LISTENING or FILTERED
UDP port 138  (netbios-dgm service):    LISTENING or FILTERED
TCP port 139  (netbios-ssn service):    LISTENING
TCP port 42   (nameserver service):     FILTERED

Best Answer

After taking Greg's advice and focusing on the firewall we found the solution. At some point in the past, a firewall rule had changed and the dynamic port range (49152-65535) was being filtered. Once the network guys added the rule to allow dynamic ports from DOMAIN B to DOMAIN A the issue was completely resolved.
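
For anyone who wants to check this without PortQry, here is a rough sketch of a TCP probe that reports results in the same LISTENING / NOT LISTENING / FILTERED terms. The function names and the default timeout are my own assumptions. Note that on a dynamic port, "NOT LISTENING" is actually the reassuring result: it means the packet reached the host and was refused. A timeout ("FILTERED") across the whole range is what points at a firewall silently dropping traffic. A firewall that rejects instead of dropping would also show up as "NOT LISTENING", so treat the result as a hint, not proof.

```python
import socket

# Dynamic (ephemeral) port range mentioned in this answer.
DYNAMIC_PORTS = range(49152, 65536)


def probe_tcp(host, port, timeout=3.0):
    """Classify a TCP port roughly the way PortQry reports it."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "LISTENING"
    except ConnectionRefusedError:
        # The host answered with a reset: reachable, but nothing bound there.
        return "NOT LISTENING"
    except socket.timeout:
        # No answer at all: most likely dropped by a firewall on the path.
        return "FILTERED"
    except OSError:
        # Name resolution failure, unreachable network, and similar.
        return "ERROR"


def probe_many(host, ports, timeout=3.0):
    """Probe several ports and return {port: status}."""
    return {port: probe_tcp(host, port, timeout) for port in ports}
```

Running probe_many against a DC in DOMAIN A with a handful of ports taken from DYNAMIC_PORTS, from a DC in DOMAIN B, should show quickly whether the range is being dropped.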

For some reason, in Server 2008 this would only cause issues at the time the secure channel was being created. It would take 21 seconds (or some multiple of 21 seconds) to create the secure channel. After the secure channel was established, authentication worked fine. The 21-second delay makes sense given how a TCP connection attempt times out against a silently filtered port.

In Server 2012 R2, the behavior was different. Regardless of whether the secure channel had been established across domains, it would fail to authenticate and break the secure channel to go look for another available domain controller.
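
To spell out the 21-second figure: this is a sketch that assumes an initial SYN retransmission timeout of 3 seconds and two retransmissions with the timeout doubling each time. I believe those were the Windows defaults at the time, but treat both values as assumptions. The function name is made up for illustration.

```python
def connect_timeout(initial_rto=3.0, retransmissions=2):
    """Seconds until a connect to a silently dropped port gives up.

    Assumes the timeout doubles after every unanswered SYN
    (exponential back-off); both defaults are assumptions.
    """
    total = 0.0
    rto = initial_rto
    # The initial SYN and each retransmission wait one timeout period.
    for _ in range(retransmissions + 1):
        total += rto
        rto *= 2
    return total


# 3 + 6 + 12 seconds
print(connect_timeout())  # 21.0
```

Several failed attempts in a row would then give the "multiple of 21 seconds" delays, and two of them come close to the roughly 40-second hang seen with nltest /sc_verify.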

I'm not sure why this worked at all in Server 2008... maybe it was defaulting to another port somewhere when it failed to establish a connection in the ephemeral ports?

In any event we've learned a valuable lesson: "This (filtered ports) should be the first item to check if there are RPC connectivity issues" - Greg Askew