Intermittent logon failures or lockout failures under high impersonation load

active-directory cifs isilon windows-server-2003

Recently we've been seeing an interesting slew of failures on our cluster: users' jobs intermittently fail with logon errors, account-lockout errors, or file permission errors.

Our cluster is loosely coupled and coarse-grained, built around 40 16-way Windows 2003 machines. They participate in a corporate domain, with domain controllers both locally and across the WAN. Job submission is handled through a third-party application (ActiveBatch), and file storage is split between a SAN exported by a Windows 2003 server and a newer CIFS share on an Isilon cluster.

Jobs are directed acyclic graphs consisting of 1 to 5,000 processes, scheduled on a head node through ActiveBatch. Most jobs are tiny batch files or Perl scripts that perform environment setup for computational codes written in FORTRAN. Input and output files for these jobs are stored on either the SAN or the Isilon.

What we have been seeing are intermittent authentication failures, which we originally believed to be isolated to the Isilon. In the typical failure mode, 100-200 jobs begin execution, each referencing common configuration data in a file. The majority succeed; however, multiple jobs on multiple machines fail on the client side with what surfaces as a file permission error (either 0x775 'The referenced account is currently locked out…' or 0x52E 'unknown user name or bad password').
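For reference, the two hex values are ordinary Win32 error codes written in hexadecimal. A minimal Python sketch that decodes them follows; the lookup table is hand-written for illustration and covers only the two codes mentioned here, and the helper name `describe` is hypothetical.

```python
# Win32 error codes seen in the failures described above, keyed by decimal value.
# Hand-written table for illustration only; it is not a complete list.
WIN32_ERRORS = {
    1326: "ERROR_LOGON_FAILURE",
    1909: "ERROR_ACCOUNT_LOCKED_OUT",
}


def describe(code: int) -> str:
    """Return 'hex = decimal (NAME)' for a Win32 error code."""
    name = WIN32_ERRORS.get(code, "not in this table")
    return f"0x{code:X} = {code} ({name})"


if __name__ == "__main__":
    for code in (0x775, 0x52E):
        print(describe(code))
```

Both are logon errors rather than file ACL errors, which is consistent with the failure happening during authentication and not during the permission check on the file itself.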

Checking the event logs for these time periods reports 0 Security Audit Failures, but multiple Security Audit Successes for the same user! The only event log entry in close proximity is a 6013 event letting us know, "The system uptime is 2199088 seconds."

Recently we've also been seeing the same errors when the job scheduling software attempts to create jobs on the remote machines. ActiveBatch sends the job details to a service running on the machine, which then attempts to impersonate the user when it creates the job. As with the file permission failures, we're seeing both account lockouts and unknown user/password errors when the user's account is neither locked out nor unknown (in fact, processes on the same machine succeeded shortly after these failed attempts).

I'm neither familiar enough with domain controllers nor given sufficient access to tell whether this is a client-side or a server-side problem. The lack of client-side failure entries in the event log leads me to believe the cause is perhaps a DC timeout or a network issue. However, a Wireshark capture of traffic between a random server and the DC did not reveal any gross inconsistencies beyond the occasional Kerberos Response Too Big messages.

Is this a common problem with domain controller setups where high authentication/impersonation load causes transient failures?

Best Answer

It isn't common, unless something is actually generating the failed logons that would result in the lockout.

Enabling Netlogon verbose logging may help track it down.

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Netlogon\Parameters]  
"DBFlag"=dword:24401F04  

The files created are %systemroot%\debug\netlogon.log and netlogon.bak.
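Once the log is being written, a short script can tally the failed logons per account and machine. The following is a rough Python sketch, not a definitive parser: it assumes log lines of the general shape `... logon of DOMAIN\user from MACHINE ... Returns 0x<status>`, which may differ between Windows versions, and the function name `count_failures` is hypothetical. The three NTSTATUS values in the table are the standard ones for the symptoms in the question.

```python
import re
from collections import Counter

# NTSTATUS values that correspond to the symptoms in the question.
STATUS_NAMES = {
    0xC000006A: "STATUS_WRONG_PASSWORD",
    0xC000006D: "STATUS_LOGON_FAILURE",
    0xC0000234: "STATUS_ACCOUNT_LOCKED_OUT",
}

# ASSUMED line shape: "... logon of DOMAIN\user from MACHINE ... Returns 0x..."
# Adjust the pattern if your netlogon.log lines look different.
LINE_RE = re.compile(
    r"logon of (?P<user>\S+) from (?P<machine>\S+).*?"
    r"Returns (?P<status>0x[0-9A-Fa-f]+)"
)


def count_failures(lines):
    """Count failed logons, keyed by (user, machine, status name)."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # not a logon result line
        status = int(match.group("status"), 16)
        if status == 0:
            continue  # 0x0 means the logon succeeded
        name = STATUS_NAMES.get(status, f"0x{status:08X}")
        counts[(match.group("user"), match.group("machine"), name)] += 1
    return counts
```

To use it, open the log with `open(path, errors="replace")`, pass the file object to `count_failures`, and print `counts.most_common()`. The machine names attached to the failures should show which host is submitting the bad credentials.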

These can roll over quickly in a high-volume environment, so you may need to increase the maximum file size, which is 20 MB by default. To increase it to 50 MB (the dword value is hexadecimal; 0x3200000 is 52,428,800 bytes):

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Netlogon\Parameters]
"MaximumLogFileSize"=dword:3200000  
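Because `dword:` values in a .reg file are hexadecimal, it is easy to enter a decimal byte count by mistake. A small Python sketch of the conversion, with a hypothetical helper name and assuming 1 MB means 1024 × 1024 bytes:

```python
def reg_dword_for_megabytes(megabytes: int) -> str:
    """Format a size in MB as a .reg-style hexadecimal dword value."""
    size_bytes = megabytes * 1024 * 1024
    if not 0 <= size_bytes <= 0xFFFFFFFF:
        raise ValueError("size does not fit in a 32-bit DWORD")
    return f"dword:{size_bytes:08x}"


if __name__ == "__main__":
    print(reg_dword_for_megabytes(50))
```

For 50 MB this gives the same value as the entry shown here, with a leading zero added by the eight-digit padding.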

Enabling debug logging for the Net Logon service
http://support.microsoft.com/kb/109626
