Windows Workflow: Why do workflows get stuck in persistence?

workflow-foundation

I have a Windows Workflow instance that's using SQL persistence, being hosted in the web runtime, since the workflows are started by ASP.NET form submissions. It runs great most of the time, but I've noticed instances where I have to kick things:

I notice the nextTimer has gone way overdue, even by hours. Sometimes the ownerID and ownedUntil fields are null in the persistence database, sometimes not. The "unlocked" and "blocked" fields are always both "1".

…and then the workflow runtime doesn't pick it back up until I null out the "owner" fields if they're populated and kick the application pool with a recycle, and things go along just fine after that for the most part. There are no errors (I have try/catch blocks around everything and write out anything caught into a trace file), so that's not it.

The delay activities causing the persistence are all set to one minute, and the ownership duration for the runtime is 60 seconds as well. The code that it gets stuck on should always take less than a minute.

As I write this, I'm curious whether recycles of the app pool/app domain are causing it: when the workflow tries to call some method in the runtime, the runtime is busy spinning up the app domain/pool, and the call might run past the 60-second ownership duration. Does that sound remotely plausible, and would it cause the workflow not to rehydrate properly?

Barring that sidetrack, what could cause this behavior I'm seeing? I don't want to babysit the runtime every day by unsticking stuck workflows.

Best Answer

It's quite likely that the app domain recycling is a large part of your problem. IIS will recycle an AppDomain as soon as the last request is finished. It does not, however, see code running on another thread as part of that request. That is one of the main reasons for using the ManualWorkflowSchedulerService when hosting in IIS. But when you use the active timers option, it still uses a background thread to execute workflow activities.

Also make sure you unload workflows as soon as they go idle. The easiest way of doing so is using the UnloadOnIdle setting on the SqlWorkflowPersistenceService.
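As a rough sketch of what that setup might look like in code, assuming the WF 3.x runtime (System.Workflow.Runtime) and a host that builds the runtime once per app domain: the `WorkflowHost` class and `CreateRuntime` method are hypothetical names used only for illustration, the 60-second ownership duration is taken from the question, and the 5-second loading interval is an assumed value, not a recommendation.

```csharp
using System;
using System.Workflow.Runtime;
using System.Workflow.Runtime.Hosting;

// Hypothetical helper class; not part of the original question or answer.
public static class WorkflowHost
{
    // Builds and starts a runtime configured for hosting in IIS.
    // The connection string is supplied by the caller.
    public static WorkflowRuntime CreateRuntime(string connectionString)
    {
        WorkflowRuntime runtime = new WorkflowRuntime();

        // Run workflows on the calling (request) thread. Passing 'true'
        // turns on active timers, so expired delays are resumed from a
        // background thread without waiting for a new request.
        runtime.AddService(new ManualWorkflowSchedulerService(true));

        // Persist to SQL Server and unload instances as soon as they go idle.
        runtime.AddService(new SqlWorkflowPersistenceService(
            connectionString,
            true,                       // unloadOnIdle
            TimeSpan.FromSeconds(60),   // instanceOwnershipDuration (from the question)
            TimeSpan.FromSeconds(5)));  // loadingInterval (assumed value)

        runtime.StartRuntime();
        return runtime;
    }
}
```

The same settings can also be supplied through the runtime's configuration section instead of code; which route fits better depends on how the host already creates its runtime.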

The persistence service checks for workflows with an expired ownership, but only at startup time. So most likely restarting the IIS worker process will also restart the old workflows without any extra work. But since recycling is itself a cause of new problems, just clearing out the old ownership should also do the trick. In that case the persistence service should simply reload the workflows the next time it checks. The only tricky part is knowing which runtime ID is old and which isn't (the property holding the value is not public).

Another thing to make sure of is that the IIS worker process is reloaded. If this isn't done there is no WF runtime so it cannot check for expired timers. It sounds like you have this covered but just in case.
