Python Workflow Design Pattern

python workflow

I'm working on a piece of software design, and I'm stuck between not having any idea what I'm doing, and feeling like I'm reinventing the wheel.

My situation is the following: I am designing a scientific utility with an interactive UI. User input should trigger visual feedback (duh), some of it directly, i.e. editing a domain geometry, and some of it as soon as possible, without blocking user interaction, say, solving some PDE over said domain.

If I draw out a diagram of all operations I need to perform, I get this rather awesomely dense graph, exposing all kinds of opportunities for parallelism and caching/reuse of partial results. So what I want is primarily to exploit this parallelism in a transparent way (selected subtasks executing in separate processes, results automatically being 'joined' by downstream tasks waiting for all their inputs to be ready), plus only needing to recompute those input branches that actually have their input changed.
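For illustration, the fork/join part of this can be sketched with the standard library's concurrent.futures. This is only a minimal sketch: the three task functions are hypothetical stand-ins, and a thread pool is used so the snippet runs anywhere. ProcessPoolExecutor offers the same submit()/result() interface for separate processes, provided the functions are picklable.

```python
from concurrent.futures import ThreadPoolExecutor

def mesh_domain(geometry):
    # Hypothetical stand-in for an expensive, independent subtask.
    return [p * 2 for p in geometry]

def boundary_conditions(geometry):
    # Second independent subtask; it can run at the same time as mesh_domain.
    return len(geometry)

def solve(mesh, bc):
    # Downstream "join" task: it needs both upstream results.
    return sum(mesh) + bc

geometry = [1, 2, 3]
with ThreadPoolExecutor() as pool:
    # Fork: both subtasks are submitted without waiting for each other.
    mesh_future = pool.submit(mesh_domain, geometry)
    bc_future = pool.submit(boundary_conditions, geometry)
    # Join: .result() blocks until the corresponding input is ready.
    solution = solve(mesh_future.result(), bc_future.result())

print(solution)
```

This covers only the joining of parallel branches; recomputing just the changed branches would need additional bookkeeping on top.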

pyutilib.workflow seems to come closest to being what I'm looking for, except of course that it isn't (doesn't seem to do any subprocessing to begin with). That seems rather disappointing; while I'm not a software engineer, I'd say I'm not asking for anything crazy here.

Another complicating factor is the tight user-interface integration I desire, which other scientific-workflow-solutions seem not designed to handle. For instance, I would like to pass a drag-and-drop event through a transformation node for further processing. The transformation node has two inputs; an affine transform state input port, and a pointset class that knows what to do with it. If the affine transform input port is 'dirty' (waiting for its dependencies to update), the event should be held up until it becomes available. But when the event has passed the node, the event input port should be marked as handled, so it does not refire when the affine transform changes due to further user input. That's just an example of one of the many issues that come up that I don't see being addressed anywhere. Or what to do when a long-running forking-joining branch receives new input while it is in the middle of crunching a previous input.

So my question: Do you happen to know of some good books/articles on workflow design patterns that I should read? Or am I trying to fit a square peg into a round hole, and you know of a completely different design pattern that I should know about? Or a python package that does what I want it to, regardless of the buzzwords it comes dressed up in?

I've rolled my own solution on top of enthought.traits, but I'm not perfectly happy with that either, as it feels like a rough and shoddy reinvention of the wheel. Except that I can't seem to find any wheels anywhere on the internet.

NOTE: I'm not looking for webframeworks, graphical workflow designers, or any special-purpose tools. Just something conceptually like pyutilib.workflow, but including documentation and a featureset that I can work with.

EDIT: this is where I'm at after more reading and reflection on the issue:

The requirements one can tack onto a 'workflow architecture' are too diverse for there to be a single shoe that fits all. Do you want tight integration with disk storage, tight integration with web frameworks, asynchronicity, mix in custom finite state machine logic for task dispatch? They are all valid requirements, and they are largely incompatible, or make for senseless mixes.

However, not all is lost. Looking for a generic workflow system to solve an arbitrary problem is like looking for a generic iterator to solve your custom iteration problem. Iterators are not primarily about reusability; you can't reuse your red-black-tree iterator to iterate over your tensor. Their strength lies in a clean separation of concerns, and definition of a uniform interface.

What I'm looking for (and have started writing myself; it's going to be pretty cool) will look like this: at its base is a general implementation-agnostic workflow-declaration mini-language, based on decorators and some meta-magic, to transform a statement like the below into a workflow declaration containing all required information:

@composite_task(inputs(x=Int), outputs(z=Float))
class mycompositetask:
    @task(inputs(x=Int), outputs(y=Float))
    def mytask1(x):
        return outputs( y = x*2 )
    @task(inputs(x=Int, y=Float), outputs(z=Float))
    def mytask2(x, y):
        return outputs( z = x+y )
    mytask1.y = mytask2.y   #redundant, but for illustration; inputs/outputs matching in name and metadata autoconnect

What the decorators return is a task/compositetask/workflow declaration class. Instead of just type constraints, other metadata required for the workflow-type at hand is easily added to the syntax.

Now this concise and pythonic declaration can be fed into a workflow instance factory that returns the actual workflow instance. This declaration language is fairly general and probably need not change much between different design requirements, but such a workflow instantiation factory is entirely up to your design requirements/imagination, aside from a common interface for delivering/retrieving input/output.

In its simplest incarnation, we'd have something like:

wf   = workflow_factory(mycompositetask)
wf.z = lambda result: print(result)   #register callback on z-output socket
wf.x = 1    #feed data into x input-socket

where wf is a trivial workflow instance, which does nothing but chain all contained function bodies together on the same thread, once all inputs are bound. A quite verbose way to chain two functions, but it illustrates the idea, and it already achieves the goal of separating the concern of keeping the definition of the flow of information in a central place rather than spread all throughout classes that would rather have nothing to do with it.
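To make the idea of such a trivial instance concrete, here is a minimal sketch of a same-thread workflow that fires once all of its inputs are bound. It is not the declaration language described above: the class name SequentialWorkflow and its feed/on_output methods are hypothetical, chosen only for this illustration, and the tasks are wired by hand rather than by decorators.

```python
class SequentialWorkflow:
    """Runs its tasks in order on the calling thread once all inputs are bound."""

    def __init__(self, input_names, tasks):
        self.input_names = set(input_names)
        self.tasks = tasks      # list of (function, argument names, output name)
        self.values = {}        # every input and intermediate result by name
        self.callbacks = {}     # output name -> callable

    def on_output(self, name, callback):
        # Register a callback on an output socket.
        self.callbacks[name] = callback

    def feed(self, name, value):
        # Bind an input; run the chain as soon as nothing is missing.
        self.values[name] = value
        if self.input_names.issubset(self.values):
            self._run()

    def _run(self):
        for func, arg_names, out_name in self.tasks:
            result = func(*(self.values[n] for n in arg_names))
            self.values[out_name] = result
            if out_name in self.callbacks:
                self.callbacks[out_name](result)


results = []
wf = SequentialWorkflow(
    input_names=["x"],
    tasks=[
        (lambda x: x * 2, ["x"], "y"),          # plays the role of mytask1
        (lambda x, y: x + y, ["x", "y"], "z"),  # plays the role of mytask2
    ],
)
wf.on_output("z", results.append)
wf.feed("x", 1)
print(results)
```

The point of the sketch is only that the flow of information lives in one place (the tasks list), while the function bodies know nothing about it.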

That's more or less the functionality I've got implemented so far, but it means I can go on working on my project, and in due time I'll add support for fancier workflow instance factories. For instance, I'm thinking of analyzing the graph of dependencies to identify forks and joins, and tracking the activity generated by each input supplied on the workflow-instance level, for elegant load balancing and cancellation of the effects of specific inputs that have lost their relevance but are still hogging resources.

Either way, I think the project of separating workflow declaration, interface definition, and implementation of instantiation is a worthwhile effort. Once I have a few nontrivial types of workflow instances working well (I need at least two for the project I'm working on, I've realized*), I hope to find the time to publish this as a public project, because despite the diversity of design requirements in workflow systems, having this groundwork covered makes implementing your own specific requirements a lot simpler. And instead of a single bloated workflow framework, a swiss army knife of easily switched-out custom solutions could grow around such a core.

*realizing that I need to split my code over two different workflow instance types rather than trying to bash all my design requirements into one solution, turned the square peg and round hole I had in my mind into two perfectly complementary holes and pegs.

Best Answer

I believe that you are both right and wrong to worry about reinventing the wheel. Thinking about the problem at different levels may give you a hint here.

How to eat an elephant?

Level A: software design

At that level, you would want to stick to the best practice that no long operations are done in the UI (and UI thread). You need a UI layer that focuses only on gathering input (including cancellation) and drawing (including in-progress visualization like a progress bar or hourglass). This layer should be separated from everything else as clearly as dusk from dawn. Any call outside of this layer must be fast if you want intuitiveness and responsiveness.

In tasks as complex as yours, the calls outside of the UI layer are typically:

Schedule some work - the command should be queued to the smart layer for it to pick up whenever it gets to it.
Read results - the results should be queued in the smart layer so they could just be "popped out" and rendered.
Cancel/stop/exit - just raise a flag. Smart layer should check this flag now and then.
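
The three kinds of calls above can be sketched with nothing more than the standard library's queue and threading modules. This is a minimal sketch under assumptions: the names commands, results and stop_requested are hypothetical, and squaring a number stands in for the long computation.

```python
import queue
import threading

commands = queue.Queue()             # UI -> smart layer: scheduled work
results = queue.Queue()              # smart layer -> UI: finished results
stop_requested = threading.Event()   # flag the UI raises to cancel/stop/exit

def worker():
    # The smart layer checks the flag "now and then" between units of work.
    while not stop_requested.is_set():
        try:
            n = commands.get(timeout=0.1)
        except queue.Empty:
            continue
        results.put(n * n)           # stand-in for a long computation

thread = threading.Thread(target=worker, daemon=True)
thread.start()

commands.put(7)                      # "schedule some work": returns immediately
answer = results.get(timeout=5)      # a real UI would poll with get_nowait()
stop_requested.set()                 # "cancel/stop/exit": just raise the flag
thread.join(timeout=5)
print(answer)
```

Every call the UI makes here (put, get_nowait, set) returns quickly, which is what keeps the UI thread responsive.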

Don't worry too much if some user operations get a slow reaction at first - if you have a solid design core then you can adjust priorities of the user input later on. Or add a short-term hourglass or similar. Or even cancel all long operations that become obsolete after a specific user input.

Level B: the heavy-lifting smart layer

There is no "best" framework for "any" kind of hard work.

So I'd suggest designing the way the UI feeds this layer to be as simple as possible, with no frameworks involved.

Internally, you can implement it using some frameworks but you will have the ability in future to redesign the hard-working elements as needed. For example, in future you could:

give some math to GPU
share tasks to server-farms
involve cloud computing

For complex tasks, picking a framework at the top level of design might prove to be an obstacle in the future. Specifically, it may limit your freedom to apply other technologies.

It's hard to tell for sure but it seems to me that you don't have a silver bullet framework for your task. So you should find strong tools (e.g. threads and queues) to implement good design practices (e.g. decoupling).

EDIT as a response to your edit

Your latest edit perfectly illustrates the hard challenges that a software designer meets. In your case, it is the acceptance that there is no silver bullet. I'd suggest accepting it sooner rather than later...

The hint is that you chose to define the most generic task in terms of Ints and Floats. This could make you happy today but it will fail tomorrow, exactly like locking in to a super-abstract framework.

The path is right - to have a heavy-lifting "task" base in your design. But it should not define Int or Float. Focus on the above mentioned "start", "read" and "stop" instead. If you don't see the size of the elephant that you are eating then you might fail eating it and end up starving :)

From Level A - design perspective - you could define task to contain something like this:

class AnySuperPowerfulTask:
    def run():
        scheduleForAThreadToRunMe()
    def cancel():
        doTheSmartCancellationSoNobodyWouldCrash()

This gives you the basis - neutral, yet clean and decoupled from the Level A (design) perspective.

However, you would need some kind of setting up the task and getting the real result, right? Sure, that would fall into Level B of thinking. It would be specific to a task (or to a group of tasks implemented as an intermediate base). Final task could be something along these lines:

class CalculatePossibilitiesToSaveAllThePandas(AnySuperPowerfulTask):
    def __init__(someInt, someFloat, anotherURL, configPath):
        anythingSpecificToThisKindOfTasks()
    def getResults():
        return partiallyCalculated4DimensionalEvolutionGraphOfPandasInOptimisticEnvironment()

(The samples are intentionally not valid Python, in order to focus on the design rather than the syntax.)
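
For readers who prefer something that actually runs, the same split can be sketched as follows. This is one possible reading of the design, not a definitive implementation: the names Task, SumOfSquares, wait and get_results are hypothetical, a plain thread is assumed as the scheduling mechanism, and cancellation is cooperative (the worker checks a flag between steps).

```python
import threading

class Task:
    """Level A: every task can be started and cancelled, nothing more."""

    def __init__(self):
        self._cancelled = threading.Event()
        self._thread = threading.Thread(target=self._work, daemon=True)

    def run(self):
        self._thread.start()       # returns immediately; work happens elsewhere

    def cancel(self):
        self._cancelled.set()      # the worker checks this flag between steps

    def wait(self, timeout=None):
        self._thread.join(timeout)

    def _work(self):
        raise NotImplementedError


class SumOfSquares(Task):
    """Level B: setup and results are specific to this kind of task."""

    def __init__(self, numbers):
        super().__init__()
        self._numbers = numbers
        self._partial = 0

    def _work(self):
        for n in self._numbers:
            if self._cancelled.is_set():
                return             # stop early, keeping the partial result
            self._partial += n * n

    def get_results(self):
        return self._partial


task = SumOfSquares([1, 2, 3])
task.run()
task.wait(timeout=5)
total = task.get_results()
print(total)
```

The base class knows nothing about integers, floats or pandas; only the subclass does.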

Level C - abstraction-nirvana

It looks like this level should be mentioned in this post.

Yes, there is such a pitfall, as many good designers can confirm: the state where you could endlessly (and without any results) search for a "generic solution" (i.e. the silver bullet). I suggest you take a peek into this and then get out fast before it's too late ;) Falling into this pitfall is no shame - it's a normal development stage of the greatest designers. At least I'm trying to believe so :)

EDIT 2

You said: "I'm working on a piece of software design, and I'm stuck between not having any idea what I'm doing, and feeling like I'm reinventing the wheel."

Any software designer can get stuck. Maybe next level of thinking might help you out. Here it comes:

Level D - I'm stuck

Suggestion: leave the building. Walk into the cafe around the corner, order the best coffee and sit down. Ask yourself the question "What do I need?". Note that this is different from the question "What do I want?". Keep asking it until you have eliminated the wrong answers and start seeing the correct ones:

Wrong answers:

I need a framework that would do X, Y and Z.
I need a screwdriver that could run 200mph and harvest forest in a farm nearby.
I need an amazing internal structure that my user will actually never see.

Right answers (forgive me if I understood your problem wrong):

I need the user to be able to give input to the software.
I need the user to see that the calculations are in progress.
I need the user to visually see the result of calculations.

Explanation

Say you have two dictionaries and you want to merge them into a new dictionary without altering the original dictionaries:

x = {'a': 1, 'b': 2}
y = {'b': 3, 'c': 4}

The desired result is to get a new dictionary (z) with the values merged, and the second dictionary's values overwriting those from the first.

>>> z
{'a': 1, 'b': 3, 'c': 4}

A new syntax for this, proposed in PEP 448 and available as of Python 3.5, is

z = {**x, **y}

And it is indeed a single expression.

Note that we can merge in with literal notation as well:

z = {**x, 'foo': 1, 'bar': 2, **y}

and now:

>>> z
{'a': 1, 'b': 3, 'foo': 1, 'bar': 2, 'c': 4}

It is now showing as implemented in the release schedule for 3.5, PEP 478, and it has now made its way into the What's New in Python 3.5 document.

However, since many organizations are still on Python 2, you may wish to do this in a backward-compatible way. The classically Pythonic way, available in Python 2 and Python 3.0-3.4, is to do this as a two-step process:

z = x.copy()
z.update(y) # which returns None since it mutates z

In both approaches, y will come second and its values will replace x's values, thus b will point to 3 in our final result.
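
As a quick check, the two approaches can be compared side by side. This small sketch reuses the x and y from above and also confirms that neither original dictionary is altered.

```python
x = {'a': 1, 'b': 2}
y = {'b': 3, 'c': 4}

# Single expression, Python 3.5+.
unpacked = {**x, **y}

# Two-step process, works on older versions too.
copied = x.copy()
copied.update(y)

print(unpacked == copied)  # both approaches build the same dictionary
print(x, y)                # the originals are untouched
```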

Not yet on Python 3.5, but want a single expression

If you are not yet on Python 3.5 or need to write backward-compatible code, and you want this in a single expression, the most performant correct approach is to put it in a function:

def merge_two_dicts(x, y):
    """Given two dictionaries, merge them into a new dict as a shallow copy."""
    z = x.copy()
    z.update(y)
    return z

and then you have a single expression:

z = merge_two_dicts(x, y)

You can also make a function to merge an arbitrary number of dictionaries, from zero to a very large number:

def merge_dicts(*dict_args):
    """
    Given any number of dictionaries, shallow copy and merge into a new dict,
    precedence goes to key-value pairs in latter dictionaries.
    """
    result = {}
    for dictionary in dict_args:
        result.update(dictionary)
    return result

This function will work in Python 2 and 3 for all dictionaries. e.g. given dictionaries a to g:

z = merge_dicts(a, b, c, d, e, f, g)

and key-value pairs in g will take precedence over dictionaries a to f, and so on.
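
A typical use of this precedence rule is layering settings. The sketch below repeats merge_dicts so that it is self-contained; the dictionaries defaults, config_file and cli_args are made-up examples.

```python
def merge_dicts(*dict_args):
    """Shallow-merge any number of dicts; later dicts take precedence."""
    result = {}
    for dictionary in dict_args:
        result.update(dictionary)
    return result

# Hypothetical layers of settings, from lowest to highest priority.
defaults = {'host': 'localhost', 'port': 8000}
config_file = {'port': 8080}
cli_args = {'debug': True, 'port': 9000}

settings = merge_dicts(defaults, config_file, cli_args)
print(settings)

# With no arguments at all, the result is simply an empty dict.
empty = merge_dicts()
```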

Critiques of Other Answers

Don't use what you see in the formerly accepted answer:

z = dict(x.items() + y.items())

In Python 2, you create two lists in memory for each dict, create a third list in memory with length equal to the length of the first two put together, and then discard all three lists to create the dict. In Python 3, this will fail because you're adding two dict_items objects together, not two lists -

>>> c = dict(a.items() + b.items())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for +: 'dict_items' and 'dict_items'

and you would have to explicitly create them as lists, e.g. z = dict(list(x.items()) + list(y.items())). This is a waste of resources and computation power.

Similarly, taking the union of items() in Python 3 (viewitems() in Python 2.7) will also fail when values are unhashable objects (like lists, for example). Even if your values are hashable, since sets are semantically unordered, the behavior is undefined in regards to precedence. So don't do this:

>>> c = dict(a.items() | b.items())

This example demonstrates what happens when values are unhashable:

>>> x = {'a': []}
>>> y = {'b': []}
>>> dict(x.items() | y.items())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'

Here's an example where y should have precedence, but instead the value from x is retained due to the arbitrary order of sets:

>>> x = {'a': 2}
>>> y = {'a': 1}
>>> dict(x.items() | y.items())
{'a': 2}

Another hack you should not use:

z = dict(x, **y)

This uses the dict constructor and is very fast and memory-efficient (even slightly more so than our two-step process) but unless you know precisely what is happening here (that is, the second dict is being passed as keyword arguments to the dict constructor), it's difficult to read, it's not the intended usage, and so it is not Pythonic.

Here's an example of the usage being remediated in django.

Dictionaries are intended to take hashable keys (e.g. frozensets or tuples), but this method fails in Python 3 when keys are not strings.

>>> c = dict(a, **b)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: keyword arguments must be strings

From the mailing list, Guido van Rossum, the creator of the language, wrote:

I am fine with declaring dict({}, **{1:3}) illegal, since after all it is abuse of the ** mechanism.

and

Apparently dict(x, **y) is going around as "cool hack" for "call x.update(y) and return x". Personally, I find it more despicable than cool.

It is my understanding (as well as the understanding of the creator of the language) that the intended usage for dict(**y) is for creating dictionaries for readability purposes, e.g.:

dict(a=1, b=10, c=11)

instead of

{'a': 1, 'b': 10, 'c': 11}

Response to comments

Despite what Guido says, dict(x, **y) is in line with the dict specification, which btw. works for both Python 2 and 3. The fact that this only works for string keys is a direct consequence of how keyword parameters work and not a short-coming of dict. Nor is using the ** operator in this place an abuse of the mechanism, in fact, ** was designed precisely to pass dictionaries as keywords.

Again, it doesn't work for 3 when keys are not strings. The implicit calling contract is that namespaces take ordinary dictionaries, while users must only pass keyword arguments that are strings. All other callables enforced it. dict broke this consistency in Python 2:

>>> foo(**{('a', 'b'): None})
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: foo() keywords must be strings
>>> dict(**{('a', 'b'): None})
{('a', 'b'): None}

This inconsistency was bad given other implementations of Python (PyPy, Jython, IronPython). Thus it was fixed in Python 3, as this usage could be a breaking change.

I submit to you that it is malicious incompetence to intentionally write code that only works in one version of a language or that only works given certain arbitrary constraints.

More comments:

dict(x.items() + y.items()) is still the most readable solution for Python 2. Readability counts.

My response: merge_two_dicts(x, y) actually seems much clearer to me, if we're actually concerned about readability. And dict(x.items() + y.items()) is not forward compatible, as Python 2 is increasingly deprecated.

{**x, **y} does not seem to handle nested dictionaries. the contents of nested keys are simply overwritten, not merged [...] I ended up being burnt by these answers that do not merge recursively and I was surprised no one mentioned it. In my interpretation of the word "merging" these answers describe "updating one dict with another", and not merging.

Yes. I must refer you back to the question, which is asking for a shallow merge of two dictionaries, with the first's values being overwritten by the second's - in a single expression.
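
To make the commenter's observation concrete, here is a small sketch of what a shallow merge does with nested values. The dictionaries are made-up examples.

```python
x = {'opts': {'color': 'red', 'size': 1}}
y = {'opts': {'size': 2}}

z = {**x, **y}

# The whole nested dict from y replaces the one from x; 'color' is gone.
print(z)

# The merge is shallow: z shares the nested dict object with y.
shares_nested = z['opts'] is y['opts']
print(shares_nested)
```

Because the nested object is shared, mutating z['opts'] afterwards would also change y['opts'].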

Assuming two dictionaries of dictionaries, one might recursively merge them in a single function, but you should be careful not to modify the dictionaries from either source, and the surest way to avoid that is to make a copy when assigning values. As keys must be hashable and are usually therefore immutable, it is pointless to copy them:

from copy import deepcopy

def dict_of_dicts_merge(x, y):
    z = {}
    overlapping_keys = x.keys() & y.keys()
    for key in overlapping_keys:
        z[key] = dict_of_dicts_merge(x[key], y[key])
    for key in x.keys() - overlapping_keys:
        z[key] = deepcopy(x[key])
    for key in y.keys() - overlapping_keys:
        z[key] = deepcopy(y[key])
    return z

Usage:

>>> x = {'a':{1:{}}, 'b': {2:{}}}
>>> y = {'b':{10:{}}, 'c': {11:{}}}
>>> dict_of_dicts_merge(x, y)
{'b': {2: {}, 10: {}}, 'a': {1: {}}, 'c': {11: {}}}

Coming up with contingencies for other value types is far beyond the scope of this question, so I will point you at my answer to the canonical question on a "Dictionaries of dictionaries merge".

Less Performant But Correct Ad-hocs

These approaches are less performant, but they will provide correct behavior. They will be much less performant than copy and update or the new unpacking because they iterate through each key-value pair at a higher level of abstraction, but they do respect the order of precedence (latter dictionaries have precedence).

You can also chain the dictionaries manually inside a dict comprehension:

{k: v for d in dicts for k, v in d.items()} # iteritems in Python 2.7

or in Python 2.6 (and perhaps as early as 2.4 when generator expressions were introduced):

dict((k, v) for d in dicts for k, v in d.items()) # iteritems in Python 2

itertools.chain will chain the iterators over the key-value pairs in the correct order:

from itertools import chain
z = dict(chain(x.items(), y.items())) # iteritems in Python 2

Performance Analysis

I'm only going to do the performance analysis of the usages known to behave correctly. (Self-contained so you can copy and paste yourself.)

from timeit import repeat
from itertools import chain

x = dict.fromkeys('abcdefg')
y = dict.fromkeys('efghijk')

def merge_two_dicts(x, y):
    z = x.copy()
    z.update(y)
    return z

min(repeat(lambda: {**x, **y}))
min(repeat(lambda: merge_two_dicts(x, y)))
min(repeat(lambda: {k: v for d in (x, y) for k, v in d.items()}))
min(repeat(lambda: dict(chain(x.items(), y.items()))))
min(repeat(lambda: dict(item for d in (x, y) for item in d.items())))

In Python 3.8.1, NixOS:

>>> min(repeat(lambda: {**x, **y}))
1.0804965235292912
>>> min(repeat(lambda: merge_two_dicts(x, y)))
1.636518670246005
>>> min(repeat(lambda: {k: v for d in (x, y) for k, v in d.items()}))
3.1779992282390594
>>> min(repeat(lambda: dict(chain(x.items(), y.items()))))
2.740647904574871
>>> min(repeat(lambda: dict(item for d in (x, y) for item in d.items())))
4.266070580109954

$ uname -a
Linux nixos 4.19.113 #1-NixOS SMP Wed Mar 25 07:06:15 UTC 2020 x86_64 GNU/Linux

Resources on Dictionaries

My explanation of Python's dictionary implementation, updated for 3.6.
Answer on how to add new keys to a dictionary
Mapping two lists into a dictionary
The official Python docs on dictionaries
The Dictionary Even Mightier - talk by Brandon Rhodes at Pycon 2017
Modern Python Dictionaries, A Confluence of Great Ideas - talk by Raymond Hettinger at Pycon 2017

Python – How to execute a program or call a system command

Use the subprocess module in the standard library:

import subprocess
subprocess.run(["ls", "-l"])

The advantage of subprocess.run over os.system is that it is more flexible (you can get the stdout, stderr, the "real" status code, better error handling, etc...).
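
As a minimal sketch of that flexibility, the example below captures the output and checks the status code. It runs the current Python interpreter as the child process (via sys.executable) only so that the snippet works on any platform; the arguments used are available from Python 3.5 onward.

```python
import subprocess
import sys

completed = subprocess.run(
    [sys.executable, "-c", "print('hello')"],
    stdout=subprocess.PIPE,     # capture standard output
    stderr=subprocess.PIPE,     # capture standard error
    universal_newlines=True,    # decode bytes to str
    check=True,                 # raise CalledProcessError on a non-zero exit
)

output = completed.stdout.strip()
print(completed.returncode, output)
```

With check=True, a failing command raises an exception instead of being silently ignored, which is harder to get right with os.system.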

Even the documentation for os.system recommends using subprocess instead:

The subprocess module provides more powerful facilities for spawning new processes and retrieving their results; using that module is preferable to using this function. See the Replacing Older Functions with the subprocess Module section in the subprocess documentation for some helpful recipes.

On Python 3.4 and earlier, use subprocess.call instead of .run:

subprocess.call(["ls", "-l"])