Java OOP – Strategies for Encapsulating Shared Data in Software Pipeline

java object-oriented-design

I'm working on refactoring certain aspects of an existing web service. The service APIs are implemented as a kind of "processing pipeline", where tasks are performed in sequence. Unsurprisingly, later tasks may need information computed by earlier tasks, and currently this is done by adding fields to a "pipeline state" class.

I've been thinking (and hoping?) that there's a better way to share information between pipeline steps than having a data object with a zillion fields, some of which make sense to some processing steps and not to others. It would be a major pain to make this class thread-safe (I don't know if it would even be possible), and there is no way to reason about its invariants (it likely doesn't have any).

I was paging through the Gang of Four design patterns book to find some inspiration, but I didn't feel like there was a solution in there (Memento was somewhat in the same spirit, but not quite). I also looked online, but the second you search for "pipeline" or "workflow" you get flooded with either Unix pipes information, or proprietary workflow engines and frameworks.

My question is – how would you approach the issue of recording the execution state of a software processing pipeline, so that later tasks can use information computed by earlier ones? I guess the major difference with Unix pipes is that you don't just care about the output of the immediately preceding task.


As requested, some pseudocode to illustrate my use case:

The "pipeline context" object has a bunch of fields that the different pipeline steps can populate/read:

public class PipelineCtx {
    ... // fields
    public Foo getFoo() { return this.foo; }
    public void setFoo(Foo aFoo) { this.foo = aFoo; }
    public Bar getBar() { return this.bar; }
    public void setBar(Bar aBar) { this.bar = aBar; }
    ... // more methods
}

Each of the pipeline steps is also an object:

public abstract class PipelineStep {
    public abstract PipelineCtx doWork(PipelineCtx ctx);
}

public class BarStep extends PipelineStep {
    @Override
    public PipelineCtx doWork(PipelineCtx ctx) {
        // do work based on the stuff in ctx
        Bar theBar = ...; // compute it
        ctx.setBar(theBar);

        return ctx;
    }
}

Similarly for a hypothetical FooStep, which might need the Bar computed by BarStep before it, along with other data. And then we have the real API call:

public class BlahOperation extends ProprietaryWebServiceApiBase {
    public BlahResponse handle(BlahRequest request) {
        PipelineCtx ctx = PipelineCtx.from(request);

        // some steps happen here
        // ...

        BarStep barStep = new BarStep();
        barStep.doWork(ctx);

        // some more steps maybe
        // ...

        FooStep fooStep = new FooStep();
        fooStep.doWork(ctx);

        // final steps ...

        return BlahResponse.from(ctx);
    }
}

Best Answer

The main reason to use a pipeline design is that you want to decouple the stages, either because one stage may be used in multiple pipelines (like the Unix shell tools) or because you gain some scaling benefit (i.e., you can easily move from a single-node architecture to a multi-node architecture).

In either case, each stage in the pipeline needs to be given everything that it needs to do its job. There's no reason that you can't use an external store (e.g., a database), but in most cases it's better to pass the data from one stage to another.

However, that doesn't mean that you must or should pass one big message object with every possible field (although see below). Instead, each stage in the pipeline should define interfaces for its input and output messages that identify just the data that stage needs.
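As a rough sketch of what per-stage interfaces might look like for the question's BarStep and FooStep: the interface names (BarStepInput, BarStepOutput, FooStepInput, FooStepOutput), the getRequestId accessor, the PipelineMessage class, and the string-valued Foo and Bar are all made up for illustration and do not come from the original code.

```java
// Hypothetical stand-ins for the question's Foo and Bar types.
final class Bar {
    final String value;
    Bar(String value) { this.value = value; }
}

final class Foo {
    final String value;
    Foo(String value) { this.value = value; }
}

// Each stage declares only what it reads and what it writes.
interface BarStepInput  { String getRequestId(); }
interface BarStepOutput { void setBar(Bar bar); }
interface FooStepInput  { Bar getBar(); }
interface FooStepOutput { void setFoo(Foo foo); }

final class BarStep {
    void doWork(BarStepInput in, BarStepOutput out) {
        // Placeholder computation; the real step would do actual work here.
        out.setBar(new Bar("bar-for-" + in.getRequestId()));
    }
}

final class FooStep {
    void doWork(FooStepInput in, FooStepOutput out) {
        // FooStep can see the Bar, but nothing else in the message.
        out.setFoo(new Foo("foo-from-" + in.getBar().value));
    }
}

// One possible message object: it implements every stage's interfaces,
// but each stage only sees the slice it declared.
final class PipelineMessage
        implements BarStepInput, BarStepOutput, FooStepInput, FooStepOutput {
    private final String requestId;
    private Bar bar;
    private Foo foo;

    PipelineMessage(String requestId) { this.requestId = requestId; }

    @Override public String getRequestId() { return requestId; }
    @Override public void setBar(Bar bar) { this.bar = bar; }
    @Override public Bar getBar() { return bar; }
    @Override public void setFoo(Foo foo) { this.foo = foo; }
    Foo getFoo() { return foo; }
}
```

With this shape, a caller would pass the same message as both arguments, e.g. `new BarStep().doWork(msg, msg)`, and the compiler keeps each step from touching fields it never declared an interest in.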

You then have a lot of flexibility in how you implement your actual message objects. One approach is to use a huge data object that implements all the necessary interfaces. Another is to create wrapper classes around a simple Map. Still another is to create a wrapper class around a database.
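To illustrate the Map-wrapper option, here is a minimal sketch. The names BarReader, BarWriter, and MapBackedMessage, the "bar" key, and the string-valued Bar are assumptions made for the example; the sketch is not thread-safe and is only meant to show the shape of the idea.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the question's Bar type.
final class Bar {
    final String value;
    Bar(String value) { this.value = value; }
}

// Narrow views that individual stages would depend on.
interface BarWriter { void setBar(Bar bar); }
interface BarReader { Bar getBar(); }

// The message stores everything in a plain Map; the interfaces give
// stages typed access to just the entries they care about.
final class MapBackedMessage implements BarWriter, BarReader {
    private final Map<String, Object> fields = new HashMap<>();

    @Override
    public void setBar(Bar bar) {
        fields.put("bar", bar);
    }

    @Override
    public Bar getBar() {
        Object value = fields.get("bar");
        if (value == null) {
            // Fail loudly if a stage runs before the stage it depends on.
            throw new IllegalStateException("bar has not been computed yet");
        }
        return (Bar) value;
    }
}
```

Adding a new stage then means adding a new interface and a pair of accessors on the wrapper, without the other stages seeing any change.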
