Design – How to design for good abstractions using algebraic data types

algebraic-data-type, design, functional-programming

Every now and then I have peeked at Haskell tutorials and found algebraic data types quite interesting. I took their purpose to be representing types that have completely separable states. Sadly, I never got to write more Haskell than tutorial-level projects, so I never really had to design programs using this pattern.

Now I am writing some Rust, and I have algebraic datatypes (enums) in the toolbox. However, I am not very confident in using them.

Let me start with an example where I am confident that such an enum is a proper choice.

enum Tree {
    Leaf(String),
    // Recursive variants need indirection (Box) so the type has a known size.
    Branch(Box<Tree>, Box<Tree>),
}

The same example would be applicable to XML-like structures, etc.

With other data, I am not so confident about using enum types. Let's take a connection object:

enum Connection {
    UnConnected(...),
    ConnectedConnection(....)
}

Here we would have a Connection type with two possible values, one representing the state where a connection is not yet established, the other one could represent the state of a connected connection (and wrapping a connection handle for example).

The other possibility would be to introduce 2 types for a Connection-Template and a connected connection.

Another example that I found in Rust code is in the hyper library. Its type Response represents an HTTP response, and it is a generic type.
Response<Fresh> represents the state where headers are not yet frozen. Later it is converted into a Response<Streaming>, which can be used to write the body of the response. It seems that what hyper models here with different types (Response<Fresh> vs Response<Streaming>) could have been modeled with enum types as well.

The approach with different types allows for more safety: Response<Fresh> does not implement streaming (I think) and Response<Streaming> does.
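To make the comparison concrete, here is a minimal sketch of the "different types" approach using marker types. It is not hyper's actual definition: the fields, the `set_header` and `write` methods, and the name `start` for the transition are all assumptions chosen for illustration.

```rust
use std::marker::PhantomData;

// Marker types for the two states; they carry no data.
struct Fresh;
struct Streaming;

// Hypothetical Response type (not hyper's real one). The State parameter
// exists only at compile time.
struct Response<State> {
    headers: Vec<(String, String)>,
    body: Vec<u8>,
    _state: PhantomData<State>,
}

impl Response<Fresh> {
    fn new() -> Self {
        Response { headers: Vec::new(), body: Vec::new(), _state: PhantomData }
    }

    // Headers can only be changed while the response is Fresh.
    fn set_header(&mut self, name: &str, value: &str) {
        self.headers.push((name.to_string(), value.to_string()));
    }

    // Consumes the Fresh response, so the headers are frozen from here on.
    fn start(self) -> Response<Streaming> {
        Response { headers: self.headers, body: self.body, _state: PhantomData }
    }
}

impl Response<Streaming> {
    // Writing the body is only available once the response is Streaming.
    fn write(&mut self, chunk: &[u8]) {
        self.body.extend_from_slice(chunk);
    }
}
```

With this shape, calling `write` on a `Response<Fresh>` or `set_header` on a `Response<Streaming>` is a compile error, which an enum with two variants could only detect at run time.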

Do you know of guidelines and best practices that guide through modelling logic in types properly?

Best Answer

Even though you said algebraic data types, you seem to mostly be asking about sum types, so I will focus on those. Product types are more common and more easily understood.

Sum types are most easily understood not by thinking about what you're modeling, but by thinking about the code that uses it. People tend to think of sum types as representing states that are mutually exclusive, but this isn't the entire picture. To use a sum type, you should also have a use for the encompassing type. This represents an indeterminate state, where at the point of calling the function, you know you either have or need one of the term types, but you don't know which one. If you always know which one statically, you should just create separate types.

For your Tree example, that means you should have functions that actually take or return a Tree. As you traverse down the tree, you have this state where you don't know if the next node down is going to be a Leaf or a Branch, so you need a type that can be either. If you only had functions that take or return Leafs or Branches, you wouldn't need a sum type.
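As a rough illustration, the sketch below defines the Tree type with two functions that receive "some Tree" without knowing the variant in advance. The helper names `count_leaves` and `labels` are made up for this example.

```rust
// The Tree type from the question, written so that it compiles:
// recursive variants need Box to give the enum a known size.
#[derive(Debug)]
enum Tree {
    Leaf(String),
    Branch(Box<Tree>, Box<Tree>),
}

// Hypothetical helper: at each step we do not know statically whether
// the node is a Leaf or a Branch, so we match on it.
fn count_leaves(tree: &Tree) -> usize {
    match tree {
        Tree::Leaf(_) => 1,
        Tree::Branch(left, right) => count_leaves(left) + count_leaves(right),
    }
}

// Hypothetical helper: concatenates the leaf labels from left to right.
fn labels(tree: &Tree) -> String {
    match tree {
        Tree::Leaf(label) => label.clone(),
        Tree::Branch(left, right) => format!("{}{}", labels(left), labels(right)),
    }
}
```

Both functions recurse into children whose variant is only known at run time, which is exactly the situation that calls for the encompassing type.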

Your Connection example is a completely different matter. At some point, you have a connect function that takes in an Unconnected and returns a Connected. At some later point, you have a disconnect function that does the reverse. There is no point where you don't know statically if a connection is connected or not, therefore you don't need a type that can hold either.
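A minimal sketch of that alternative, assuming two separate types whose transitions consume the old value. The field names, the `u32` stand-in for a handle, and the `send` method are illustrative only.

```rust
// All names and fields here are hypothetical.
struct Unconnected {
    address: String,
}

struct Connected {
    address: String,
    handle: u32, // stand-in for a real socket or connection handle
}

impl Unconnected {
    fn new(address: &str) -> Self {
        Unconnected { address: address.to_string() }
    }

    // Takes self by value: the Unconnected value cannot be used afterwards.
    fn connect(self) -> Connected {
        Connected { address: self.address, handle: 1 }
    }
}

impl Connected {
    // Only a Connected value offers send; no run-time state check is needed.
    fn send(&self, msg: &str) -> String {
        format!("[{}#{}] {}", self.address, self.handle, msg)
    }

    // The reverse transition, again consuming the old value.
    fn disconnect(self) -> Unconnected {
        Unconnected { address: self.address }
    }
}
```

Because every function states in its signature which of the two types it needs, no caller ever has to match on "connected or not".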

If you have a hard time seeing appropriate situations to use sum types, my recommendation is to start out not using them. If they are appropriate, at some point you'll hit a function that can't be written otherwise, and you can add one then. Forcing a sum type when it's not needed leads to a lot of unnecessary pattern matching that could be much cleaner with separate functions.

Related Solutions

Encode Algebraic Data Types – How to Encode Algebraic Data Types in C# or Java

There is an easy, but boilerplate heavy way to seal classes in Java. You put a private constructor in the base class then make subclasses inner classes of it.

public abstract class List<A> {

   // private constructor is uncallable by any subclasses except inner classes
   private List() {
   }

   public static final class Nil<A> extends List<A> {
   }

   public static final class Cons<A> extends List<A> {
      public final A head;
      public final List<A> tail;

      public Cons(A head, List<A> tail) {
         this.head = head;
         this.tail = tail;
      }
   }
}

Tack on a visitor pattern for dispatch.

My project jADT (Java Algebraic DataTypes) generates all that boilerplate for you: https://github.com/JamesIry/jADT

Design – How do we keep dependent data structures up to date

I think your scenarios are discussing variations on the Observer Pattern. Each original node (“subject”) has (at least) the following two methods:

registerObserver(observer) – adds a dependent node to the list of observers.
notifyObservers() – calls x.notify(this) on each observer

And each dependent node (“observer”) has a notify(original) method. Comparing your scenarios:

1. The notify method immediately rebuilds a dependent subtree.
2. The notify method sets a flag, the recomputation happens after each set of updates.
3. The notifyObservers method is smart and only notifies those observers whose constraints are invalidated. This would probably use the Visitor Pattern, so that the dependent nodes can offer a method that decides this.
4. (this pattern has no relation to brute-force rebuilding)
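The two subject methods and the observer's notify method can be sketched as follows, here in the first variant where the dependent value is rebuilt immediately. This is only one possible shape: the `Doubled` observer and the `i32` payload are invented for the example, and notify receives the new value rather than the subject itself to keep the borrowing simple.

```rust
use std::cell::RefCell;
use std::rc::Rc;

// The dependent side: anything that can be told about a change.
trait Observer {
    fn notify(&mut self, value: i32);
}

// The original node ("subject") with its list of dependents.
struct Subject {
    value: i32,
    observers: Vec<Rc<RefCell<dyn Observer>>>,
}

impl Subject {
    fn new(value: i32) -> Self {
        Subject { value, observers: Vec::new() }
    }

    // Adds a dependent node to the list of observers.
    fn register_observer(&mut self, observer: Rc<RefCell<dyn Observer>>) {
        self.observers.push(observer);
    }

    // Calls notify on each registered observer.
    fn notify_observers(&self) {
        for observer in &self.observers {
            observer.borrow_mut().notify(self.value);
        }
    }

    // Every update triggers a notification.
    fn set(&mut self, value: i32) {
        self.value = value;
        self.notify_observers();
    }
}

// Hypothetical dependent node: recomputes its derived value right away.
struct Doubled {
    value: i32,
}

impl Observer for Doubled {
    fn notify(&mut self, value: i32) {
        self.value = value * 2;
    }
}
```

A caller would wrap a `Doubled` in `Rc<RefCell<...>>`, register a clone of it, and read the derived value back through the original `Rc` after calling `set`.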

As the first three ideas are just variations on the observer pattern, their designs will have similar complexity (as it happens, they are ordered by increasing complexity – I'd think №1 is the simplest to implement).

I can think of one enhancement: building the dependent trees lazily. Each dependent node would then have a boolean flag that marks it as valid or invalid. Each accessor method would check this flag and, if necessary, recalculate the subtree. The difference from №2 is that recalculation happens on access, not upon change. This would probably result in the fewest computations, but can lead to significant difficulties if the type of a node would have to change upon access.
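The lazy variant could look roughly like this. The `LazySum` type and its `recomputations` counter are hypothetical, and a plain sum stands in for rebuilding a subtree.

```rust
// Hypothetical dependent node that caches a value derived from its source.
struct LazySum {
    source: Vec<i32>,
    cached: i32,
    valid: bool,         // the validity flag described above
    recomputations: u32, // only here to make the laziness observable
}

impl LazySum {
    fn new(source: Vec<i32>) -> Self {
        LazySum { source, cached: 0, valid: false, recomputations: 0 }
    }

    // A change only marks the cache invalid; nothing is recomputed yet.
    fn push(&mut self, value: i32) {
        self.source.push(value);
        self.valid = false;
    }

    // The accessor checks the flag and recomputes only when necessary.
    fn get(&mut self) -> i32 {
        if !self.valid {
            self.cached = self.source.iter().sum();
            self.valid = true;
            self.recomputations += 1;
        }
        self.cached
    }
}
```

Several calls to `push` followed by several calls to `get` cost a single recomputation, which is where the saving over recalculating on every change comes from.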

I would also like to challenge the need for multiple dependent trees. For example, I always structure my parsers in a way that they immediately emit an AST. Information that is only relevant during construction of this tree doesn't have to be stored in any permanent data structure. Likewise, you can also choose your objects in such a way that the AST has an interpretation as a control flow graph.

For a real-life example, the compiler part inside the perl interpreter does this: The AST is built bottom-up, during which some nodes are constant-folded away. In a second run, the nodes are connected in execution order, during which some nodes are skipped by optimizations. The result is very fast parsing (and few allocations), but very limited optimizations. It should be noted that while such a design is possible, it is probably not something you should strive for: It is a ~~calculated trade-off~~ complete violation of the Single-Responsibility Principle.

If you do actually need multiple trees, then you should also consider whether they really have to be built simultaneously. In the majority of cases, a parse tree is constant after the parse. Likewise, an AST will probably stay constant after macros are resolved and AST-level optimizations have been executed.
