Data Structures – How to Store ‘Unknown’ and ‘Missing’ Values in a Variable

data structuresdata typestype-systems

Consider this an "academic" question. I have been wondering about about avoiding NULLs from time to time and this is an example where I can't come up with a satisfactory solution.

Let's assume I store measurements where on occasions the measurement is known to be impossible (or missing). I would like to store that "empty" value in a variable while avoiding NULL. Other times the value could be unknown. So, having the measurements for a certain time-frame, a query about a measurement within that time period could return 3 kinds of responses:

The actual measurement at that time (for example, any numerical value including 0)
A "missing"/"empty" value (i.e., a measurement was done, and the value is known to be empty at that point).
An unknown value (i.e., no measurement has been done at that point. It could be empty, but it could also be any other value).

Important Clarification:

Assuming you had a function get_measurement() returning one of "empty", "unknown" and a value of type "integer". Having a numerical value implies that certain operations can be done on the return value (multiplication, division, …) but using such operations on NULLs will crash the application if not caught.

I would like to be able to write code, avoiding NULL checks, for example (pseudocode):

>>> value = get_measurement()  # returns `2`
>>> print(value * 2)
4

>>> value = get_measurement()  # returns `Empty()`
>>> print(value * 2)
Empty()

>>> value = get_measurement()  # returns `Unknown()`
>>> print(value * 2)
Unknown()

Note that none of the print statements caused exceptions (as no NULLs were used). So the empty & unknown values would propagate as necessary and the check whether a value is actually "unknown" or "empty" could be delayed until really necessary (like storing/serialising the value somewhere).

Side-Note: The reason I'd like to avoid NULLs, is primarily a brain-teaser. If I want to get stuff done I'm not opposed to using NULLs, but I found that avoiding them can make code a lot more robust in some cases.

Best Answer

The common way to do this, at least with functional languages is to use a discriminated union. This is then a value that is one of a valid int, a value that denotes "missing" or a value that denotes "unknown". In F#, it might look something like:

type Measurement =
    | Reading of value : int
    | Missing
    | Unknown of value : RawData

A Measurement value will then be a Reading, with an int value, or a Missing, or an Unknown with the raw data as value (if required).

However, if you aren't using a language that supports discriminated unions, or their equivalent, this pattern isn't likely of much use to you. So there, you could eg use a class with an enum field that denotes which of the three contains the correct data.

Related Solutions

What’s the difference between a stream and a queue

A stream is not really a data structure as such (conceptually), but is a sequence of digitally encoded coherent signals (packets of data or data packets) used to transmit or receive information". So basically a sequence of data.

A queue is a simple FIFO mechanism allowing you to add items to the back of the queue or take from the front.

Streams always have a source, e.g. a file, network location, etc. A Queue does not inherently contain any data.

So essentially they are quite different in concept and as Mason pointed out, they are used differently.

Haskell Type Systems – Baking Dimension into a Type

To expand on @KarlBielefeldt's answer, here's a full example of how to implement Vectors - lists with a statically-known number of elements - in Haskell. Hold on to your hat...

{-# LANGUAGE DataKinds #-}
{-# LANGUAGE ExistentialQuantification #-}
{-# LANGUAGE DeriveFoldable #-}
{-# LANGUAGE DeriveFunctor #-}
{-# LANGUAGE DeriveTraversable #-}
{-# LANGUAGE GADTs #-}
{-# LANGUAGE KindSignatures #-}
{-# LANGUAGE StandaloneDeriving #-}
{-# LANGUAGE TypeOperators #-}
{-# LANGUAGE TypeFamilies #-}

import Prelude hiding (foldr, zipWith)
import qualified Prelude
import Data.Type.Equality
import Data.Foldable
import Data.Traversable

As you can see from the long list of LANGUAGE directives, this'll only work with a recent version of GHC.

We need a way of representing lengths within the type system. By definition, a natural number is either zero (Z) or it's the successor of some other natural number (S n). So, for example, the number 3 would be written S (S (S Z)).

data Nat = Z | S Nat

With the DataKinds extension, this data declaration introduces a kind called Nat and two type constructors called S and Z - in other words we have type-level natural numbers. Note that the types S and Z don't have any member values - only types of kind * are inhabited by values.

Now we introduce a GADT representing vectors with a known length. Note the kind signature: Vec requires a type of kind Nat (ie a Z or an S type) to represent its length.

data Vec :: Nat -> * -> * where
    VNil :: Vec Z a
    VCons :: a -> Vec n a -> Vec (S n) a
deriving instance (Show a) => Show (Vec n a)
deriving instance Functor (Vec n)
deriving instance Foldable (Vec n)
deriving instance Traversable (Vec n)

The definition of vectors is similar to that of linked lists, with some extra type-level information about its length. A vector is either VNil, in which case it has a length of Z(ero), or it's a VCons cell adding an item to another vector, in which case its length is one more than the other vector (S n). Note that there's no constructor argument of type n. It's just used at compile time to track lengths, and will be erased before the compiler generates machine code.

We've defined a vector type which carries around static knowledge of its length. Let's query the type of a few Vecs to get a feel for how they work:

ghci> :t (VCons 'a' (VCons 'b' VNil))
(VCons 'a' (VCons 'b' VNil)) :: Vec ('S ('S 'Z)) Char  -- (S (S Z)) means 2
ghci> :t (VCons 13 (VCons 11 (VCons 3 VNil)))
(VCons 13 (VCons 11 (VCons 3 VNil))) :: Num a => Vec ('S ('S ('S 'Z))) a  -- (S (S (S Z))) means 3

The dot product proceeds just as it would for a list:

-- note that the two Vec arguments are declared to have the same length
vap :: Vec n (a -> b) -> Vec n a -> Vec n b
vap VNil VNil = VNil
vap (VCons f fs) (VCons x xs) = VCons (f x) (vap fs xs)

zipWith :: (a -> b -> c) -> Vec n a -> Vec n b -> Vec n c
zipWith f xs ys = fmap f xs `vap` ys

dot :: Num a => Vec n a -> Vec n a -> a
dot xs ys = foldr (+) 0 $ zipWith (*) xs ys

vap, which 'zippily' applies a vector of functions to a vector of arguments, is Vec's applicative <*>; I didn't put it in an Applicative instance because it gets messy. Note also that I'm using the foldr from the compiler-generated instance of Foldable.

Let's try it out:

ghci> let v1 = VCons 2 (VCons 1 VNil)
ghci> let v2 = VCons 4 (VCons 5 VNil)
ghci> v1 `dot` v2
13
ghci> let v3 = VCons 8 (VCons 6 (VCons 1 VNil))
ghci> v1 `dot` v3
<interactive>:20:10:
    Couldn't match type ‘'S 'Z’ with ‘'Z’
    Expected type: Vec ('S ('S 'Z)) a
      Actual type: Vec ('S ('S ('S 'Z))) a
    In the second argument of ‘dot’, namely ‘v3’
    In the expression: v1 `dot` v3

Great! You get a compile-time error when you try to dot vectors whose lengths don't match.

Here's an attempt at a function to concatenate vectors together:

-- This won't compile because the type checker can't deduce the length of the returned vector
-- VNil +++ ys = ys
-- (VCons x xs) +++ ys = VCons x (concat xs ys)

The length of the output vector would be the sum of the lengths of the two input vectors. We need to teach the type checker how to add Nats together. For this we use a type-level function:

type family (n :: Nat) :+: (m :: Nat) :: Nat where
    Z :+: m = m
    (S n) :+: m = S (n :+: m)

This type family declaration introduces a function on types called :+: - in other words, it's a recipe for the type checker to calculate the sum of two natural numbers. It's defined recursively - whenever the left operand is greater than Zero we add one to the output and reduce it by one in the recursive call. (It's a good exercise to write a type function which multiplies two Nats.) Now we can make +++ compile:

infixr 5 +++
(+++) :: Vec n a -> Vec m a -> Vec (n :+: m) a
VNil +++ ys = ys
(VCons x xs) +++ ys = VCons x (concat xs ys)

Here's how you use it:

ghci> VCons 1 (VCons 2 VNil) +++ VCons 3 (VCons 4 VNil)
VCons 1 (VCons 2 (VCons 3 (VCons 4 VNil)))

So far so simple. What about when we want to do the opposite of concatenation and split a vector in two? The lengths of the output vectors depend on the runtime value of the arguments. We'd like to write something like this:

-- this won't work because there aren't any values of type `S` and `Z`
-- split :: (n :: Nat) -> Vec (n :+: m) a -> (Vec n a, Vec m a)

but unfortunately Haskell won't let us do that. Allowing the value of the n argument to appear in the return type (this is commonly called a dependent function or pi type) would require "full-spectrum" dependent types, whereas DataKinds only gives us promoted type constructors. To put it another way, the type constructors S and Z don't appear at the value level. We'll have to settle for singleton values for a run-time representation of a certain Nat.*

data Natty (n :: Nat) where
    Zy :: Natty Z  -- pronounced 'zed-y'
    Sy :: Natty n -> Natty (S n)  -- pronounced 'ess-y'
deriving instance Show (Natty n)

For a given type n (with kind Nat), there is precisely one term of type Natty n. We can use the singleton value as a run-time witness for n: learning about a Natty teaches us about its n and vice versa.

split :: Natty n ->
         Vec (n :+: m) a ->  -- the input Vec has to be at least as long as the input Natty
         (Vec n a, Vec m a)
split Zy xs = (Nil, xs)
split (Sy n) (Cons x xs) = let (ys, zs) = split n xs
                           in (Cons x ys, zs)

Let's take it for a spin:

ghci> split (Sy (Sy Zy)) (VCons 1 (VCons 2 (VCons 3 VNil)))
(VCons 1 (VCons 2 VNil), VCons 3 VNil)
ghci> split (Sy (Sy Zy)) (VCons 3 VNil)
<interactive>:116:21:
    Couldn't match type ‘'S ('Z :+: m)’ with ‘'Z’
    Expected type: Vec ('S ('S 'Z) :+: m) a
      Actual type: Vec ('S 'Z) a
    Relevant bindings include
      it :: (Vec ('S ('S 'Z)) a, Vec m a) (bound at <interactive>:116:1)
    In the second argument of ‘split’, namely ‘(VCons 3 VNil)’
    In the expression: split (Sy (Sy Zy)) (VCons 3 VNil)

In the first example, we successfully split a three-element vector at position 2; then we got a type error when we tried to split a vector at a position past the end. Singletons are the standard technique for making a type depend on a value in Haskell.

_{* The singletons library contains some Template Haskell helpers to generate singleton values like Natty for you.}

Last example. What about when you don't know the dimensionality of your vector statically? For example, what if we're trying to build a vector from run-time data in the form of a list? You need the type of the vector to depend on the length of the input list. To put it another way, we can't use foldr VCons VNil to build a vector because the type of the output vector changes with each iteration of the fold. We need to keep the length of the vector a secret from the compiler.

data AVec a = forall n. AVec (Natty n) (Vec n a)
deriving instance (Show a) => Show (AVec a)

fromList :: [a] -> AVec a
fromList = Prelude.foldr cons nil
    where cons x (AVec n xs) = AVec (Sy n) (VCons x xs)
          nil = AVec Zy VNil

AVec is an existential type: the type variable n does not appear in the return type of the AVec data constructor. We're using it to simulate a dependent pair: fromList can't tell you the length of the vector statically, but it can return something you can pattern-match on to learn the length of the vector - the Natty n in the first element of the tuple. As Conor McBride puts it in a related answer, "You look at one thing, and in doing so, learn about another".

This is a common technique for existentially quantified types. Because you can't actually do anything with data for which you don't know the type - try writing a function of data Something = forall a. Sth a - existentials often come bundled up with GADT evidence which allows you to recover the original type by performing pattern-matching tests. Other common patterns for existentials include packaging up functions to process your type (data AWayToGetTo b = forall a. HeresHow a (a -> b)) which is a neat way of doing first-class modules, or building-in a type class dictionary (data AnOrd = forall a. Ord a => AnOrd a) which can help emulate subtype polymorphism.

ghci> fromList [1,2,3]
AVec (Sy (Sy (Sy Zy))) (VCons 1 (VCons 2 (VCons 3 Nil)))

Dependent pairs are useful whenever the static properties of data depend on dynamic information not available at compile time. Here's filter for vectors:

filter :: (a -> Bool) -> Vec n a -> AVec a
filter f = foldr (\x (AVec n xs) -> if f x
                                    then AVec (Sy n) (VCons x xs)
                                    else AVec n xs) (AVec Zy VNil)

To dot two AVecs, we need to prove to GHC that their lengths are equal. Data.Type.Equality defines a GADT which can only be constructed when its type arguments are the same:

data (a :: k) :~: (b :: k) where
    Refl :: a :~: a  -- short for 'reflexivity'

When you pattern-match on Refl, GHC knows that a ~ b. There are also a few functions to help you work with this type: we'll be using gcastWith to convert between equivalent types, and TestEquality to determine whether two Nattys are equal.

To test the equality of two Nattys, we're going to need to make use of the fact that if two numbers are equal, then their successors are also equal (:~: is congruent over S):

congSuc :: (n :~: m) -> (S n :~: S m)
congSuc Refl = Refl

Pattern matching on Refl on the left-hand side lets GHC know that n ~ m. With that knowledge, it's trivial that S n ~ S m, so GHC lets us return a new Refl right away.

Now we can write an instance of TestEquality by straightforward recursion. If both numbers are zero, they are equal. If both numbers have predecessors, they are equal iff the predecessors are equal. (If they're not equal, just return Nothing.)

instance TestEquality Natty where
    -- testEquality :: Natty n -> Natty m -> Maybe (n :~: m)
    testEquality Zy Zy = Just Refl
    testEquality (Sy n) (Sy m) = fmap congSuc (testEquality n m)  -- check whether the predecessors are equal, then make use of congruence
    testEquality Zy _ = Nothing
    testEquality _ Zy = Nothing

Now we can put the pieces together to dot a pair of AVecs of unknown length.

dot' :: Num a => AVec a -> AVec a -> Maybe a
dot' (AVec n u) (AVec m v) = fmap (\proof -> gcastWith proof (dot u v)) (testEquality n m)

First, pattern match on the AVec constructor to pull out a runtime representation of the vectors' lengths. Now use testEquality to determine whether those lengths are equal. If they are, we'll have Just Refl; gcastWith will use that equality proof to ensure that dot u v is well-typed by discharging its implicit n ~ m assumption.

ghci> let v1 = fromList [1,2,3]
ghci> let v2 = fromList [4,5,6]
ghci> let v3 = fromList [7,8]
ghci> dot' v1 v2
Just 32
ghci> dot' v1 v3
Nothing  -- they weren't the same length

Note that, since a vector without static knowledge of its length is basically a list, we've effectively re-implemented the list version of dot :: Num a => [a] -> [a] -> Maybe a. The difference is that this version is implemented in terms of the vectors' dot. Here's the point: before the type checker will allow you to call dot, you must have tested whether the input lists have the same length using testEquality. I am prone to getting if-statements the wrong way round, but not in a dependently-typed setting!

You can't avoid using existential wrappers at the edges of your system, when you're dealing with runtime data, but you can use dependent types everywhere inside your system and keep the existential wrappers at the edges, when you perform input validation.

Since Nothingis not very informative, you could further refine the type of dot' to return a proof that the lengths aren't equal (in the form of evidence that their difference is not 0) in the failure case. This is pretty similar to the standard Haskell technique of using Either String a to possibly return an error message, although a proof term is far more computationally useful than a string!

Thus ends this whistle-stop tour of some of the techniques that are common in dependently-typed Haskell programming. Programming with types like this in Haskell is really cool, but really awkward at the same time. Breaking up all your dependent data into lots of representations which mean the same thing - Nat the type, Nat the kind, Natty n the singleton - is really quite cumbersome, despite the existence of code-generators to help with the boilerplate. There are also presently limitations on what can be promoted to the type level. It's tantalising though! The mind boggles at the possibilities - in the literature there are examples in Haskell of strongly typed printf, database interfaces, UI layout engines...

If you want some further reading, there's a growing body of literature about dependently typed Haskell, both published and on sites like Stack Overflow. A good starting point is the Hasochism paper - the paper goes through this very example (among others), discussing the painful parts in some detail. The Singletons paper demonstrates the technique of singleton values (such as Natty). For more information about dependent typing in general, the Agda tutorial is a good place to start; also, Idris is a language in development that's (roughly) designed to be "Haskell with dependent types".

Best Answer

Related Solutions

What’s the difference between a stream and a queue

Haskell Type Systems – Baking Dimension into a Type

Related Topic