Haskell Memory Efficiency – Best Approaches

functional-programming grammar haskell memory

We are implementing a matrix compression library based on a modified two-dimensional grammar syntax. We now have two approaches for our data types - which one will be better in terms of memory usage? (We want to compress something ;)).

The grammars contain productions that have either exactly 4 NonTerminals or a Terminal on the right-hand side. We will need the names of the productions for equality checks and grammar minimization.

The first approach:

-- | Type synonym for non-terminal symbols
type NonTerminal = String

-- | Data type for the right hand side of a production
data RightHandSide = DownStep NonTerminal NonTerminal NonTerminal NonTerminal | Terminal Int

-- | Data type for a set of productions
type ProductionMap = Map NonTerminal RightHandSide

data MatrixGrammar = MatrixGrammar {
    -- the start symbol
    startSymbol :: NonTerminal,
    -- productions
    productions :: ProductionMap    
    } 

Here our RightHandSide data stores only String names to identify the next productions, and what we do not know is how Haskell stores these strings. For example, the matrix [[0, 0], [0, 0]] has 2 productions:

a = Terminal 0
aString = "A"
b = DownStep aString aString aString aString
bString = "B"
productions = Map.fromList [(aString, a), (bString, b)]

So the question here is: how often is the String "A" actually stored? Once in aString, 4 times in b, and once in productions - or just once in aString, with the other occurrences holding "cheaper" references?
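For illustration, here is a minimal sketch of the distinction in question, built on the types of the first approach (the names shared and maybeNotShared are ours): reusing a bound name guarantees that all four fields point at the same heap object, while repeated string literals may or may not be shared, depending on GHC's optimization settings.

-- All four fields hold a pointer to the single heap object bound to
-- aString, so the characters of "A" exist here only once.
shared :: RightHandSide
shared = DownStep aString aString aString aString

-- Four separate "A" literals; GHC may or may not common them up
-- (e.g. via CSE at -O), so sharing is not guaranteed here.
maybeNotShared :: RightHandSide
maybeNotShared = DownStep "A" "A" "A" "A"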

The second approach:

data Production = NonTerminal String Production Production Production Production
                | Terminal String Int 

type ProductionMap = Map String Production

Here the term "Terminal" is a bit misleading, because it is actually the production that has a terminal as its right-hand side.
The same matrix:

a = Terminal "A" 0
b = NonTerminal "B" a a a a
productions = Map.fromList [("A", a), ("B", b)]

and the similar question: how often is the production a stored internally by Haskell? We may drop the names inside the productions if we don't need them, but we are not sure about this yet.

So let's say we have a grammar with about 1000 productions. Which approach will consume less memory?

Finally, a question about integers in Haskell: currently we are planning to have names as Strings, but we could easily switch to integer names, because with 1000 productions we will have names with more than 4 chars (which I assume take more than 32 bits?). How does Haskell handle this? Is an Int always 32 bits, and does Integer allocate only the memory it really needs?
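As a quick way to check those assumptions (a standalone sketch, not part of the library):

import Data.Bits (finiteBitSize)

main :: IO ()
main = do
  -- Int is a fixed-size machine integer. The language report only
  -- guarantees at least 30 bits; with GHC it matches the platform
  -- word size (usually 64 bits today, 32 on older systems).
  print (finiteBitSize (0 :: Int))  -- e.g. 64
  print (maxBound :: Int)
  -- Integer is arbitrary-precision and grows as needed.
  print (2 ^ 100 :: Integer)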

I also read through this: Devising test of Haskell's value/reference semantics - but I can't figure out what that exactly means for us - I'm more of an imperative Java child than a good functional programmer 😛

Best Answer

You can expand your matrix grammar into an ADT with perfect sharing with a little bit of trickery:

{-# LANGUAGE DeriveFunctor, DeriveFoldable, DeriveTraversable #-}

import Data.Map
import Data.Foldable
import Data.Functor
import Data.Traversable

-- | Type synonym for non-terminal symbols
type NonTerminal = String

-- | Data type for the right hand side of a production
data RHS a = DownStep NonTerminal NonTerminal NonTerminal NonTerminal | Terminal a
  deriving (Eq,Ord,Show,Read,Functor, Foldable, Traversable)

data G a = G NonTerminal (Map NonTerminal (RHS a))
  deriving (Eq,Ord,Show,Read,Functor)

-- | A fully expanded matrix as a quadtree: four quadrants or a leaf
data M a = Q (M a) (M a) (M a) (M a) | T a
  deriving (Functor, Foldable, Traversable)

-- | Expand a grammar into a matrix, sharing the expansion of each
-- non-terminal across all of its occurrences
tabulate :: G a -> M a
tabulate (G s pm) = loeb (expand <$> pm) ! s where
  expand (DownStep a11 a12 a21 a22) m = Q (m!a11) (m!a12) (m!a21) (m!a22)
  expand (Terminal a)               _ = T a

-- | Löb's combinator: apply every function in the container to the
-- container's own final result, tying the knot lazily
loeb :: Functor f => f (f b -> b) -> f b
loeb x = xs where xs = fmap ($xs) x

Here I generalized your grammars to allow for any data type, not just Int, and tabulate will take the grammar and expand it by folding it in upon itself using loeb.
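For instance, with the definitions above, the grammar for the [[0, 0], [0, 0]] matrix from the question expands like this (a sketch; example is just an illustrative name):

example :: G Int
example = G "B" (fromList [ ("A", Terminal 0)
                          , ("B", DownStep "A" "A" "A" "A") ])

expanded :: M Int
expanded = tabulate example
-- expanded evaluates to Q (T 0) (T 0) (T 0) (T 0), where all four
-- quadrants end up pointing at one shared (T 0) node, thanks to loeb.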

loeb is described in Dan Piponi's article "From Löb's Theorem to Spreadsheet Evaluation".

The resulting expansion as an ADT physically takes no more memory than the original grammar -- in fact it takes a fair bit less, because it doesn't need the extra log-factor for the Map spine, and it doesn't need to store the strings at all.

Unlike the naive expansion, using loeb lets me 'tie the knot' and share the thunks for all occurrences of the same non-terminal.
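To see the contrast, a naive expansion would look roughly like this (a sketch; tabulateNaive is our name, not part of the answer): every occurrence of a non-terminal is re-expanded, so identical subtrees are rebuilt instead of shared.

tabulateNaive :: G a -> M a
tabulateNaive (G s pm) = go s where
  -- Each recursive call builds a fresh subtree, even when the same
  -- non-terminal occurs many times, so no sharing survives.
  go nt = case pm ! nt of
    DownStep a11 a12 a21 a22 -> Q (go a11) (go a12) (go a21) (go a22)
    Terminal a               -> T a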

If you want to dip more into the theory of all of this, we can see that RHS could be turned into a base functor:

data RHS t nt = Q nt nt nt nt | L t

and then my M type is just the fixed point of that Functor.

M a ~ Mu (RHS a)

while G a would consist of a chosen string and a map from strings to (RHS a String).

We can then expand G into M by lazily looking up each entry in a map of expanded strings.
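Spelled out, the fixed-point view looks like this (a sketch; Mu, RHSF, and M' are our names, with the base functor renamed so its constructors don't clash with M's):

{-# LANGUAGE DeriveFunctor #-}

-- The standard least fixed point of a functor
newtype Mu f = In (f (Mu f))

-- The base functor from above, with fresh constructor names
data RHSF t nt = QF nt nt nt nt | LF t
  deriving Functor

-- M a is isomorphic to the fixed point taken in the non-terminal slot
type M' a = Mu (RHSF a)
-- In (QF m11 m12 m21 m22) plays the role of Q m11 m12 m21 m22,
-- and In (LF x) plays the role of T x.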

This is sort of the dual of what is done in the data-reify package, which can take such a base functor and something like M, and recover the moral equivalent of your G from it. It uses a different type for the non-terminal names, which is basically just an Int:

data Graph e = Graph [(Unique, e Unique)] Unique

and provides a combinator

reifyGraph :: MuRef s => s -> IO (Graph (DeRef s))

which can be used with an appropriate instance on the above data types to get a graph (a MatrixGrammar) out of an arbitrary matrix. It won't deduplicate identical but separately stored quadrants, but it will recover all of the sharing present in the original graph.
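For illustration, such an instance might look roughly like this (a sketch under our assumptions: it reuses the renamed RHSF base functor from above, and the instance itself is ours, not shipped by the package):

{-# LANGUAGE TypeFamilies #-}

import Data.Reify

instance MuRef (M a) where
  type DeRef (M a) = RHSF a
  -- Visit each child with f, rebuilding one layer of the base functor
  mapDeRef f (Q m11 m12 m21 m22) =
    QF <$> f m11 <*> f m12 <*> f m21 <*> f m22
  mapDeRef _ (T x) = pure (LF x)

-- reifyGraph (tabulate example) :: IO (Graph (RHSF Int))
-- yields a Graph whose nodes are named by Unique (a synonym for Int).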
