Haskell – Memory footprint of Haskell data types

algebraic-data-typesghchaskellmemory-management

How can I find the actual amount of memory required to store a value of some data type in Haskell (mostly with GHC)? Is it possible to evaluate it at runtime (e.g. in GHCi) or is it possible to estimate memory requirements of a compound data type from its components?

In general, if memory requirements of types a and b are known, what is the memory overhead of algebraic data types such as:

data Uno = Uno a
data Due = Due a b

For example, how many bytes in memory do these values occupy?

1 :: Int8
1 :: Integer
2^100 :: Integer
\x -> x + 1
(1 :: Int8, 2 :: Int8)
[1] :: [Int8]
Just (1 :: Int8)
Nothing

I understand that actual memory allocation is higher due to delayed garbage collection. It may be significantly different due to lazy evaluation (and thunk size is not related to the size of the value). The question is, given a data type, how much memory does its value take when fully evaluated?

I found there is a :set +s option in GHCi to see memory stats, but it is not clear how to estimate the memory footprint of a single value.

Best Answer

(The following applies to GHC, other compilers may use different storage conventions)

Rule of thumb: a constructor costs one word for a header, and one word for each field. Exception: a constructor with no fields (like Nothing or True) takes no space, because GHC creates a single instance of these constructors and shares it amongst all uses.

A word is 4 bytes on a 32-bit machine, and 8 bytes on a 64-bit machine.

So e.g.

data Uno = Uno a
data Due = Due a b

an Uno takes 2 words, and a Due takes 3.

The Int type is defined as

data Int = I# Int#

now, Int# takes one word, so Int takes 2 in total. Most unboxed types take one word, the exceptions being Int64#, Word64#, and Double# (on a 32-bit machine) which take 2. GHC actually has a cache of small values of type Int and Char, so in many cases these take no heap space at all. A String only requires space for the list cells, unless you use Chars > 255.

An Int8 has identical representation to Int. Integer is defined like this:

data Integer
  = S# Int#                            -- small integers
  | J# Int# ByteArray#                 -- large integers

so a small Integer (S#) takes 2 words, but a large integer takes a variable amount of space depending on its value. A ByteArray# takes 2 words (header + size) plus space for the array itself.

Note that a constructor defined with newtype is free. newtype is purely a compile-time idea, and it takes up no space and costs no instructions at run time.

More details in The Layout of Heap Objects in the GHC Commentary.

Related Solutions

Haskell – Getting started with Haskell

I'm going to order this guide by the level of skill you have in Haskell, going from an absolute beginner right up to an expert. Note that this process will take many months (years?), so it is rather long.

Absolute Beginner

Firstly, Haskell is capable of anything, with enough skill. It is very fast (behind only C and C++ in my experience), and can be used for anything from simulations to servers, guis and web applications.

However there are some problems that are easier to write for a beginner in Haskell than others. Mathematical problems and list process programs are good candidates for this, as they only require the most basic of Haskell knowledge to be able to write.

Some good guides to learning the very basics of Haskell are the Happy Learn Haskell Tutorial and the first 6 chapters of Learn You a Haskell for Great Good (or its JupyterLab adaptation). While reading these, it is a very good idea to also be solving simple problems with what you know.

Another two good resources are Haskell Programming from first principles, and Programming in Haskell. They both come with exercises for each chapter, so you have small simple problems matching what you learned on the last few pages.

A good list of problems to try is the haskell 99 problems page. These start off very basic, and get more difficult as you go on. It is very good practice doing a lot of those, as they let you practice your skills in recursion and higher order functions. I would recommend skipping any problems that require randomness as that is a bit more difficult in Haskell. Check this SO question in case you want to test your solutions with QuickCheck (see Intermediate below).

Once you have done a few of those, you could move on to doing a few of the Project Euler problems. These are sorted by how many people have completed them, which is a fairly good indication of difficulty. These test your logic and Haskell more than the previous problems, but you should still be able to do the first few. A big advantage Haskell has with these problems is Integers aren't limited in size. To complete some of these problems, it will be useful to have read chapters 7 and 8 of learn you a Haskell as well.

Beginner

After that you should have a fairly good handle on recursion and higher order functions, so it would be a good time to start doing some more real world problems. A very good place to start is Real World Haskell (online book, you can also purchase a hard copy). I found the first few chapters introduced too much too quickly for someone who has never done functional programming/used recursion before. However with the practice you would have had from doing the previous problems you should find it perfectly understandable.

Working through the problems in the book is a great way of learning how to manage abstractions and building reusable components in Haskell. This is vital for people used to object-orientated (oo) programming, as the normal oo abstraction methods (oo classes) don't appear in Haskell (Haskell has type classes, but they are very different to oo classes, more like oo interfaces). I don't think it is a good idea to skip chapters, as each introduces a lot new ideas that are used in later chapters.

After a while you will get to chapter 14, the dreaded monads chapter (dum dum dummmm). Almost everyone who learns Haskell has trouble understanding monads, due to how abstract the concept is. I can't think of any concept in another language that is as abstract as monads are in functional programming. Monads allows many ideas (such as IO operations, computations that might fail, parsing,...) to be unified under one idea. So don't feel discouraged if after reading the monads chapter you don't really understand them. I found it useful to read many different explanations of monads; each one gives a new perspective on the problem. Here is a very good list of monad tutorials. I highly recommend the All About Monads, but the others are also good.

Also, it takes a while for the concepts to truly sink in. This comes through use, but also through time. I find that sometimes sleeping on a problem helps more than anything else! Eventually, the idea will click, and you will wonder why you struggled to understand a concept that in reality is incredibly simple. It is awesome when this happens, and when it does, you might find Haskell to be your favorite imperative programming language :)

To make sure that you are understanding Haskell type system perfectly, you should try to solve 20 intermediate haskell exercises. Those exercises using fun names of functions like "furry" and "banana" and helps you to have a good understanding of some basic functional programming concepts if you don't have them already. Nice way to spend your evening with a bunch of papers covered with arrows, unicorns, sausages and furry bananas.

Intermediate

Once you understand Monads, I think you have made the transition from a beginner Haskell programmer to an intermediate haskeller. So where to go from here? The first thing I would recommend (if you haven't already learnt them from learning monads) is the various types of monads, such as Reader, Writer and State. Again, Real world Haskell and All about monads gives great coverage of this. To complete your monad training learning about monad transformers is a must. These let you combine different types of Monads (such as a Reader and State monad) into one. This may seem useless to begin with, but after using them for a while you will wonder how you lived without them.

Now you can finish the real world Haskell book if you want. Skipping chapters now doesn't really matter, as long as you have monads down pat. Just choose what you are interested in.

With the knowledge you would have now, you should be able to use most of the packages on cabal (well the documented ones at least...), as well as most of the libraries that come with Haskell. A list of interesting libraries to try would be:

Parsec: for parsing programs and text. Much better than using regexps. Excellent documentation, also has a real world Haskell chapter.
QuickCheck: A very cool testing program. What you do is write a predicate that should always be true (eg length (reverse lst) == length lst). You then pass the predicate the QuickCheck, and it will generate a lot of random values (in this case lists) and test that the predicate is true for all results. See also the online manual.
HUnit: Unit testing in Haskell.
gtk2hs: The most popular gui framework for Haskell, lets you write gtk applications.
happstack: A web development framework for Haskell. Doesn't use databases, instead a data type store. Pretty good docs (other popular frameworks would be snap and yesod).

Also, there are many concepts (like the Monad concept) that you should eventually learn. This will be easier than learning Monads the first time, as your brain will be used to dealing with the level of abstraction involved. A very good overview for learning about these high level concepts and how they fit together is the Typeclassopedia.

Applicative: An interface like Monads, but less powerful. Every Monad is Applicative, but not vice versa. This is useful as there are some types that are Applicative but are not Monads. Also, code written using the Applicative functions is often more composable than writing the equivalent code using the Monad functions. See Functors, Applicative Functors and Monoids from the learn you a haskell guide.
Foldable,Traversable: Typeclasses that abstract many of the operations of lists, so that the same functions can be applied to other container types. See also the haskell wiki explanation.
Monoid: A Monoid is a type that has a zero (or mempty) value, and an operation, notated <> that joins two Monoids together, such that x <> mempty = mempty <> x = x and x <> (y <> z) = (x <> y) <> z. These are called identity and associativity laws. Many types are Monoids, such as numbers, with mempty = 0 and <> = +. This is useful in many situations.
Arrows: Arrows are a way of representing computations that take an input and return an output. A function is the most basic type of arrow, but there are many other types. The library also has many very useful functions for manipulating arrows - they are very useful even if only used with plain old Haskell functions.
Arrays: the various mutable/immutable arrays in Haskell.
ST Monad: lets you write code with a mutable state that runs very quickly, while still remaining pure outside the monad. See the link for more details.
FRP: Functional Reactive Programming, a new, experimental way of writing code that handles events, triggers, inputs and outputs (such as a gui). I don't know much about this though. Paul Hudak's talk about yampa is a good start.

There are a lot of new language features you should have a look at. I'll just list them, you can find lots of info about them from google, the haskell wikibook, the haskellwiki.org site and ghc documentation.

Multiparameter type classes/functional dependencies
Type families
Existentially quantified types
Phantom types
GADTS
others...

A lot of Haskell is based around category theory, so you may want to look into that. A good starting point is Category Theory for Computer Scientist. If you don't want to buy the book, the author's related article is also excellent.

Finally you will want to learn more about the various Haskell tools. These include:

ghc (and all its features)
cabal: the Haskell package system
darcs: a distributed version control system written in Haskell, very popular for Haskell programs.
haddock: a Haskell automatic documentation generator

While learning all these new libraries and concepts, it is very useful to be writing a moderate-sized project in Haskell. It can be anything (e.g. a small game, data analyser, website, compiler). Working on this will allow you to apply many of the things you are now learning. You stay at this level for ages (this is where I'm at).

Expert

It will take you years to get to this stage (hello from 2009!), but from here I'm guessing you start writing phd papers, new ghc extensions, and coming up with new abstractions.

Getting Help

Finally, while at any stage of learning, there are multiple places for getting information. These are:

the #haskell irc channel
the mailing lists. These are worth signing up for just to read the discussions that take place - some are very interesting.
other places listed on the haskell.org home page

Conclusion

Well this turned out longer than I expected... Anyway, I think it is a very good idea to become proficient in Haskell. It takes a long time, but that is mainly because you are learning a completely new way of thinking by doing so. It is not like learning Ruby after learning Java, but like learning Java after learning C. Also, I am finding that my object-orientated programming skills have improved as a result of learning Haskell, as I am seeing many new ways of abstracting ideas.

C++ – Can a local variable’s memory be accessed outside its scope

How can it be? Isn't the memory of a local variable inaccessible outside its function?

You rent a hotel room. You put a book in the top drawer of the bedside table and go to sleep. You check out the next morning, but "forget" to give back your key. You steal the key!

A week later, you return to the hotel, do not check in, sneak into your old room with your stolen key, and look in the drawer. Your book is still there. Astonishing!

How can that be? Aren't the contents of a hotel room drawer inaccessible if you haven't rented the room?

Well, obviously that scenario can happen in the real world no problem. There is no mysterious force that causes your book to disappear when you are no longer authorized to be in the room. Nor is there a mysterious force that prevents you from entering a room with a stolen key.

The hotel management is not required to remove your book. You didn't make a contract with them that said that if you leave stuff behind, they'll shred it for you. If you illegally re-enter your room with a stolen key to get it back, the hotel security staff is not required to catch you sneaking in. You didn't make a contract with them that said "if I try to sneak back into my room later, you are required to stop me." Rather, you signed a contract with them that said "I promise not to sneak back into my room later", a contract which you broke.

In this situation anything can happen. The book can be there -- you got lucky. Someone else's book can be there and yours could be in the hotel's furnace. Someone could be there right when you come in, tearing your book to pieces. The hotel could have removed the table and book entirely and replaced it with a wardrobe. The entire hotel could be just about to be torn down and replaced with a football stadium, and you are going to die in an explosion while you are sneaking around.

You don't know what is going to happen; when you checked out of the hotel and stole a key to illegally use later, you gave up the right to live in a predictable, safe world because you chose to break the rules of the system.

C++ is not a safe language. It will cheerfully allow you to break the rules of the system. If you try to do something illegal and foolish like going back into a room you're not authorized to be in and rummaging through a desk that might not even be there anymore, C++ is not going to stop you. Safer languages than C++ solve this problem by restricting your power -- by having much stricter control over keys, for example.

UPDATE

Holy goodness, this answer is getting a lot of attention. (I'm not sure why -- I considered it to be just a "fun" little analogy, but whatever.)

I thought it might be germane to update this a bit with a few more technical thoughts.

Compilers are in the business of generating code which manages the storage of the data manipulated by that program. There are lots of different ways of generating code to manage memory, but over time two basic techniques have become entrenched.

The first is to have some sort of "long lived" storage area where the "lifetime" of each byte in the storage -- that is, the period of time when it is validly associated with some program variable -- cannot be easily predicted ahead of time. The compiler generates calls into a "heap manager" that knows how to dynamically allocate storage when it is needed and reclaim it when it is no longer needed.

The second method is to have a “short-lived” storage area where the lifetime of each byte is well known. Here, the lifetimes follow a “nesting” pattern. The longest-lived of these short-lived variables will be allocated before any other short-lived variables, and will be freed last. Shorter-lived variables will be allocated after the longest-lived ones, and will be freed before them. The lifetime of these shorter-lived variables is “nested” within the lifetime of longer-lived ones.

Local variables follow the latter pattern; when a method is entered, its local variables come alive. When that method calls another method, the new method's local variables come alive. They'll be dead before the first method's local variables are dead. The relative order of the beginnings and endings of lifetimes of storages associated with local variables can be worked out ahead of time.

For this reason, local variables are usually generated as storage on a "stack" data structure, because a stack has the property that the first thing pushed on it is going to be the last thing popped off.

It's like the hotel decides to only rent out rooms sequentially, and you can't check out until everyone with a room number higher than you has checked out.

So let's think about the stack. In many operating systems you get one stack per thread and the stack is allocated to be a certain fixed size. When you call a method, stuff is pushed onto the stack. If you then pass a pointer to the stack back out of your method, as the original poster does here, that's just a pointer to the middle of some entirely valid million-byte memory block. In our analogy, you check out of the hotel; when you do, you just checked out of the highest-numbered occupied room. If no one else checks in after you, and you go back to your room illegally, all your stuff is guaranteed to still be there in this particular hotel.

We use stacks for temporary stores because they are really cheap and easy. An implementation of C++ is not required to use a stack for storage of locals; it could use the heap. It doesn't, because that would make the program slower.

An implementation of C++ is not required to leave the garbage you left on the stack untouched so that you can come back for it later illegally; it is perfectly legal for the compiler to generate code that turns back to zero everything in the "room" that you just vacated. It doesn't because again, that would be expensive.

An implementation of C++ is not required to ensure that when the stack logically shrinks, the addresses that used to be valid are still mapped into memory. The implementation is allowed to tell the operating system "we're done using this page of stack now. Until I say otherwise, issue an exception that destroys the process if anyone touches the previously-valid stack page". Again, implementations do not actually do that because it is slow and unnecessary.

Instead, implementations let you make mistakes and get away with it. Most of the time. Until one day something truly awful goes wrong and the process explodes.

This is problematic. There are a lot of rules and it is very easy to break them accidentally. I certainly have many times. And worse, the problem often only surfaces when memory is detected to be corrupt billions of nanoseconds after the corruption happened, when it is very hard to figure out who messed it up.

More memory-safe languages solve this problem by restricting your power. In "normal" C# there simply is no way to take the address of a local and return it or store it for later. You can take the address of a local, but the language is cleverly designed so that it is impossible to use it after the lifetime of the local ends. In order to take the address of a local and pass it back, you have to put the compiler in a special "unsafe" mode, and put the word "unsafe" in your program, to call attention to the fact that you are probably doing something dangerous that could be breaking the rules.

For further reading:

What if C# did allow returning references? Coincidentally that is the subject of today's blog post:

https://ericlippert.com/2011/06/23/ref-returns-and-ref-locals/
Why do we use stacks to manage memory? Are value types in C# always stored on the stack? How does virtual memory work? And many more topics in how the C# memory manager works. Many of these articles are also germane to C++ programmers:

https://ericlippert.com/tag/memory-management/

Best Answer

Related Solutions

Haskell – Getting started with Haskell

C++ – Can a local variable’s memory be accessed outside its scope

UPDATE

Related Topic