Haskell: Lists, Arrays, Vectors, Sequences

haskell

I'm learning Haskell and read a couple of articles regarding performance differences of Haskell lists and (insert your language)'s arrays.

Being a learner I obviously just use lists without even thinking about performance difference.
I recently started investigating and found numerous data structure libraries available in Haskell.

Can someone please explain the difference between Lists, Arrays, Vectors, Sequences without going very deep in computer science theory of data structures?

Also, are there some common patterns where you would use one data structure instead of another?

Are there any other forms of data structures that I am missing and might be useful?

Best Answer

Lists Rock

By far the most friendly data structure for sequential data in Haskell is the List

 data [a] = a:[a] | []

Lists give you ϴ(1) cons and pattern matching. The standard library, and for that matter the prelude, is full of useful list functions that should litter your code (foldr,map,filter). Lists are persistant , aka purely functional, which is very nice. Haskell lists aren't really "lists" because they are coinductive (other languages call these streams) so things like

ones :: [Integer]
ones = 1:ones

twos = map (+1) ones

tenTwos = take 10 twos

work wonderfully. Infinite data structures rock.

Lists in Haskell provide an interface much like iterators in imperative languages (because of laziness). So, it makes sense that they are widely used.

On the other hand

The first problem with lists is that to index into them (!!) takes ϴ(k) time, which is annoying. Also, appends can be slow ++, but Haskell's lazy evaluation model means that these can be treated as fully amortized, if they happen at all.

The second problem with lists is that they have poor data locality. Real processors incur high constants when objects in memory are not laid out next to each other. So, in C++ std::vector has faster "snoc" (putting objects at the end) than any pure linked list data structure I know of, although this is not a persistant data structure so less friendly than Haskell's lists.

The third problem with lists is that they have poor space efficiency. Bunches of extra pointers push up your storage (by a constant factor).

Sequences Are Functional

Data.Sequence is internally based on finger trees (I know, you don't want to know this) which means that they have some nice properties

Purely functional. Data.Sequence is a fully persistant data structure.
Darn fast access to the beginning and end of the tree. ϴ(1) (amortized) to get the first or last element, or to append trees. At the thing lists are fastest at, Data.Sequence is at most a constant slower.
ϴ(log n) access to the middle of the sequence. This includes inserting values to make new sequences
High quality API

On the other hand, Data.Sequence doesn't do much for the data locality problem, and only works for finite collections (it is less lazy than lists)

Arrays are not for the faint of heart

Arrays are one of the most important data structures in CS, but they dont fit very well with the lazy pure functional world. Arrays provide ϴ(1) access to the middle of the collection and exceptionally good data locality/constant factors. But, since they dont fit very well into Haskell, they are a pain to use. There are actually a multitude of different array types in the current standard library. These include fully persistant arrays, mutable arrays for the IO monad, mutable arrays for the ST monad, and un-boxed versions of the above. For more check out the haskell wiki

Vector is a "better" Array

The Data.Vector package provides all of the array goodness, in a higher level and cleaner API. Unless you really know what you are doing, you should use these if you need array like performance. Of-course, some caveats still apply--mutable array like data structures just dont play nice in pure lazy languages. Still, sometimes you want that O(1) performance, and Data.Vector gives it to you in a useable package.

You have other options

If you just want lists with the ability to efficiently insert at the end, you can use a difference list. The best example of lists screwing up performance tends to come from [Char] which the prelude has aliased as String. Char lists are convient, but tend to run on the order of 20 times slower than C strings, so feel free to use Data.Text or the very fast Data.ByteString. I'm sure there are other sequence oriented libraries I'm not thinking of right now.

Conclusion

90+% of the time I need a sequential collection in Haskell lists are the right data structure. Lists are like iterators, functions that consume lists can easily be used with any of these other data structures using the toList functions they come with. In a better world the prelude would be fully parametric as to what container type it uses, but currently [] litters the standard library. So, using lists (almost) every where is definitely okay.
You can get fully parametric versions of most of the list functions (and are noble to use them)

Prelude.map                --->  Prelude.fmap (works for every Functor)
Prelude.foldr/foldl/etc    --->  Data.Foldable.foldr/foldl/etc
Prelude.sequence           --->  Data.Traversable.sequence
etc

In fact, Data.Traversable defines an API that is more or less universal across any thing "list like".

Still, although you can be good and write only fully parametric code, most of us are not and use list all over the place. If you are learning, I strongly suggest you do too.

EDIT: Based on comments I realize I never explained when to use Data.Vector vs Data.Sequence. Arrays and Vectors provide extremely fast indexing and slicing operations, but are fundamentally transient (imperative) data structures. Pure functional data structures like Data.Sequence and [] let efficiently produce new values from old values as if you had modified the old values.

  newList oldList = 7 : drop 5 oldList

doesn't modify old list, and it doesn't have to copy it. So even if oldList is incredibly long, this "modification" will be very fast. Similarly

  newSequence newValue oldSequence = Sequence.update 3000 newValue oldSequence

will produce a new sequence with a newValue for in the place of its 3000 element. Again, it doesn't destroy the old sequence, it just creates a new one. But, it does this very efficiently, taking O(log(min(k,k-n)) where n is the length of the sequence, and k is the index you modify.

You cant easily do this with Vectors and Arrays. They can be modified but that is real imperative modification, and so cant be done in regular Haskell code. That means operations in the Vector package that make modifications like snoc and cons have to copy the entire vector so take O(n) time. The only exception to this is that you can use the mutable version (Vector.Mutable) inside the ST monad (or IO) and do all your modifications just like you would in an imperative language. When you are done, you "freeze" your vector to turn in into the immutable structure you want to use with pure code.

My feeling is that you should default to using Data.Sequence if a list is not appropriate. Use Data.Vector only if your usage pattern doesn't involve making many modifications, or if you need extremely high performance within the ST/IO monads.

If all this talk of the ST monad is leaving you confused: all the more reason to stick to pure fast and beautiful Data.Sequence.

Related Solutions

Haskell – Getting started with Haskell

I'm going to order this guide by the level of skill you have in Haskell, going from an absolute beginner right up to an expert. Note that this process will take many months (years?), so it is rather long.

Absolute Beginner

Firstly, Haskell is capable of anything, with enough skill. It is very fast (behind only C and C++ in my experience), and can be used for anything from simulations to servers, guis and web applications.

However there are some problems that are easier to write for a beginner in Haskell than others. Mathematical problems and list process programs are good candidates for this, as they only require the most basic of Haskell knowledge to be able to write.

Some good guides to learning the very basics of Haskell are the Happy Learn Haskell Tutorial and the first 6 chapters of Learn You a Haskell for Great Good (or its JupyterLab adaptation). While reading these, it is a very good idea to also be solving simple problems with what you know.

Another two good resources are Haskell Programming from first principles, and Programming in Haskell. They both come with exercises for each chapter, so you have small simple problems matching what you learned on the last few pages.

A good list of problems to try is the haskell 99 problems page. These start off very basic, and get more difficult as you go on. It is very good practice doing a lot of those, as they let you practice your skills in recursion and higher order functions. I would recommend skipping any problems that require randomness as that is a bit more difficult in Haskell. Check this SO question in case you want to test your solutions with QuickCheck (see Intermediate below).

Once you have done a few of those, you could move on to doing a few of the Project Euler problems. These are sorted by how many people have completed them, which is a fairly good indication of difficulty. These test your logic and Haskell more than the previous problems, but you should still be able to do the first few. A big advantage Haskell has with these problems is Integers aren't limited in size. To complete some of these problems, it will be useful to have read chapters 7 and 8 of learn you a Haskell as well.

Beginner

After that you should have a fairly good handle on recursion and higher order functions, so it would be a good time to start doing some more real world problems. A very good place to start is Real World Haskell (online book, you can also purchase a hard copy). I found the first few chapters introduced too much too quickly for someone who has never done functional programming/used recursion before. However with the practice you would have had from doing the previous problems you should find it perfectly understandable.

Working through the problems in the book is a great way of learning how to manage abstractions and building reusable components in Haskell. This is vital for people used to object-orientated (oo) programming, as the normal oo abstraction methods (oo classes) don't appear in Haskell (Haskell has type classes, but they are very different to oo classes, more like oo interfaces). I don't think it is a good idea to skip chapters, as each introduces a lot new ideas that are used in later chapters.

After a while you will get to chapter 14, the dreaded monads chapter (dum dum dummmm). Almost everyone who learns Haskell has trouble understanding monads, due to how abstract the concept is. I can't think of any concept in another language that is as abstract as monads are in functional programming. Monads allows many ideas (such as IO operations, computations that might fail, parsing,...) to be unified under one idea. So don't feel discouraged if after reading the monads chapter you don't really understand them. I found it useful to read many different explanations of monads; each one gives a new perspective on the problem. Here is a very good list of monad tutorials. I highly recommend the All About Monads, but the others are also good.

Also, it takes a while for the concepts to truly sink in. This comes through use, but also through time. I find that sometimes sleeping on a problem helps more than anything else! Eventually, the idea will click, and you will wonder why you struggled to understand a concept that in reality is incredibly simple. It is awesome when this happens, and when it does, you might find Haskell to be your favorite imperative programming language :)

To make sure that you are understanding Haskell type system perfectly, you should try to solve 20 intermediate haskell exercises. Those exercises using fun names of functions like "furry" and "banana" and helps you to have a good understanding of some basic functional programming concepts if you don't have them already. Nice way to spend your evening with a bunch of papers covered with arrows, unicorns, sausages and furry bananas.

Intermediate

Once you understand Monads, I think you have made the transition from a beginner Haskell programmer to an intermediate haskeller. So where to go from here? The first thing I would recommend (if you haven't already learnt them from learning monads) is the various types of monads, such as Reader, Writer and State. Again, Real world Haskell and All about monads gives great coverage of this. To complete your monad training learning about monad transformers is a must. These let you combine different types of Monads (such as a Reader and State monad) into one. This may seem useless to begin with, but after using them for a while you will wonder how you lived without them.

Now you can finish the real world Haskell book if you want. Skipping chapters now doesn't really matter, as long as you have monads down pat. Just choose what you are interested in.

With the knowledge you would have now, you should be able to use most of the packages on cabal (well the documented ones at least...), as well as most of the libraries that come with Haskell. A list of interesting libraries to try would be:

Parsec: for parsing programs and text. Much better than using regexps. Excellent documentation, also has a real world Haskell chapter.
QuickCheck: A very cool testing program. What you do is write a predicate that should always be true (eg length (reverse lst) == length lst). You then pass the predicate the QuickCheck, and it will generate a lot of random values (in this case lists) and test that the predicate is true for all results. See also the online manual.
HUnit: Unit testing in Haskell.
gtk2hs: The most popular gui framework for Haskell, lets you write gtk applications.
happstack: A web development framework for Haskell. Doesn't use databases, instead a data type store. Pretty good docs (other popular frameworks would be snap and yesod).

Also, there are many concepts (like the Monad concept) that you should eventually learn. This will be easier than learning Monads the first time, as your brain will be used to dealing with the level of abstraction involved. A very good overview for learning about these high level concepts and how they fit together is the Typeclassopedia.

Applicative: An interface like Monads, but less powerful. Every Monad is Applicative, but not vice versa. This is useful as there are some types that are Applicative but are not Monads. Also, code written using the Applicative functions is often more composable than writing the equivalent code using the Monad functions. See Functors, Applicative Functors and Monoids from the learn you a haskell guide.
Foldable,Traversable: Typeclasses that abstract many of the operations of lists, so that the same functions can be applied to other container types. See also the haskell wiki explanation.
Monoid: A Monoid is a type that has a zero (or mempty) value, and an operation, notated <> that joins two Monoids together, such that x <> mempty = mempty <> x = x and x <> (y <> z) = (x <> y) <> z. These are called identity and associativity laws. Many types are Monoids, such as numbers, with mempty = 0 and <> = +. This is useful in many situations.
Arrows: Arrows are a way of representing computations that take an input and return an output. A function is the most basic type of arrow, but there are many other types. The library also has many very useful functions for manipulating arrows - they are very useful even if only used with plain old Haskell functions.
Arrays: the various mutable/immutable arrays in Haskell.
ST Monad: lets you write code with a mutable state that runs very quickly, while still remaining pure outside the monad. See the link for more details.
FRP: Functional Reactive Programming, a new, experimental way of writing code that handles events, triggers, inputs and outputs (such as a gui). I don't know much about this though. Paul Hudak's talk about yampa is a good start.

There are a lot of new language features you should have a look at. I'll just list them, you can find lots of info about them from google, the haskell wikibook, the haskellwiki.org site and ghc documentation.

Multiparameter type classes/functional dependencies
Type families
Existentially quantified types
Phantom types
GADTS
others...

A lot of Haskell is based around category theory, so you may want to look into that. A good starting point is Category Theory for Computer Scientist. If you don't want to buy the book, the author's related article is also excellent.

Finally you will want to learn more about the various Haskell tools. These include:

ghc (and all its features)
cabal: the Haskell package system
darcs: a distributed version control system written in Haskell, very popular for Haskell programs.
haddock: a Haskell automatic documentation generator

While learning all these new libraries and concepts, it is very useful to be writing a moderate-sized project in Haskell. It can be anything (e.g. a small game, data analyser, website, compiler). Working on this will allow you to apply many of the things you are now learning. You stay at this level for ages (this is where I'm at).

Expert

It will take you years to get to this stage (hello from 2009!), but from here I'm guessing you start writing phd papers, new ghc extensions, and coming up with new abstractions.

Getting Help

Finally, while at any stage of learning, there are multiple places for getting information. These are:

the #haskell irc channel
the mailing lists. These are worth signing up for just to read the discussions that take place - some are very interesting.
other places listed on the haskell.org home page

Conclusion

Well this turned out longer than I expected... Anyway, I think it is a very good idea to become proficient in Haskell. It takes a long time, but that is mainly because you are learning a completely new way of thinking by doing so. It is not like learning Ruby after learning Java, but like learning Java after learning C. Also, I am finding that my object-orientated programming skills have improved as a result of learning Haskell, as I am seeing many new ways of abstracting ideas.

Python – Speed comparison with Project Euler: C vs Python vs Erlang vs Haskell

Using GHC 7.0.3, gcc 4.4.6, Linux 2.6.29 on an x86_64 Core2 Duo (2.5GHz) machine, compiling using ghc -O2 -fllvm -fforce-recomp for Haskell and gcc -O3 -lm for C.

Your C routine runs in 8.4 seconds (faster than your run probably because of -O3)
The Haskell solution runs in 36 seconds (due to the -O2 flag)
Your factorCount' code isn't explicitly typed and defaulting to Integer (thanks to Daniel for correcting my misdiagnosis here!). Giving an explicit type signature (which is standard practice anyway) using Int and the time changes to 11.1 seconds
in factorCount' you have needlessly called fromIntegral. A fix results in no change though (the compiler is smart, lucky for you).
You used mod where rem is faster and sufficient. This changes the time to 8.5 seconds.
factorCount' is constantly applying two extra arguments that never change (number, sqrt). A worker/wrapper transformation gives us:

 $ time ./so
 842161320  

 real    0m7.954s  
 user    0m7.944s  
 sys     0m0.004s

That's right, 7.95 seconds. Consistently half a second faster than the C solution. Without the -fllvm flag I'm still getting 8.182 seconds, so the NCG backend is doing well in this case too.

Conclusion: Haskell is awesome.

Resulting Code

factorCount number = factorCount' number isquare 1 0 - (fromEnum $ square == fromIntegral isquare)
    where square = sqrt $ fromIntegral number
          isquare = floor square

factorCount' :: Int -> Int -> Int -> Int -> Int
factorCount' number sqrt candidate0 count0 = go candidate0 count0
  where
  go candidate count
    | candidate > sqrt = count
    | number `rem` candidate == 0 = go (candidate + 1) (count + 2)
    | otherwise = go (candidate + 1) count

nextTriangle index triangle
    | factorCount triangle > 1000 = triangle
    | otherwise = nextTriangle (index + 1) (triangle + index + 1)

main = print $ nextTriangle 1 1

EDIT: So now that we've explored that, lets address the questions

Question 1: Do erlang, python and haskell lose speed due to using arbitrary length integers or don't they as long as the values are less than MAXINT?

In Haskell, using Integer is slower than Int but how much slower depends on the computations performed. Luckily (for 64 bit machines) Int is sufficient. For portability sake you should probably rewrite my code to use Int64 or Word64 (C isn't the only language with a long).

Question 2: Why is haskell so slow? Is there a compiler flag that turns off the brakes or is it my implementation? (The latter is quite probable as haskell is a book with seven seals to me.)

Question 3: Can you offer me some hints how to optimize these implementations without changing the way I determine the factors? Optimization in any way: nicer, faster, more "native" to the language.

That was what I answered above. The answer was

0) Use optimization via -O2
1) Use fast (notably: unbox-able) types when possible
2) rem not mod (a frequently forgotten optimization) and
3) worker/wrapper transformation (perhaps the most common optimization).

Question 4: Do my functional implementations permit LCO and hence avoid adding unnecessary frames onto the call stack?

Yes, that wasn't the issue. Good work and glad you considered this.