Haskell – Specialization with Constraints

ghchaskell

I'm having problems getting GHC to specialize a function with a class constraint. I have a minimal example of my problem here: Foo.hs and Main.hs. The two files compile (GHC 7.6.2, ghc -O3 Main) and run.

NOTE:
Foo.hs is really stripped down. If you want to see why the constraint is needed, you can see a little more code here. If I put the code in a single file or make many other minor changes, GHC simply inlines the call to plusFastCyc. This will not happen in the real code because plusFastCyc is too large for GHC to inline, even when marked INLINE. The point is to specialize the call to plusFastCyc, not inline it. plusFastCyc is called in many places in the real code, so duplicating such a large function would not be desirable even if I could force GHC to do it.

The code of interest is the plusFastCyc in Foo.hs, reproduced here:

{-# INLINEABLE plusFastCyc #-}
{-# SPECIALIZE plusFastCyc :: 
         forall m . (Factored m Int) => 
              (FastCyc (VT U.Vector m) Int) -> 
                   (FastCyc (VT U.Vector m) Int) -> 
                        (FastCyc (VT U.Vector m) Int) #-}

-- Although the next specialization makes `fcTest` fast,
-- it isn't useful to me in my real program because the phantom type M is reified
-- {-# SPECIALIZE plusFastCyc :: 
--          FastCyc (VT U.Vector M) Int -> 
--               FastCyc (VT U.Vector M) Int -> 
--                    FastCyc (VT U.Vector M) Int #-}

plusFastCyc :: (Num (t r)) => (FastCyc t r) -> (FastCyc t r) -> (FastCyc t r)
plusFastCyc (PowBasis v1) (PowBasis v2) = PowBasis $ v1 + v2

The Main.hs file has two drivers: vtTest, which runs in ~3 seconds, and fcTest, which runs in ~83 seconds when compiled with -O3 using the forall'd specialization.

The core shows that for the vtTest test, the addition code is being specialized to Unboxed vectors over Ints, etc, while generic vector code is used for fcTest.
On line 10, you can see that GHC does write a specialized version of plusFastCyc, compared to the generic version on line 167.
The rule for the specialization is on line 225. I believe this rule should fire on line 270. (main6 calls iterate main8 y, so main8 is where plusFastCyc should be specialized.)

My goal is to make fcTest as fast as vtTest by specializing plusFastCyc. I've found two ways to do this:

Explicity call inline from GHC.Exts in fcTest.
Remove the Factored m Int constraint on plusFastCyc.

Option 1 is unsatisfactory because in the actual code base plusFastCyc is a frequently used operation and a very large function, so it should not be inlined at every use. Rather, GHC should call a specialized version of plusFastCyc. Option 2 is not really an option because I need the constraint in the real code.

I've tried a variety of options using (and not using) INLINE, INLINABLE, and SPECIALIZE, but nothing seems to work. (EDIT: I may have stripped out too much of plusFastCyc to make my example small, so INLINE might cause the function to be inlined. This doesn't happen in my real code because plusFastCyc is so large.) In this particular example, I'm not getting any match_co: needs more cases or RULE: LHS too complicated to desugar (and here) warnings, though I was getting many match_co warnings before minimizing the example. Presumably, the "problem" is the Factored m Int constraint in the rule; if I make changes to that constraint, fcTest runs as fast as vtTest.

Am I doing something GHC just doesn't like? Why won't GHC specialize the plusFastCyc, and how can I make it?

UPDATE

The problem persists in GHC 7.8.2, so this question is still relevant.

Best Answer

GHC also gives an option to SPECIALIZE a type-class instance declaration. I tried this with the (expanded) code of Foo.hs, by putting the following:

instance (Num r, V.Vector v r, Factored m r) => Num (VT v m r) where 
    {-# SPECIALIZE instance ( Factored m Int => Num (VT U.Vector m Int)) #-}
    VT x + VT y = VT $ V.zipWith (+) x y

This change, though, did not achieve the desired speedup. What did achieve that performance improvement was manually adding a specialized instance for the type VT U.Vector m Int with the same function definitions, as follows:

instance (Factored m Int) => Num (VT U.Vector m Int) where 
    VT x + VT y = VT $ V.zipWith (+) x y

This requires adding OverlappingInstances and FlexibleInstances in LANGUAGE.

Interestingly, in the example program, the speedup obtained with the overlapping instance remains even if you remove every SPECIALIZE and INLINABLE pragma.

Related Solutions

Haskell – Getting started with Haskell

I'm going to order this guide by the level of skill you have in Haskell, going from an absolute beginner right up to an expert. Note that this process will take many months (years?), so it is rather long.

Absolute Beginner

Firstly, Haskell is capable of anything, with enough skill. It is very fast (behind only C and C++ in my experience), and can be used for anything from simulations to servers, guis and web applications.

However there are some problems that are easier to write for a beginner in Haskell than others. Mathematical problems and list process programs are good candidates for this, as they only require the most basic of Haskell knowledge to be able to write.

Some good guides to learning the very basics of Haskell are the Happy Learn Haskell Tutorial and the first 6 chapters of Learn You a Haskell for Great Good (or its JupyterLab adaptation). While reading these, it is a very good idea to also be solving simple problems with what you know.

Another two good resources are Haskell Programming from first principles, and Programming in Haskell. They both come with exercises for each chapter, so you have small simple problems matching what you learned on the last few pages.

A good list of problems to try is the haskell 99 problems page. These start off very basic, and get more difficult as you go on. It is very good practice doing a lot of those, as they let you practice your skills in recursion and higher order functions. I would recommend skipping any problems that require randomness as that is a bit more difficult in Haskell. Check this SO question in case you want to test your solutions with QuickCheck (see Intermediate below).

Once you have done a few of those, you could move on to doing a few of the Project Euler problems. These are sorted by how many people have completed them, which is a fairly good indication of difficulty. These test your logic and Haskell more than the previous problems, but you should still be able to do the first few. A big advantage Haskell has with these problems is Integers aren't limited in size. To complete some of these problems, it will be useful to have read chapters 7 and 8 of learn you a Haskell as well.

Beginner

After that you should have a fairly good handle on recursion and higher order functions, so it would be a good time to start doing some more real world problems. A very good place to start is Real World Haskell (online book, you can also purchase a hard copy). I found the first few chapters introduced too much too quickly for someone who has never done functional programming/used recursion before. However with the practice you would have had from doing the previous problems you should find it perfectly understandable.

Working through the problems in the book is a great way of learning how to manage abstractions and building reusable components in Haskell. This is vital for people used to object-orientated (oo) programming, as the normal oo abstraction methods (oo classes) don't appear in Haskell (Haskell has type classes, but they are very different to oo classes, more like oo interfaces). I don't think it is a good idea to skip chapters, as each introduces a lot new ideas that are used in later chapters.

After a while you will get to chapter 14, the dreaded monads chapter (dum dum dummmm). Almost everyone who learns Haskell has trouble understanding monads, due to how abstract the concept is. I can't think of any concept in another language that is as abstract as monads are in functional programming. Monads allows many ideas (such as IO operations, computations that might fail, parsing,...) to be unified under one idea. So don't feel discouraged if after reading the monads chapter you don't really understand them. I found it useful to read many different explanations of monads; each one gives a new perspective on the problem. Here is a very good list of monad tutorials. I highly recommend the All About Monads, but the others are also good.

Also, it takes a while for the concepts to truly sink in. This comes through use, but also through time. I find that sometimes sleeping on a problem helps more than anything else! Eventually, the idea will click, and you will wonder why you struggled to understand a concept that in reality is incredibly simple. It is awesome when this happens, and when it does, you might find Haskell to be your favorite imperative programming language :)

To make sure that you are understanding Haskell type system perfectly, you should try to solve 20 intermediate haskell exercises. Those exercises using fun names of functions like "furry" and "banana" and helps you to have a good understanding of some basic functional programming concepts if you don't have them already. Nice way to spend your evening with a bunch of papers covered with arrows, unicorns, sausages and furry bananas.

Intermediate

Once you understand Monads, I think you have made the transition from a beginner Haskell programmer to an intermediate haskeller. So where to go from here? The first thing I would recommend (if you haven't already learnt them from learning monads) is the various types of monads, such as Reader, Writer and State. Again, Real world Haskell and All about monads gives great coverage of this. To complete your monad training learning about monad transformers is a must. These let you combine different types of Monads (such as a Reader and State monad) into one. This may seem useless to begin with, but after using them for a while you will wonder how you lived without them.

Now you can finish the real world Haskell book if you want. Skipping chapters now doesn't really matter, as long as you have monads down pat. Just choose what you are interested in.

With the knowledge you would have now, you should be able to use most of the packages on cabal (well the documented ones at least...), as well as most of the libraries that come with Haskell. A list of interesting libraries to try would be:

Parsec: for parsing programs and text. Much better than using regexps. Excellent documentation, also has a real world Haskell chapter.
QuickCheck: A very cool testing program. What you do is write a predicate that should always be true (eg length (reverse lst) == length lst). You then pass the predicate the QuickCheck, and it will generate a lot of random values (in this case lists) and test that the predicate is true for all results. See also the online manual.
HUnit: Unit testing in Haskell.
gtk2hs: The most popular gui framework for Haskell, lets you write gtk applications.
happstack: A web development framework for Haskell. Doesn't use databases, instead a data type store. Pretty good docs (other popular frameworks would be snap and yesod).

Also, there are many concepts (like the Monad concept) that you should eventually learn. This will be easier than learning Monads the first time, as your brain will be used to dealing with the level of abstraction involved. A very good overview for learning about these high level concepts and how they fit together is the Typeclassopedia.

Applicative: An interface like Monads, but less powerful. Every Monad is Applicative, but not vice versa. This is useful as there are some types that are Applicative but are not Monads. Also, code written using the Applicative functions is often more composable than writing the equivalent code using the Monad functions. See Functors, Applicative Functors and Monoids from the learn you a haskell guide.
Foldable,Traversable: Typeclasses that abstract many of the operations of lists, so that the same functions can be applied to other container types. See also the haskell wiki explanation.
Monoid: A Monoid is a type that has a zero (or mempty) value, and an operation, notated <> that joins two Monoids together, such that x <> mempty = mempty <> x = x and x <> (y <> z) = (x <> y) <> z. These are called identity and associativity laws. Many types are Monoids, such as numbers, with mempty = 0 and <> = +. This is useful in many situations.
Arrows: Arrows are a way of representing computations that take an input and return an output. A function is the most basic type of arrow, but there are many other types. The library also has many very useful functions for manipulating arrows - they are very useful even if only used with plain old Haskell functions.
Arrays: the various mutable/immutable arrays in Haskell.
ST Monad: lets you write code with a mutable state that runs very quickly, while still remaining pure outside the monad. See the link for more details.
FRP: Functional Reactive Programming, a new, experimental way of writing code that handles events, triggers, inputs and outputs (such as a gui). I don't know much about this though. Paul Hudak's talk about yampa is a good start.

There are a lot of new language features you should have a look at. I'll just list them, you can find lots of info about them from google, the haskell wikibook, the haskellwiki.org site and ghc documentation.

Multiparameter type classes/functional dependencies
Type families
Existentially quantified types
Phantom types
GADTS
others...

A lot of Haskell is based around category theory, so you may want to look into that. A good starting point is Category Theory for Computer Scientist. If you don't want to buy the book, the author's related article is also excellent.

Finally you will want to learn more about the various Haskell tools. These include:

ghc (and all its features)
cabal: the Haskell package system
darcs: a distributed version control system written in Haskell, very popular for Haskell programs.
haddock: a Haskell automatic documentation generator

While learning all these new libraries and concepts, it is very useful to be writing a moderate-sized project in Haskell. It can be anything (e.g. a small game, data analyser, website, compiler). Working on this will allow you to apply many of the things you are now learning. You stay at this level for ages (this is where I'm at).

Expert

It will take you years to get to this stage (hello from 2009!), but from here I'm guessing you start writing phd papers, new ghc extensions, and coming up with new abstractions.

Getting Help

Finally, while at any stage of learning, there are multiple places for getting information. These are:

the #haskell irc channel
the mailing lists. These are worth signing up for just to read the discussions that take place - some are very interesting.
other places listed on the haskell.org home page

Conclusion

Well this turned out longer than I expected... Anyway, I think it is a very good idea to become proficient in Haskell. It takes a long time, but that is mainly because you are learning a completely new way of thinking by doing so. It is not like learning Ruby after learning Java, but like learning Java after learning C. Also, I am finding that my object-orientated programming skills have improved as a result of learning Haskell, as I am seeing many new ways of abstracting ideas.

Haskell – What optimizations can GHC be expected to perform reliably

This GHC Trac page also explains the passes fairly well. This page explains the optimization ordering, though, like the majority of the Trac Wiki, it is out of date.

For specifics, the best thing to do is probably to look at how a specific program is compiled. The best way to see which optimizations are being performed is to compile the program verbosely, using the -v flag. Taking as an example the first piece of Haskell I could find on my computer:

Glasgow Haskell Compiler, Version 7.4.2, stage 2 booted by GHC version 7.4.1
Using binary package database: /usr/lib/ghc-7.4.2/package.conf.d/package.cache
wired-in package ghc-prim mapped to ghc-prim-0.2.0.0-7d3c2c69a5e8257a04b2c679c40e2fa7
wired-in package integer-gmp mapped to integer-gmp-0.4.0.0-af3a28fdc4138858e0c7c5ecc2a64f43
wired-in package base mapped to base-4.5.1.0-6e4c9bdc36eeb9121f27ccbbcb62e3f3
wired-in package rts mapped to builtin_rts
wired-in package template-haskell mapped to template-haskell-2.7.0.0-2bd128e15c2d50997ec26a1eaf8b23bf
wired-in package dph-seq not found.
wired-in package dph-par not found.
Hsc static flags: -static
*** Chasing dependencies:
Chasing modules from: *SleepSort.hs
Stable obj: [Main]
Stable BCO: []
Ready for upsweep
  [NONREC
      ModSummary {
         ms_hs_date = Tue Oct 18 22:22:11 CDT 2011
         ms_mod = main:Main,
         ms_textual_imps = [import (implicit) Prelude, import Control.Monad,
                            import Control.Concurrent, import System.Environment]
         ms_srcimps = []
      }]
*** Deleting temp files:
Deleting: 
compile: input file SleepSort.hs
Created temporary directory: /tmp/ghc4784_0
*** Checking old interface for main:Main:
[1 of 1] Compiling Main             ( SleepSort.hs, SleepSort.o )
*** Parser:
*** Renamer/typechecker:
*** Desugar:
Result size of Desugar (after optimization) = 79
*** Simplifier:
Result size of Simplifier iteration=1 = 87
Result size of Simplifier iteration=2 = 93
Result size of Simplifier iteration=3 = 83
Result size of Simplifier = 83
*** Specialise:
Result size of Specialise = 83
*** Float out(FOS {Lam = Just 0, Consts = True, PAPs = False}):
Result size of Float out(FOS {Lam = Just 0,
                              Consts = True,
                              PAPs = False}) = 95
*** Float inwards:
Result size of Float inwards = 95
*** Simplifier:
Result size of Simplifier iteration=1 = 253
Result size of Simplifier iteration=2 = 229
Result size of Simplifier = 229
*** Simplifier:
Result size of Simplifier iteration=1 = 218
Result size of Simplifier = 218
*** Simplifier:
Result size of Simplifier iteration=1 = 283
Result size of Simplifier iteration=2 = 226
Result size of Simplifier iteration=3 = 202
Result size of Simplifier = 202
*** Demand analysis:
Result size of Demand analysis = 202
*** Worker Wrapper binds:
Result size of Worker Wrapper binds = 202
*** Simplifier:
Result size of Simplifier = 202
*** Float out(FOS {Lam = Just 0, Consts = True, PAPs = True}):
Result size of Float out(FOS {Lam = Just 0,
                              Consts = True,
                              PAPs = True}) = 210
*** Common sub-expression:
Result size of Common sub-expression = 210
*** Float inwards:
Result size of Float inwards = 210
*** Liberate case:
Result size of Liberate case = 210
*** Simplifier:
Result size of Simplifier iteration=1 = 206
Result size of Simplifier = 206
*** SpecConstr:
Result size of SpecConstr = 206
*** Simplifier:
Result size of Simplifier = 206
*** Tidy Core:
Result size of Tidy Core = 206
writeBinIface: 4 Names
writeBinIface: 28 dict entries
*** CorePrep:
Result size of CorePrep = 224
*** Stg2Stg:
*** CodeGen:
*** CodeOutput:
*** Assembler:
'/usr/bin/gcc' '-fno-stack-protector' '-Wl,--hash-size=31' '-Wl,--reduce-memory-overheads' '-I.' '-c' '/tmp/ghc4784_0/ghc4784_0.s' '-o' 'SleepSort.o'
Upsweep completely successful.
*** Deleting temp files:
Deleting: /tmp/ghc4784_0/ghc4784_0.c /tmp/ghc4784_0/ghc4784_0.s
Warning: deleting non-existent /tmp/ghc4784_0/ghc4784_0.c
link: linkables are ...
LinkableM (Sat Sep 29 20:21:02 CDT 2012) main:Main
   [DotO SleepSort.o]
Linking SleepSort ...
*** C Compiler:
'/usr/bin/gcc' '-fno-stack-protector' '-Wl,--hash-size=31' '-Wl,--reduce-memory-overheads' '-c' '/tmp/ghc4784_0/ghc4784_0.c' '-o' '/tmp/ghc4784_0/ghc4784_0.o' '-DTABLES_NEXT_TO_CODE' '-I/usr/lib/ghc-7.4.2/include'
*** C Compiler:
'/usr/bin/gcc' '-fno-stack-protector' '-Wl,--hash-size=31' '-Wl,--reduce-memory-overheads' '-c' '/tmp/ghc4784_0/ghc4784_0.s' '-o' '/tmp/ghc4784_0/ghc4784_1.o' '-DTABLES_NEXT_TO_CODE' '-I/usr/lib/ghc-7.4.2/include'
*** Linker:
'/usr/bin/gcc' '-fno-stack-protector' '-Wl,--hash-size=31' '-Wl,--reduce-memory-overheads' '-o' 'SleepSort' 'SleepSort.o' '-L/usr/lib/ghc-7.4.2/base-4.5.1.0' '-L/usr/lib/ghc-7.4.2/integer-gmp-0.4.0.0' '-L/usr/lib/ghc-7.4.2/ghc-prim-0.2.0.0' '-L/usr/lib/ghc-7.4.2' '/tmp/ghc4784_0/ghc4784_0.o' '/tmp/ghc4784_0/ghc4784_1.o' '-lHSbase-4.5.1.0' '-lHSinteger-gmp-0.4.0.0' '-lgmp' '-lHSghc-prim-0.2.0.0' '-lHSrts' '-lm' '-lrt' '-ldl' '-u' 'ghczmprim_GHCziTypes_Izh_static_info' '-u' 'ghczmprim_GHCziTypes_Czh_static_info' '-u' 'ghczmprim_GHCziTypes_Fzh_static_info' '-u' 'ghczmprim_GHCziTypes_Dzh_static_info' '-u' 'base_GHCziPtr_Ptr_static_info' '-u' 'base_GHCziWord_Wzh_static_info' '-u' 'base_GHCziInt_I8zh_static_info' '-u' 'base_GHCziInt_I16zh_static_info' '-u' 'base_GHCziInt_I32zh_static_info' '-u' 'base_GHCziInt_I64zh_static_info' '-u' 'base_GHCziWord_W8zh_static_info' '-u' 'base_GHCziWord_W16zh_static_info' '-u' 'base_GHCziWord_W32zh_static_info' '-u' 'base_GHCziWord_W64zh_static_info' '-u' 'base_GHCziStable_StablePtr_static_info' '-u' 'ghczmprim_GHCziTypes_Izh_con_info' '-u' 'ghczmprim_GHCziTypes_Czh_con_info' '-u' 'ghczmprim_GHCziTypes_Fzh_con_info' '-u' 'ghczmprim_GHCziTypes_Dzh_con_info' '-u' 'base_GHCziPtr_Ptr_con_info' '-u' 'base_GHCziPtr_FunPtr_con_info' '-u' 'base_GHCziStable_StablePtr_con_info' '-u' 'ghczmprim_GHCziTypes_False_closure' '-u' 'ghczmprim_GHCziTypes_True_closure' '-u' 'base_GHCziPack_unpackCString_closure' '-u' 'base_GHCziIOziException_stackOverflow_closure' '-u' 'base_GHCziIOziException_heapOverflow_closure' '-u' 'base_ControlziExceptionziBase_nonTermination_closure' '-u' 'base_GHCziIOziException_blockedIndefinitelyOnMVar_closure' '-u' 'base_GHCziIOziException_blockedIndefinitelyOnSTM_closure' '-u' 'base_ControlziExceptionziBase_nestedAtomically_closure' '-u' 'base_GHCziWeak_runFinalizzerBatch_closure' '-u' 'base_GHCziTopHandler_flushStdHandles_closure' '-u' 'base_GHCziTopHandler_runIO_closure' '-u' 'base_GHCziTopHandler_runNonIO_closure' '-u' 'base_GHCziConcziIO_ensureIOManagerIsRunning_closure' '-u' 'base_GHCziConcziSync_runSparks_closure' '-u' 'base_GHCziConcziSignal_runHandlers_closure'
link: done
*** Deleting temp files:
Deleting: /tmp/ghc4784_0/ghc4784_1.o /tmp/ghc4784_0/ghc4784_0.s /tmp/ghc4784_0/ghc4784_0.o /tmp/ghc4784_0/ghc4784_0.c
*** Deleting temp dirs:
Deleting: /tmp/ghc4784_0

Looking from the first *** Simplifier: to the last, where all the optimization phases happen, we see quite a lot.

First of all, the Simplifier runs between almost all the phases. This makes writing many passes much easier. For example, when implementing many optimizations, they simply create rewrite rules to propagate the changes instead of having to do it manually. The simplifier encompasses a number of simple optimizations, including inlining and fusion. The main limitation of this that I know is that GHC refuses to inline recursive functions, and that things have to be named correctly for fusion to work.

Next, we see a full list of all the optimizations performed:

Specialise

The basic idea of specialization is to remove polymorphism and overloading by identifying places where the function is called and creating versions of the function that aren't polymorphic - they are specific to the types they are called with. You can also tell the compiler to do this with the SPECIALISE pragma. As an example, take a factorial function:
```
fac :: (Num a, Eq a) => a -> a
fac 0 = 1
fac n = n * fac (n - 1)
```
As the compiler doesn't know any properties of the multiplication that is to be used, it cannot optimize this at all. If however, it sees that it is used on an Int, it now can create a new version, differing only in the type:
```
fac_Int :: Int -> Int
fac_Int 0 = 1
fac_Int n = n * fac_Int (n - 1)
```
Next, rules mentioned below can fire, and you end up with something working on unboxed Ints, which is much faster than the original. Another way to look at specialisation is partial application on type class dictionaries and type variables.

The source here has a load of notes in it.
Float out

EDIT: I apparently misunderstood this before. My explanation has completely changed.

The basic idea of this is to move computations that shouldn't be repeated out of functions. For example, suppose we had this:
```
\x -> let y = expensive in x+y
```
In the above lambda, every time the function is called, y is recomputed. A better function, which floating out produces, is
```
let y = expensive in \x -> x+y
```
To facilitate the process, other transformations may be applied. For example, this happens:
```
 \x -> x + f 2
 \x -> x + let f_2 = f 2 in f_2
 \x -> let f_2 = f 2 in x + f_2
 let f_2 = f 2 in \x -> x + f_2
```
Again, repeated computation is saved.

The source is very readable in this case.

At the moment bindings between two adjacent lambdas are not floated. For example, this does not happen:
```
\x y -> let t = x+x in ...
```
going to
```
 \x -> let t = x+x in \y -> ...
```
Float inwards

Quoting the source code,

The main purpose of floatInwards is floating into branches of a case, so that we don't allocate things, save them on the stack, and then discover that they aren't needed in the chosen branch.

As an example, suppose we had this expression:
```
let x = big in
    case v of
        True -> x + 1
        False -> 0
```
If v evaluates to False, then by allocating x, which is presumably some big thunk, we have wasted time and space. Floating inwards fixes this, producing this:
```
case v of
    True -> let x = big in x + 1
    False -> let x = big in 0
```
, which is subsequently replaced by the simplifier with
```
case v of
    True -> big + 1
    False -> 0
```
This paper, although covering other topics, gives a fairly clear introduction. Note that despite their names, floating in and floating out don't get in an infinite loop for two reasons:
1. Float in floats lets into case statements, while float out deals with functions.
2. There is a fixed order of passes, so they shouldn't be alternating infinitely.

Demand analysis

Demand analysis, or strictness analysis is less of a transformation and more, like the name suggests, of an information gathering pass. The compiler finds functions that always evaluate their arguments (or at least some of them), and passes those arguments using call-by-value, instead of call-by-need. Since you get to evade the overheads of thunks, this is often much faster. Many performance problems in Haskell arise from either this pass failing, or code simply not being strict enough. A simple example is the difference between using foldr, foldl, and foldl' to sum a list of integers - the first causes stack overflow, the second causes heap overflow, and the last runs fine, because of strictness. This is probably the easiest to understand and best documented of all of these. I believe that polymorphism and CPS code often defeat this.
Worker Wrapper binds

The basic idea of the worker/wrapper transformation is to do a tight loop on a simple structure, converting to and from that structure at the ends. For example, take this function, which calculates the factorial of a number.
```
factorial :: Int -> Int
factorial 0 = 1
factorial n = n * factorial (n - 1)
```
Using the definition of Int in GHC, we have
```
factorial :: Int -> Int
factorial (I# 0#) = I# 1#
factorial (I# n#) = I# (n# *# case factorial (I# (n# -# 1#)) of
    I# down# -> down#)
```
Notice how the code is covered in I#s? We can remove them by doing this:
```
factorial :: Int -> Int
factorial (I# n#) = I# (factorial# n#)

factorial# :: Int# -> Int#
factorial# 0# = 1#
factorial# n# = n# *# factorial# (n# -# 1#)
```
Although this specific example could have also been done by SpecConstr, the worker/wrapper transformation is very general in the things it can do.
Common sub-expression

This is another really simple optimization that is very effective, like strictness analysis. The basic idea is that if you have two expressions that are the same, they will have the same value. For example, if fib is a Fibonacci number calculator, CSE will transform
```
fib x + fib x
```
into
```
let fib_x = fib x in fib_x + fib_x
```
which cuts the computation in half. Unfortunately, this can occasionally get in the way of other optimizations. Another problem is that the two expressions have to be in the same place and that they have to be syntactically the same, not the same by value. For example, CSE won't fire in the following code without a bunch of inlining:
```
x = (1 + (2 + 3)) + ((1 + 2) + 3)
y = f x
z = g (f x) y
```
However, if you compile via llvm, you may get some of this combined, due to its Global Value Numbering pass.
Liberate case

This seems to be a terribly documented transformation, besides the fact that it can cause code explosion. Here is a reformatted (and slightly rewritten) version of the little documentation I found:

This module walks over Core, and looks for case on free variables. The criterion is: if there is a case on a free variable on the route to the recursive call, then the recursive call is replaced with an unfolding. For example, in
```
f = \ t -> case v of V a b -> a : f t
```
the inner f is replaced. to make
```
f = \ t -> case v of V a b -> a : (letrec f = \ t -> case v of V a b -> a : f t in f) t
```
Note the need for shadowing. Simplifying, we get
```
f = \ t -> case v of V a b -> a : (letrec f = \ t -> a : f t in f t)
```
This is better code, because a is free inside the inner letrec, rather than needing projection from v. Note that this deals with free variables, unlike SpecConstr, which deals with arguments that are of known form.

See below for more information about SpecConstr.
SpecConstr - this transforms programs like
```
f (Left x) y = somthingComplicated1
f (Right x) y = somethingComplicated2
```
into
```
f_Left x y = somethingComplicated1
f_Right x y = somethingComplicated2

{-# INLINE f #-}
f (Left x) = f_Left x
f (Right x) = f_Right x
```
As an extended example, take this definition of last:
```
last [] = error "last: empty list"
last (x:[]) = x
last (x:x2:xs) = last (x2:xs)
```
We first transform it to
```
last_nil = error "last: empty list"
last_cons x [] = x
last_cons x (x2:xs) = last (x2:xs)

{-# INLINE last #-}
last [] = last_nil
last (x : xs) = last_cons x xs
```
Next, the simplifier runs, and we have
```
last_nil = error "last: empty list"
last_cons x [] = x
last_cons x (x2:xs) = last_cons x2 xs

{-# INLINE last #-}
last [] = last_nil
last (x : xs) = last_cons x xs
```
Note that the program is now faster, as we are not repeatedly boxing and unboxing the front of the list. Also note that the inlining is crucial, as it allows the new, more efficient definitions to actually be used, as well as making recursive definitions better.

SpecConstr is controlled by a number of heuristics. The ones mentioned in the paper are as such:
1. The lambdas are explicit and the arity is a.
2. The right hand side is "sufficiently small," something controlled by a flag.
3. The function is recursive, and the specializable call is used in the right hand side.
4. All of the arguments to the function are present.
5. At least one of the arguments is a constructor application.
6. That argument is case-analysed somewhere in the function.
However, the heuristics have almost certainly changed. In fact, the paper mentions an alternative sixth heuristic:

Specialise on an argument x only if x is only scrutinised by a case, and is not passed to an ordinary function, or returned as part of the result.

This was a very small file (12 lines) and so possibly didn't trigger that many optimizations (though I think it did them all). This also doesn't tell you why it picked those passes and why it put them in that order.

Best Answer

Related Solutions

Haskell – Getting started with Haskell

Haskell – What optimizations can GHC be expected to perform reliably

Related Topic