Functional Programming – Methods on Collections in Scala

collectionsfunctional programmingfunctionshigher-order-functionsscala

I'm learning Scala and am a little bewildered by all the methods (higher-order functions) available on the collections. Which ones produce more results than the original collection, which ones produce less, and which are most appropriate for a given problem? Though I'm studying Scala, I think this would pertain to most modern functional languages (Clojure, Haskell) and also to Java 8 which introduces these methods on Java collections.

Specifically, right now I'm wondering about map with filter vs. fold/reduce. I was delighted that using foldRight() can yield the same result as a map(…).filter(…) with only one traversal of the underlying collection. But a friend pointed out that foldRight() may force sequential processing while map() is friendlier to being processed by multiple processors in parallel. Maybe this is why mapReduce() is so popular?

More generally, I'm still sometimes surprised when I chain several of these methods together to get back a List(List()) or to pass a List(List()) and get back just a List(). For instance, when would I use:

collection.map(a => a.map(b => ...))

vs.

collection.map(a => ...).map(b => ...)

The for/yield command does nothing to help this confusion. Am I asking about the difference between a "fold" and "unfold" operation?

Am I trying to jam too many questions into one? I think there may be an underlying concept that, if I understood it, might answer all these questions, or at least tie the answers together.

Best Answer

You can roughly emulate map/filter by using foldRight (or foldLeft) with cons as the given reduction function. For example foldRight(f, L, initial) where L = [x,y,z] might be expanded to

f(x, f(y, f(z, initial)))

This means that you can't process x until you've processed f(z, initial) and then f(y, ...). You're creating a dependency that doesn't exist with map/filter.

As for map ordering...

(A) collection.map(a => a.map(b => ...))

(A) takes a collection and then applies the map function to each element, this implies that each element is a "mappable" collection. This inner map will then return the processed list (which is a list) and so each element of collection will remain a collection. This is how you would map a function onto each element of a list of lists, and the result would again be a list of lists.

(B) collection.map(a => ...).map(b => ...)

(B) processes each element of a list and forms those results into a new list. This list is then processed again with a second map function giving yet another list.

(A) is for processing the inner elements of a list ("sub-elements" if you like). (B) is for processing a list multiple times. If we write (B) concretely as

collection.map(a => f(a)).map(b => g(b))

we can see it is equivalent to

collection.map(a => g(f(a)))

It might help you to write them out as for loops. (A) will use embedded loops where as (B) will use two sequential loops.

This is not the difference between fold and unfold. Neither (A) nor (B) is a fold as the list structure present, however deeply nested, is preserved. Fold creates a scalar from a list, while unfold (not too familiar with it) takes a scalar and produces a list according to some rule.

EDIT: @Giorgio was right to suggest flatMap. It's an interesting variation on map. Let's say we're mapping a list of Xs to a list of Ys, so we pass map a function f:X->Y. Suppose we have another calculation g that takes an X but returns multiple Ys g:X->[Y]. In this case we would use flatMap. map takes the results of f and puts them into a list. flatMap takes the results of g and concatenates them together.

Ex. Say we have a list of lists L = [[1,2,3],[4,5,6]]

L.map(a => a.map(b => b * 2))

gives [[2,4,6],[8,10,12]]. But say we want each number doubled in one list, no sub-lists. Then we would call

L.flatMap(a => a.map(b => b * 2))

which gives [2,4,6,8,10,12]. Note that our inner function a => a.map(b => b * 2) takes and returns a list.