Why do Scala, and frameworks like Spark and Scalding, have both `reduce` and `foldLeft`? So then what's the difference between `reduce` and `fold`?
Scala – Difference between reduce and foldLeft/fold in functional programming (particularly Scala and Scala APIs)
Tags: fold, functional-programming, reduce, scala, scalding
Related Solutions
tl;dr
- `class C` defines a class, just as in Java or C++.
- `object O` creates a singleton object `O` as an instance of some anonymous class; it can be used to hold static members that are not associated with instances of some class.
- `object O extends T` makes the object `O` an instance of `trait T`; you can then pass `O` anywhere a `T` is expected.
- if there is a `class C`, then `object C` is the companion object of class `C`; note that the companion object is not automatically an instance of `C`.
Also see Scala documentation for object and class.
`object` as host of static members

Most often, you need an `object` to hold methods and values/variables that shall be available without having to first instantiate an instance of some class. This use is closely related to `static` members in Java.
object A {
  def twice(i: Int): Int = 2 * i
}
You can then call the above method using `A.twice(2)`.

If `twice` were a member of some class `A`, then you would need to make an instance first:
class A() {
  def twice(i: Int): Int = 2 * i
}

val a = new A()
a.twice(2)
You can see how redundant this is, as `twice` does not require any instance-specific data.

`object` as a special named instance

You can also use the `object` itself as some special instance of a class or trait. When you do this, your object needs to extend some `trait` in order to become an instance of a subclass of it.
Consider the following code:
object A extends B with C {
  ...
}
This declaration first declares an anonymous (inaccessible) class that extends both `B` and `C`, and instantiates a single instance of this class named `A`.

This means `A` can be passed to functions expecting objects of type `B` or `C`, or `B with C`.
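To make that concrete, here is a minimal sketch; the members of `B` and the function `process` are made up for illustration:

trait B { def name: String }
trait C

// A is the single, named instance of an anonymous class extending both traits.
object A extends B with C {
  val name = "A"
}

def process(b: B): Unit = println(b.name)

process(A) // accepted wherever a B (or a C, or a B with C) is expected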
Additional Features of `object`

There also exist some special features of objects in Scala. I recommend reading the official documentation.
- `def apply(...)` enables the usual method-name-less syntax of `A(...)`
- `def unapply(...)` allows you to create custom pattern-matching extractors
- if accompanying a class of the same name, the object assumes a special role when resolving implicit parameters
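A minimal sketch of the first two features (the `Point` class is made up for illustration; the object is also a companion, since it shares the class's name):

class Point(val x: Int, val y: Int)

object Point {
  // apply: lets callers write Point(1, 2) instead of new Point(1, 2)
  def apply(x: Int, y: Int): Point = new Point(x, y)

  // unapply: makes Point usable as a pattern-matching extractor
  def unapply(p: Point): Option[(Int, Int)] = Some((p.x, p.y))
}

Point(1, 2) match {
  case Point(x, y) => println(s"x=$x, y=$y") // prints: x=1, y=2
}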
As so many others have said, the object assigned to a `val` cannot be replaced, and the object assigned to a `var` can. However, said object can have its internal state modified. For example:
class A(n: Int) {
  var value = n
}

class B(n: Int) {
  val value = new A(n)
}

object Test {
  def main(args: Array[String]): Unit = {
    val x = new B(5)
    x = new B(6)       // Doesn't compile: x is a val, so the object assigned to it can't be replaced.
    x.value = new A(6) // Doesn't compile: B.value is a val, so it can't be replaced with a new object either.
    x.value.value = 6  // Works: A.value is a var, so it can receive a new value.
  }
}
So, even though we can't change the object assigned to `x`, we could change the state of that object. At the root of it, however, there was a `var`.
Now, immutability is a good thing for many reasons. First, if an object doesn't change internal state, you don't have to worry if some other part of your code is changing it. For example:
val x = new B(0)
f(x) // f is some function that takes a B
if (x.value.value == 0)
  println("f didn't do anything to x")
else
  println("f did something to x")
This becomes particularly important with multithreaded systems. In a multithreaded system, the following can happen:
val x = new B(1)
f(x) // f may hand x to another thread
if (x.value.value == 1) {
  print(x.value.value) // Can be different from 1!
}
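For instance, here is a minimal, self-contained sketch of that race, with a hand-rolled thread standing in for whatever `f` might do (the thread body is made up for illustration; Scala 2.12+ converts the lambda to a `Runnable`):

val x = new B(1)
val t = new Thread(() => x.value.value = 42) // plays the role of f handing x to another thread
t.start()
if (x.value.value == 1)
  println(x.value.value) // may print 1 or 42, depending on timing
t.join()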
If you use `val` exclusively, and only use immutable data structures (that is, avoid arrays, everything in `scala.collection.mutable`, etc.), you can rest assured this won't happen. That is, unless there's some code, perhaps even a framework, doing reflection tricks -- reflection can change "immutable" values, unfortunately.
That's one reason, but there is another reason for it. When you use `var`, you can be tempted into reusing the same `var` for multiple purposes. This has some problems:
- It will be more difficult for people reading the code to know what the value of a variable is in a certain part of the code.
- You may forget to re-initialize the variable in some code path, and end up passing wrong values downstream in the code.
Simply put, using `val` is safer and leads to more readable code.

We can, then, go the other direction. If `val` is so much better, why have `var` at all? Well, some languages did take that route, but there are situations in which mutability improves performance, a lot.
For example, take an immutable `Queue`. When you either `enqueue` or `dequeue` things in it, you get a new `Queue` object. How, then, would you go about processing all items in it?

I'll go through that with an example. Let's say you have a queue of digits, and you want to compose a number out of them. For example, if I have a queue with 2, 1, 3, in that order, I want to get back the number 213. Let's first solve it with a `mutable.Queue`:
def toNum(q: scala.collection.mutable.Queue[Int]) = {
  var num = 0
  while (!q.isEmpty) {
    num *= 10
    num += q.dequeue()
  }
  num
}
This code is fast and easy to understand. Its main drawback is that the queue that is passed in is modified by `toNum`, so you have to make a copy of it beforehand. That's the kind of object management that immutability frees you from.
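A short usage sketch of that drawback (REPL-style; `clone()` gives the throwaway copy):

val q = scala.collection.mutable.Queue(2, 1, 3)
println(toNum(q.clone())) // 213 -- pass a copy to protect the original
println(toNum(q))         // 213 -- but this call drains q...
println(q.isEmpty)        // true: the caller's queue is now empty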
Now, let's convert it to an `immutable.Queue`:
import scala.annotation.tailrec

def toNum(q: scala.collection.immutable.Queue[Int]) = {
  @tailrec
  def recurse(qr: scala.collection.immutable.Queue[Int], num: Int): Int = {
    if (qr.isEmpty)
      num
    else {
      val (digit, newQ) = qr.dequeue
      recurse(newQ, num * 10 + digit)
    }
  }
  recurse(q, 0)
}
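Used the same way, but the caller's queue now survives (REPL-style sketch):

val q = scala.collection.immutable.Queue(2, 1, 3)
println(toNum(q)) // 213
println(q)        // Queue(2, 1, 3) -- untouched; dequeue returned fresh queues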
Because I can't reuse some variable to keep track of my `num`, like in the previous example, I need to resort to recursion. In this case, it is a tail recursion, which has pretty good performance. But that is not always the case: sometimes there is just no good (readable, simple) tail-recursive solution.

Note, however, that I can rewrite that code to use an `immutable.Queue` and a `var` at the same time! For example:
def toNum(q: scala.collection.immutable.Queue[Int]) = {
  var qr = q
  var num = 0
  while (!qr.isEmpty) {
    val (digit, newQ) = qr.dequeue
    num *= 10
    num += digit
    qr = newQ
  }
  num
}
This code is still efficient, does not require recursion, and you don't need to worry about whether you have to make a copy of your queue before calling `toNum`. Naturally, I avoided reusing variables for other purposes, and no code outside this function sees them, so I don't need to worry about their values changing from one line to the next -- except when I explicitly do so.
Scala opted to let the programmer do that, if the programmer deemed it to be the best solution. Other languages have chosen to make such code difficult. The price Scala (and any language with widespread mutability) pays is that the compiler doesn't have as much leeway in optimizing the code as it could otherwise. Java's answer to that is optimizing the code based on the run-time profile. We could go on and on about pros and cons to each side.
Personally, I think Scala strikes the right balance, for now. It is not perfect, by far. I think both Clojure and Haskell have very interesting notions not adopted by Scala, but Scala has its own strengths as well. We'll see what comes up in the future.
Best Answer
reduce vs foldLeft
A big big difference, not mentioned clearly in any other stackoverflow answer relating to this topic, is that `reduce` should be given a commutative monoid, i.e. an operation that is both commutative and associative. This means the operation can be parallelized.

This distinction is very important for Big Data / MPP / distributed computing, and the entire reason why `reduce` even exists. The collection can be chopped up and the `reduce` can operate on each chunk, then the `reduce` can operate on the results of each chunk - in fact the chunking need not stop one level deep: we could chop up each chunk too. This is why summing integers in a list is O(log N) if given an infinite number of CPUs.
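Here is a minimal sketch of that chunking idea in plain Scala -- sequential here, but each per-chunk `reduce` could run on its own CPU precisely because `+` is associative and commutative:

val xs = (1 to 100).toList

val chunks  = xs.grouped(25).toList        // chop the collection up
val partial = chunks.map(_.reduce(_ + _))  // reduce each chunk (parallelizable)
val total   = partial.reduce(_ + _)        // reduce the per-chunk results

assert(total == xs.sum) // same answer as a single sequential pass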
If you just look at the signatures there is no reason for `reduce` to exist, because you can achieve everything you can with `reduce` with a `foldLeft`. The functionality of `foldLeft` is greater than the functionality of `reduce`.
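The signatures make this concrete: `foldLeft` takes a seed whose type may differ from the element type, so it subsumes `reduce` on non-empty collections. A sketch (the helper `reduceViaFoldLeft` is made up for illustration):

// Simplified shapes of the real collection methods:
//   def foldLeft[B](z: B)(op: (B, A) => B): B   -- accumulator B may differ from A
//   def reduce[B >: A](op: (B, B) => B): B      -- stays within the element type

def reduceViaFoldLeft[A](as: List[A])(op: (A, A) => A): A =
  as.tail.foldLeft(as.head)(op) // reduce expressed via foldLeft (non-empty input)

// The reverse is not possible: building a String from Ints
// needs foldLeft's extra type parameter.
List(1, 2, 3).foldLeft("")((acc, i) => acc + i) // "123"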
But you cannot parallelize a `foldLeft`, so its runtime is always O(N) (even if you feed in a commutative monoid). This is because it's assumed the operation is not a commutative monoid, and so the cumulated value will be computed by a series of sequential aggregations.

`foldLeft` does not assume commutativity nor associativity. It's associativity that gives the ability to chop up the collection, and it's commutativity that makes cumulating easy, because order is not important (so it doesn't matter in which order the results from each of the chunks are aggregated). Strictly speaking, commutativity is not necessary for parallelization (consider distributed sorting algorithms, for example); it just makes the logic easier because you don't need to give your chunks an ordering.

If you have a look at the Spark documentation for `reduce` it specifically says "... commutative and associative binary operator":

http://spark.apache.org/docs/1.0.0/api/scala/index.html#org.apache.spark.rdd.RDD
Here is proof that `reduce` is NOT just a special case of `foldLeft`:
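The original answer demonstrated this with a timed benchmark, which isn't reproduced here; below is a minimal sketch along the same lines. Assumptions: on Scala 2.13+ the `.par` conversion needs the separate scala-parallel-collections module, and actual timings vary by machine:

import scala.collection.parallel.CollectionConverters._ // provides .par on 2.13+

def time[A](label: String)(body: => A): A = {
  val start  = System.nanoTime()
  val result = body
  println(f"$label took ${(System.nanoTime() - start) / 1e6}%.2f ms")
  result
}

val xs = (1 to 1000000).toVector

// reduce can work chunk-wise in parallel: valid only because + is
// associative (and commutative, so chunk results combine in any order).
time("parallel reduce")(xs.par.reduce(_ + _))

// foldLeft is inherently sequential; there is no parallel version to call.
time("sequential foldLeft")(xs.foldLeft(0)(_ + _))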
reduce vs fold
Now this is where it gets a little closer to the FP / mathematical roots, and a little trickier to explain. Reduce is defined formally as part of the MapReduce paradigm, which deals with orderless collections (multisets); fold is formally defined in terms of recursion (see catamorphism) and thus assumes a structure / sequence to the collections.
There is no `fold` method in Scalding because under the (strict) Map Reduce programming model we cannot define `fold`: chunks do not have an ordering, and `fold` only requires associativity, not commutativity.

Put simply, `reduce` works without an order of cumulation; `fold` requires an order of cumulation, and it is that order of cumulation that distinguishes them, not the mere existence of a zero value. Strictly speaking `reduce` should work on an empty collection, because its zero value can be deduced by taking an arbitrary value `x` and then solving `x op y = x` for `y`, but that doesn't work with a non-commutative operation, as there can exist a left and a right zero value that are distinct (i.e. `x op y != y op x`). Of course Scala doesn't bother to work out what this zero value is, as that would require doing some mathematics (which is probably uncomputable), so it just throws an exception.
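You can see that exception directly (REPL-style sketch):

val empty = List.empty[Int]

empty.fold(0)(_ + _) // 0 -- the caller supplies the zero value
empty.reduce(_ + _)  // throws java.lang.UnsupportedOperationException: empty.reduceLeft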
It seems (as is often the case in etymology) that this original mathematical meaning has been lost, since the only obvious difference in programming is the signature. The result is that `reduce` has become a synonym for `fold`, rather than preserving its original meaning from MapReduce. Now these terms are often used interchangeably and behave the same in most implementations (ignoring empty collections). Weirdness is exacerbated by peculiarities, like in Spark, that we shall now address.

So Spark does have a `fold`, but the order in which sub-results (one for each partition) are combined is (at the time of writing) the same order in which tasks are completed - and thus non-deterministic. Thanks to @CafeFeed for pointing out that `fold` uses `runJob`; after reading through the code I realised that it's non-deterministic. Further confusion is created by Spark having a `treeReduce` but no `treeFold`.
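A minimal sketch of that Spark behaviour, assuming a local Spark setup; string concatenation is associative but not commutative, so the combined result depends on the order in which partition results arrive:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("fold-order").setMaster("local[4]"))

// Four partitions, one letter each.
val rdd = sc.parallelize(Seq("a", "b", "c", "d"), numSlices = 4)

// Partition results are folded together in task-completion order, so this
// may print "abcd" on one run and, e.g., "cdab" on another.
println(rdd.fold("")(_ + _))

sc.stop()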
.Conclusion
There is a difference between `reduce` and `fold` even when applied to non-empty sequences. The former is defined as part of the MapReduce programming paradigm on collections with arbitrary order (http://theory.stanford.edu/~sergei/papers/soda10-mrc.pdf), and one ought to assume operators are commutative, in addition to being associative, to get deterministic results. The latter is defined in terms of catamorphisms and requires that the collections have a notion of sequence (or are defined recursively, like linked lists), and thus does not require commutative operators.

In practice, due to the unmathematical nature of programming, `reduce` and `fold` tend to behave in the same way, either correctly (like in Scala) or incorrectly (like in Spark).

Extra: My Opinion On the Spark API
My opinion is that confusion would be avoided if use of the term `fold` were completely dropped in Spark. At least Spark does have a note in their documentation: