Why do Scala, and frameworks like Spark and Scalding, have both `reduce` and `foldLeft`? So then what's the difference between `reduce` and `fold`?
Scala – Difference between reduce and foldLeft/fold in functional programming (particularly Scala and Scala APIs)
Tags: fold, functional-programming, reduce, scala, scalding
Related Solutions
tl;dr
- `class C` defines a class, just as in Java or C++.
- `object O` creates a singleton object `O` as an instance of some anonymous class; it can be used to hold static members that are not associated with instances of some class.
- `object O extends T` makes the object `O` an instance of `trait T`; you can then pass `O` anywhere a `T` is expected.
- if there is a `class C`, then `object C` is the companion object of class `C`; note that the companion object is not automatically an instance of `C`.
Also see Scala documentation for object and class.
`object` as host of static members

Most often, you need an `object` to hold methods and values/variables that shall be available without having to first instantiate an instance of some class. This use is closely related to `static` members in Java.
object A {
  def twice(i: Int): Int = 2 * i
}
You can then call the above method using `A.twice(2)`.

If `twice` were a member of some class `A`, then you would need to make an instance first:
class A() {
  def twice(i: Int): Int = 2 * i
}

val a = new A()
a.twice(2)
You can see how redundant this is, as `twice` does not require any instance-specific data.

`object` as a special named instance

You can also use the `object` itself as some special instance of a class or trait. When you do this, your object needs to extend some `trait` in order to become an instance of a subclass of it.
Consider the following code:
object A extends B with C {
  ...
}
This declaration first declares an anonymous (inaccessible) class that extends both `B` and `C`, and instantiates a single instance of this class named `A`.

This means `A` can be passed to functions expecting objects of type `B` or `C`, or `B with C`.
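To make that concrete, here is a minimal sketch; the members of `B` and the function `process` are made up for illustration:

trait B { def name: String }
trait C

// A is the single, named instance of an anonymous class extending both traits.
object A extends B with C {
  val name = "A"
}

def process(b: B): Unit = println(b.name)

process(A) // accepted wherever a B (or a C, or a B with C) is expected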
Additional Features of `object`

There also exist some special features of objects in Scala. I recommend reading the official documentation.
- `def apply(...)` enables the usual method-name-less syntax of `A(...)`
- `def unapply(...)` allows you to create custom pattern-matching extractors
- if accompanying a class of the same name, the object assumes a special role when resolving implicit parameters
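A minimal sketch of the first two features (the `Point` class is made up for illustration; the object is also a companion, since it shares the class's name):

class Point(val x: Int, val y: Int)

object Point {
  // apply: lets callers write Point(1, 2) instead of new Point(1, 2)
  def apply(x: Int, y: Int): Point = new Point(x, y)

  // unapply: makes Point usable as a pattern-matching extractor
  def unapply(p: Point): Option[(Int, Int)] = Some((p.x, p.y))
}

Point(1, 2) match {
  case Point(x, y) => println(s"x=$x, y=$y") // prints: x=1, y=2
}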
As so many others have said, the object assigned to a `val` cannot be replaced, and the object assigned to a `var` can. However, said object can have its internal state modified. For example:
class A(n: Int) {
  var value = n
}

class B(n: Int) {
  val value = new A(n)
}

object Test {
  def main(args: Array[String]): Unit = {
    val x = new B(5)
    x = new B(6)       // Doesn't compile: x is a val, so the object assigned to it can't be replaced.
    x.value = new A(6) // Doesn't compile: B.value is a val, so it can't be replaced with a new object either.
    x.value.value = 6  // Works: A.value is a var, so it can receive a new value.
  }
}
So, even though we can't change the object assigned to `x`, we could change the state of that object. At the root of it, however, there was a `var`.
Now, immutability is a good thing for many reasons. First, if an object doesn't change internal state, you don't have to worry if some other part of your code is changing it. For example:
val x = new B(0)
f(x) // f is some function that takes a B
if (x.value.value == 0)
  println("f didn't do anything to x")
else
  println("f did something to x")
This becomes particularly important with multithreaded systems. In a multithreaded system, the following can happen:
val x = new B(1)
f(x) // f may hand x to another thread
if (x.value.value == 1) {
  print(x.value.value) // Can be different from 1!
}
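For instance, here is a minimal, self-contained sketch of that race, with a hand-rolled thread standing in for whatever `f` might do (the thread body is made up for illustration; Scala 2.12+ converts the lambda to a `Runnable`):

val x = new B(1)
val t = new Thread(() => x.value.value = 42) // plays the role of f handing x to another thread
t.start()
if (x.value.value == 1)
  println(x.value.value) // may print 1 or 42, depending on timing
t.join()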
If you use `val` exclusively, and only use immutable data structures (that is, avoid arrays, everything in `scala.collection.mutable`, etc.), you can rest assured this won't happen. That is, unless there's some code, perhaps even a framework, doing reflection tricks -- reflection can change "immutable" values, unfortunately.
That's one reason, but there is another reason for it. When you use `var`, you can be tempted into reusing the same `var` for multiple purposes. This has some problems:
- It will be more difficult for people reading the code to know what the value of a variable is in a certain part of the code.
- You may forget to re-initialize the variable in some code path, and end up passing wrong values downstream in the code.
Simply put, using `val` is safer and leads to more readable code.

We can, then, go the other direction. If `val` is so much better, why have `var` at all? Well, some languages did take that route, but there are situations in which mutability improves performance, a lot.
For example, take an immutable `Queue`. When you either `enqueue` or `dequeue` things in it, you get a new `Queue` object. How, then, would you go about processing all items in it?

I'll go through that with an example. Let's say you have a queue of digits, and you want to compose a number out of them. For example, if I have a queue with 2, 1, 3, in that order, I want to get back the number 213. Let's first solve it with a `mutable.Queue`:
def toNum(q: scala.collection.mutable.Queue[Int]) = {
  var num = 0
  while (!q.isEmpty) {
    num *= 10
    num += q.dequeue()
  }
  num
}
This code is fast and easy to understand. Its main drawback is that the queue that is passed in is modified by `toNum`, so you have to make a copy of it beforehand. That's the kind of object management that immutability frees you from.
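A short usage sketch of that drawback (REPL-style; `clone()` gives the throwaway copy):

val q = scala.collection.mutable.Queue(2, 1, 3)
println(toNum(q.clone())) // 213 -- pass a copy to protect the original
println(toNum(q))         // 213 -- but this call drains q...
println(q.isEmpty)        // true: the caller's queue is now empty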
Now, let's convert it to an `immutable.Queue`:
import scala.annotation.tailrec

def toNum(q: scala.collection.immutable.Queue[Int]) = {
  @tailrec
  def recurse(qr: scala.collection.immutable.Queue[Int], num: Int): Int = {
    if (qr.isEmpty)
      num
    else {
      val (digit, newQ) = qr.dequeue
      recurse(newQ, num * 10 + digit)
    }
  }
  recurse(q, 0)
}
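Used the same way, but the caller's queue now survives (REPL-style sketch):

val q = scala.collection.immutable.Queue(2, 1, 3)
println(toNum(q)) // 213
println(q)        // Queue(2, 1, 3) -- untouched; dequeue returned fresh queues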
Because I can't reuse some variable to keep track of my `num`, like in the previous example, I need to resort to recursion. In this case, it is a tail recursion, which has pretty good performance. But that is not always the case: sometimes there is just no good (readable, simple) tail-recursive solution.

Note, however, that I can rewrite that code to use an `immutable.Queue` and a `var` at the same time! For example:
def toNum(q: scala.collection.immutable.Queue[Int]) = {
  var qr = q
  var num = 0
  while (!qr.isEmpty) {
    val (digit, newQ) = qr.dequeue
    num *= 10
    num += digit
    qr = newQ
  }
  num
}
This code is still efficient, does not require recursion, and you don't need to worry about whether you have to make a copy of your queue before calling `toNum`. Naturally, I avoided reusing variables for other purposes, and no code outside this function sees them, so I don't need to worry about their values changing from one line to the next -- except when I explicitly do so.
Scala opted to let the programmer do that, if the programmer deemed it to be the best solution. Other languages have chosen to make such code difficult. The price Scala (and any language with widespread mutability) pays is that the compiler doesn't have as much leeway in optimizing the code as it could otherwise. Java's answer to that is optimizing the code based on the run-time profile. We could go on and on about pros and cons to each side.
Personally, I think Scala strikes the right balance, for now. It is not perfect, by far. I think both Clojure and Haskell have very interesting notions not adopted by Scala, but Scala has its own strengths as well. We'll see what comes up in the future.
Best Answer
reduce vs foldLeft
A big big difference, not mentioned clearly in any other stackoverflow answer relating to this topic, is that `reduce` should be given a commutative monoid, i.e. an operation that is both commutative and associative. This means the operation can be parallelized.

This distinction is very important for Big Data / MPP / distributed computing, and the entire reason why `reduce` even exists. The collection can be chopped up and the `reduce` can operate on each chunk, then the `reduce` can operate on the results of each chunk - in fact the chunking need not stop one level deep: we could chop up each chunk too. This is why summing integers in a list is O(log N) if given an infinite number of CPUs.
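Here is a minimal sketch of that chunking idea in plain Scala -- sequential here, but each per-chunk `reduce` could run on its own CPU precisely because `+` is associative and commutative:

val xs = (1 to 100).toList

val chunks  = xs.grouped(25).toList        // chop the collection up
val partial = chunks.map(_.reduce(_ + _))  // reduce each chunk (parallelizable)
val total   = partial.reduce(_ + _)        // reduce the per-chunk results

assert(total == xs.sum) // same answer as a single sequential pass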
If you just look at the signatures there is no reason for `reduce` to exist, because you can achieve everything you can with `reduce` with a `foldLeft`. The functionality of `foldLeft` is greater than the functionality of `reduce`.
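The signatures make this concrete: `foldLeft` takes a seed whose type may differ from the element type, so it subsumes `reduce` on non-empty collections. A sketch (the helper `reduceViaFoldLeft` is made up for illustration):

// Simplified shapes of the real collection methods:
//   def foldLeft[B](z: B)(op: (B, A) => B): B   -- accumulator B may differ from A
//   def reduce[B >: A](op: (B, B) => B): B      -- stays within the element type

def reduceViaFoldLeft[A](as: List[A])(op: (A, A) => A): A =
  as.tail.foldLeft(as.head)(op) // reduce expressed via foldLeft (non-empty input)

// The reverse is not possible: building a String from Ints
// needs foldLeft's extra type parameter.
List(1, 2, 3).foldLeft("")((acc, i) => acc + i) // "123"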
But you cannot parallelize a `foldLeft`, so its runtime is always O(N) (even if you feed in a commutative monoid). This is because it's assumed the operation is not a commutative monoid, and so the cumulated value will be computed by a series of sequential aggregations.

`foldLeft` does not assume commutativity nor associativity. It's associativity that gives the ability to chop up the collection, and it's commutativity that makes cumulating easy, because order is not important (so it doesn't matter in which order the results from each of the chunks are aggregated). Strictly speaking, commutativity is not necessary for parallelization (consider distributed sorting algorithms, for example); it just makes the logic easier because you don't need to give your chunks an ordering.

If you have a look at the Spark documentation for `reduce` it specifically says "... commutative and associative binary operator":

http://spark.apache.org/docs/1.0.0/api/scala/index.html#org.apache.spark.rdd.RDD
Here is proof that `reduce` is NOT just a special case of `foldLeft`:
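The original answer demonstrated this with a timed benchmark, which isn't reproduced here; below is a minimal sketch along the same lines. Assumptions: on Scala 2.13+ the `.par` conversion needs the separate scala-parallel-collections module, and actual timings vary by machine:

import scala.collection.parallel.CollectionConverters._ // provides .par on 2.13+

def time[A](label: String)(body: => A): A = {
  val start  = System.nanoTime()
  val result = body
  println(f"$label took ${(System.nanoTime() - start) / 1e6}%.2f ms")
  result
}

val xs = (1 to 1000000).toVector

// reduce can work chunk-wise in parallel: valid only because + is
// associative (and commutative, so chunk results combine in any order).
time("parallel reduce")(xs.par.reduce(_ + _))

// foldLeft is inherently sequential; there is no parallel version to call.
time("sequential foldLeft")(xs.foldLeft(0)(_ + _))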
reduce vs fold
Now this is where it gets a little closer to the FP / mathematical roots, and a little trickier to explain. Reduce is defined formally as part of the MapReduce paradigm, which deals with orderless collections (multisets); fold is formally defined in terms of recursion (see catamorphism) and thus assumes a structure / sequence to the collections.
There is no `fold` method in Scalding because under the (strict) Map Reduce programming model we cannot define `fold`: chunks do not have an ordering, and `fold` only requires associativity, not commutativity.

Put simply, `reduce` works without an order of cumulation; `fold` requires an order of cumulation, and it is that order of cumulation that distinguishes them, not the mere existence of a zero value. Strictly speaking `reduce` should work on an empty collection, because its zero value can be deduced by taking an arbitrary value `x` and then solving `x op y = x` for `y`, but that doesn't work with a non-commutative operation, as there can exist a left and a right zero value that are distinct (i.e. `x op y != y op x`). Of course Scala doesn't bother to work out what this zero value is, as that would require doing some mathematics (which is probably uncomputable), so it just throws an exception.
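You can see that exception directly (REPL-style sketch):

val empty = List.empty[Int]

empty.fold(0)(_ + _) // 0 -- the caller supplies the zero value
empty.reduce(_ + _)  // throws java.lang.UnsupportedOperationException: empty.reduceLeft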
It seems (as is often the case in etymology) that this original mathematical meaning has been lost, since the only obvious difference in programming is the signature. The result is that `reduce` has become a synonym for `fold`, rather than preserving its original meaning from MapReduce. Now these terms are often used interchangeably and behave the same in most implementations (ignoring empty collections). Weirdness is exacerbated by peculiarities, like in Spark, that we shall now address.

So Spark does have a `fold`, but the order in which sub-results (one for each partition) are combined is (at the time of writing) the same order in which tasks are completed - and thus non-deterministic. Thanks to @CafeFeed for pointing out that `fold` uses `runJob`; after reading through the code I realised that it's non-deterministic. Further confusion is created by Spark having a `treeReduce` but no `treeFold`.
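A minimal sketch of that Spark behaviour, assuming a local Spark setup; string concatenation is associative but not commutative, so the combined result depends on the order in which partition results arrive:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("fold-order").setMaster("local[4]"))

// Four partitions, one letter each.
val rdd = sc.parallelize(Seq("a", "b", "c", "d"), numSlices = 4)

// Partition results are folded together in task-completion order, so this
// may print "abcd" on one run and, e.g., "cdab" on another.
println(rdd.fold("")(_ + _))

sc.stop()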
.Conclusion
There is a difference between `reduce` and `fold` even when applied to non-empty sequences. The former is defined as part of the MapReduce programming paradigm on collections with arbitrary order (http://theory.stanford.edu/~sergei/papers/soda10-mrc.pdf), and one ought to assume operators are commutative, in addition to being associative, to get deterministic results. The latter is defined in terms of catamorphisms and requires that the collections have a notion of sequence (or are defined recursively, like linked lists), and thus does not require commutative operators.

In practice, due to the unmathematical nature of programming, `reduce` and `fold` tend to behave in the same way, either correctly (like in Scala) or incorrectly (like in Spark).

Extra: My Opinion On the Spark API
My opinion is that confusion would be avoided if use of the term `fold` were completely dropped in Spark. At least Spark does have a note in their documentation: