Why Big Data Needs To Be Functional

algorithms, big data, scala

I recently started working on a new project related to Big Data for my internship.
My managers recommended that I start learning functional programming (they highly recommended Scala).
I have had a modest experience using F#, but I couldn't see the importance of using this programming paradigm, as it seems expensive in some cases.

Dean gave an interesting talk on this topic and shared his thoughts on why Big Data needs to be functional here: http://www.youtube.com/watch?v=DFAdLCqDbLQ
But it wasn't entirely convincing, since Big Data doesn't mean only Hadoop.

Since Big Data is a very vague concept, I set it aside for a while and tried to come up with one simple example comparing the different ways we can deal with data, to see whether the functional way is expensive or not.
If functional programming is expensive and memory-consuming for small data, why do we need it for Big Data?

Far from any fancy tools, I tried to build a solution to one specific and popular problem using three approaches: an imperative way and two functional ways (recursion, and using collections). I then compared running time and complexity across the three approaches.

I used Scala to write these functions, as it is the best tool for writing one algorithm in all three paradigms:

def main(args: Array[String]) {
    val start = System.currentTimeMillis()
    // Functional way, using a lazy Stream of Fibonacci numbers
    val s = Fibonacci_P(400000000)
    val end = System.currentTimeMillis()
    println("Functional way: \n the sum of the Fibonacci numbers that do not exceed 400000000 : %d \n Time : %d ".format(s, end - start))

    val start2 = System.currentTimeMillis()
    // Imperative way, using a while loop and mutable variables
    val s2 = Fibonacci_I(400000000)
    val end2 = System.currentTimeMillis()
    println("Imperative way: \n the sum of the Fibonacci numbers that do not exceed 400000000 : %d \n Time : %d ".format(s2, end2 - start2))
}

Functional way :

def Fibonacci_P(max: BigInt): BigInt = {
    // http://www.scala-lang.org/api/current/index.html#scala.collection.immutable.Stream
    // lazy val Fibonaccis: Stream[Long] = 0 #:: 1 #:: Fibonaccis.zip(Fibonaccis.tail).map { case (a, b) => a + b }
    // Lazily evaluated (and memoized) stream of Fibonacci numbers:
    // each element is the sum of the two elements before it.
    lazy val fibs: Stream[BigInt] = BigInt(0) #:: BigInt(1) #:: fibs.zip(fibs.tail).map {
        n => n._1 + n._2
    }
    // println(fibs.takeWhile(p => p < max).toList)
    // Take the Fibonacci numbers below max and sum them.
    fibs.takeWhile(p => p < max).foldLeft(BigInt(0))(_ + _)
}

Recursive way:

def Fibonacci_R(n: Int): BigInt = n match {
    case 1 | 2 => 1
    case _ => Fibonacci_R(n - 1) + Fibonacci_R(n - 2)
}

Imperative way:

def Fibonacci_I(max: BigInt): BigInt = {
    var first_element: BigInt = 0
    var second_element: BigInt = 1
    var sum: BigInt = 0

    while (second_element < max) {
        sum += second_element

        second_element = first_element + second_element
        first_element = second_element - first_element
    }

    //Return 
    sum
}

I noticed that functional programming is heavy: it takes longer and consumes more memory.
I am confused; whenever I read an article or watch a talk, people say that we should use functional programming in data science.
True, it's easier and more productive, especially in the data world, but it takes more time and more memory.

So, why do we need to use functional programming for Big Data?
What are the best practices for using functional programming (Scala) with Big Data?

Best Answer

Here's how I see it:

  • Let's ignore the words "big data" for a while, as they are a pretty vague notion.

  • You mentioned Hadoop. Hadoop does two things. First, it gives you a sort of "virtual" drive, distributed over multiple machines with redundancy, that can be accessed via Hadoop's API as if it were a single, unitary drive; it's called HDFS, as in Hadoop Distributed File System. Second, it lets you execute Map-Reduce jobs (it's a framework for Map-Reduce). If we check out MapReduce's Wikipedia page, we see that:

MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster.

...

A MapReduce program is composed of a Map() procedure that performs filtering and sorting (such as sorting students by first name into queues, one queue for each name) and a Reduce() procedure that performs a summary operation (such as counting the number of students in each queue, yielding name frequencies)

...

'MapReduce' is a framework for processing parallelizable problems across huge datasets using a large number of computers

Also on this page, Hadoop is described as

Hadoop, Apache's free and open source implementation of MapReduce.
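
To make the quoted Map() / Reduce() description concrete, here is a tiny, purely local Scala sketch of the same idea, using the students-by-first-name example from the quote. The Student case class and the sample data are made up for illustration; this is not the Hadoop API, just ordinary Scala collections.

case class Student(firstName: String, lastName: String)

object LocalMapReduce {
    def main(args: Array[String]): Unit = {
        val students = List(
            Student("Alice", "Smith"),
            Student("Bob", "Jones"),
            Student("Alice", "Brown"))

        // "Map" phase: emit one (key, value) pair per student, keyed by first name.
        val mapped: List[(String, Int)] = students.map(s => (s.firstName, 1))

        // Shuffle/sort phase: group all pairs sharing a key into one "queue".
        // "Reduce" phase: summarise each queue, here by summing the ones (i.e. counting).
        val nameFrequencies: Map[String, Int] =
            mapped.groupBy(_._1).map { case (name, pairs) => (name, pairs.map(_._2).sum) }

        println(nameFrequencies) // e.g. Map(Alice -> 2, Bob -> 1)
    }
}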

Now, Hadoop is written in Java, which is not a functional language. Also, if we look at Hadoop's page, we find an example of how to create a MapReduce job in Java and deploy it to a Hadoop cluster.

Here's a Java example of a Fibonacci MapReduce job for Hadoop.

I hope this answers your question, namely that Big Data, and in particular a Fibonacci-computing MapReduce job, doesn't "need" to be functional; that is, you can implement it in OO languages if you want to (for example).

Of course, that doesn't mean Big Data "needs" to be OO-only either. You can very well use a functional language to implement a MapReduce-like job. You can, for example, use Scala with Hadoop if you want to, via Scalding; a sketch of what that looks like follows.
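
For instance, a word-count job in Scalding looks roughly like the following. This is a sketch based on Scalding's well-known word-count example (fields-based API); the input and output paths come from the job's command-line arguments, and the exact API may differ between Scalding versions.

import com.twitter.scalding._

// Reads lines of text, splits them into words, counts how often each word
// occurs, and writes (word, count) pairs as tab-separated values.
class WordCountJob(args: Args) extends Job(args) {
    TextLine(args("input"))
        .flatMap('line -> 'word) { line: String => line.toLowerCase.split("\\s+") }
        .groupBy('word) { _.size }
        .write(Tsv(args("output")))
}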

Other points I think are worth mentioning:

When doing recursion in Scala, if your code allows for it, Scala will do tail-call-optimization. However, since the JVM doesn't (yet) support tail-call-optimization, Scala achieves this by replacing, at compile time, your recursive calls with code equivalent to loops, as explained here. What this basically means is that doing recursive vs non-recursive code benchmarks using Scala is pointless, as they both end up doing the same thing at run time.
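
For example, here is a tail-recursive version of the imperative sum above; Fibonacci_TR is not one of the functions from the question, just a sketch. The @tailrec annotation asks the compiler to fail the build if it cannot rewrite the recursion into a loop, so at run time this behaves like Fibonacci_I. (Fibonacci_R above, which makes two recursive calls, is not tail recursive and so does not get this rewrite.)

import scala.annotation.tailrec

// Sums the Fibonacci numbers strictly below max, like Fibonacci_I,
// but written as a tail-recursive inner function.
def Fibonacci_TR(max: BigInt): BigInt = {
    @tailrec
    def loop(first: BigInt, second: BigInt, sum: BigInt): BigInt =
        if (second >= max) sum
        else loop(second, first + second, sum + second)

    loop(0, 1, 0)
}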
