Scala – How to compute the average from a Spark RDD

apache-spark, rdd, scala

I have a problem in Spark with Scala: I want to compute the per-key average from RDD data. I created an RDD like this,

[(2,110),(2,130),(2,120),(3,200),(3,206),(3,206),(4,150),(4,160),(4,170)]

I want to average the values for each key like this,

[(2,(110+130+120)/3),(3,(200+206+206)/3),(4,(150+160+170)/3)]

and then get the result like this,

   [(2,120),(3,204),(4,160)]

How can I do this in Scala with an RDD?
I am using Spark version 1.6.

Best Answer

You can use aggregateByKey to accumulate a (sum, count) pair per key and then divide.

val rdd = sc.parallelize(Seq((2,110),(2,130),(2,120),(3,200),(3,206),(3,206),(4,150),(4,160),(4,170)))
// accumulate a (sum, count) pair per key
val agg_rdd = rdd.aggregateByKey((0, 0))(
  (acc, value) => (acc._1 + value, acc._2 + 1),             // seqOp: fold one value into (sum, count)
  (acc1, acc2) => (acc1._1 + acc2._1, acc1._2 + acc2._2))   // combOp: merge partial (sum, count) pairs
// divide sum by count to get the average per key
val sum = agg_rdd.mapValues(x => x._1 / x._2)
sum.collect
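
Note that x._1 / x._2 is integer division, which happens to be exact for this sample data (120, 204, 160) but would truncate otherwise. If fractional averages are needed, here is a minimal standalone sketch of the same approach that divides as Double; it assumes a local master, and the AverageByKey object name is just for illustration (in spark-shell you can reuse the existing sc directly):

import org.apache.spark.{SparkConf, SparkContext}

object AverageByKey {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("AverageByKey").setMaster("local[*]"))

    val rdd = sc.parallelize(Seq((2,110),(2,130),(2,120),(3,200),(3,206),(3,206),(4,150),(4,160),(4,170)))

    val averages = rdd
      .aggregateByKey((0, 0))(
        (acc, v) => (acc._1 + v, acc._2 + 1),      // fold one value into (sum, count)
        (a, b)   => (a._1 + b._1, a._2 + b._2))    // merge partial (sum, count) pairs
      .mapValues { case (total, count) => total.toDouble / count }

    averages.collect().foreach(println)            // e.g. (2,120.0), (3,204.0), (4,160.0); key order may vary

    sc.stop()
  }
}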