Scala – How to flatten tuples in Spark

apache-spark, rdd, scala

I'm looking to flatten an RDD of tuples (using a no-op map), but I'm getting a type error:

val fromTuples = sc.parallelize( List((1,"a"), (2, "b"), (3, "c")) )
val flattened = fromTuples.flatMap(x => x)
println(flattened.collect().mkString(", "))

Gives

error: type mismatch;

found : (Int, String)
required: TraversableOnce[?]

val flattened = fromTuples.flatMap(x => x)

The equivalent RDD of Lists or Arrays works fine, e.g.:

val fromList = sc.parallelize(List(List(1, 2), List(3, 4)))
val flattened = fromList.flatMap(x => x)
println(flattened.collect().mkString(", "))

Can Scala handle this? If not, why not?

Best Answer

Tuples aren't collections. Unlike Python, where a tuple is essentially just an immutable list, a tuple in Scala is more like a class (closer to a Python namedtuple). You can't "flatten" a tuple, because it's a heterogeneous group of fields: (1, "a") holds an Int and a String, and there is no single element type for a flattened result.
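If you know the fields and can agree on a common element type, you can flatten explicitly with a pattern match. A minimal sketch on plain Scala collections (RDD.flatMap works the same way for this purpose); converting the Int to a String is just one illustrative choice of common type:

```scala
val fromTuples = List((1, "a"), (2, "b"), (3, "c"))

// Pattern-match each tuple and pick the common element type yourself.
// Here everything becomes a String, so the result is a typed List[String].
val flattened: List[String] =
  fromTuples.flatMap { case (i, s) => List(i.toString, s) }

println(flattened) // List(1, a, 2, b, 3, c)
```

This keeps full compile-time type safety, at the cost of naming the fields and the target type explicitly.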

You can convert a tuple to something iterable by calling .productIterator on it, but what you get back is an Iterable[Any]. You can certainly flatten such a thing, but you've lost all compile-time type protection that way. (Most Scala programmers shudder at the thought of a collection of type Any.)
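A short sketch of what that looks like, again on plain Scala collections for brevity:

```scala
val pair = (1, "a")

// productIterator iterates over a tuple's fields, but only as Iterator[Any].
val fields: Iterator[Any] = pair.productIterator

// Flattening this way compiles, but the element type degrades to Any.
val flattenedAny: List[Any] =
  List((1, "a"), (2, "b")).flatMap(_.productIterator)

println(flattenedAny) // List(1, a, 2, b)
```

The values are all there, but every downstream use now needs a cast or a runtime pattern match to recover the original types.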
