Scala – Converting String RDD to Int RDD

apache-spark scala

I am new to Scala. When processing large datasets with Scala in Spark, is it possible to read the data as an RDD of Int instead of an RDD of String?

I tried the below:

val intArr = sc
              .textFile("Downloads/data/train.csv")
              .map(line=>line.split(","))
              .map(_.toInt)

But I am getting the error:

error: value toInt is not a member of Array[String]

I need to convert to an Int RDD because further down the line I need to do the following:

val vectors = intArr.map(p => Vectors.dense(p))

which requires a numeric type.

Any help is truly appreciated. Thanks in advance.

Best Answer

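As far as I understand, each line should produce one vector, so it should go like this:

val result = sc
           .textFile("Downloads/data/train.csv")
           .map(line => line.split(","))
           .map(numbers => Vectors.dense(numbers.map(_.toDouble)))

split returns an Array[String] for each line, which is why calling toInt on it directly fails: toInt is defined on String, not on Array[String]. numbers.map(_.toDouble) converts every element of that array, so its result type is Array[Double], which is what Vectors.dense expects. numbers.map(_.toInt) works the same way and gives an Array[Int], but Vectors.dense does not accept an Array[Int], so use toDouble when the goal is a vector.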
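To see the types involved without starting Spark, here is a minimal sketch that uses a plain Scala Seq in place of the RDD. That substitution is an assumption made only so the snippet runs on its own; map and flatMap on an RDD have the same shape. The object name CsvToNumbers and the sample rows are hypothetical and do not come from train.csv.

```scala
// Plain Scala collections stand in for the RDD here (an assumption so the
// sketch runs without Spark); Seq.map and Seq.flatMap mirror the RDD methods.
object CsvToNumbers {
  // Hypothetical sample rows standing in for the lines of the CSV file.
  val lines: Seq[String] = Seq("1,2,3", "4,5,6")

  // One Array[Int] per line: split yields Array[String],
  // and the inner map converts each field to Int.
  val rows: Seq[Array[Int]] =
    lines.map(line => line.split(",").map(_.trim.toInt))

  // One flat sequence of Int: flatMap concatenates the per-line arrays.
  val flat: Seq[Int] =
    lines.flatMap(line => line.split(",").map(_.trim.toInt))

  // One Array[Double] per line, the element type Vectors.dense accepts.
  val doubles: Seq[Array[Double]] =
    lines.map(line => line.split(",").map(_.trim.toDouble))

  def main(args: Array[String]): Unit = {
    println(rows.map(_.mkString("[", ", ", "]")).mkString(" "))
    println(flat.mkString(", "))
  }
}
```

With a real RDD, the map version gives an RDD[Array[Int]] (one array per line) and the flatMap version gives an RDD[Int] (one element per field), so pick whichever matches what the next step needs. Note that toInt and toDouble throw a NumberFormatException on a non-numeric field such as a header row, so such lines would need to be filtered out first.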
