I know how to find a file's size in Scala, but how do I find the size of an RDD/DataFrame in Spark?
Scala:
object Main extends App {
  // Note: java.io.File only understands local paths, and .toString() here
  // made `file` a String, so file.length was the length of the path string.
  val file = new java.io.File("hdfs://localhost:9000/samplefile.txt")
  println(file.length) // size in bytes (for a local file)
}
Spark:
val distFile = sc.textFile(file)
println(distFile.length)
But this does not compile, since an RDD has no `length` member, and once I process the file I can no longer get its size. How do I find the size of an RDD?
Best Answer
If you are simply looking to count the number of rows in the RDD, do:

distFile.count()

If you are interested in the size in bytes, you can use the SizeEstimator:
https://spark.apache.org/docs/latest/api/java/org/apache/spark/util/SizeEstimator.html