If you are simply looking to count the number of rows in the rdd
, do:
val distFile = sc.textFile(file)
println(distFile.count)
If you are interested in the bytes, you can use the SizeEstimator
:
import org.apache.spark.util.SizeEstimator
println(SizeEstimator.estimate(distFile))
https://spark.apache.org/docs/latest/api/java/org/apache/spark/util/SizeEstimator.html