Yes, it is possible, but the details will differ depending on the approach you take.
- If the files are small, as you've mentioned, the simplest solution is to load your data using `SparkContext.wholeTextFiles`. It loads the data as an `RDD[(String, String)]`, where the first element is the path and the second is the file content. Then you parse each file individually, just as you would in local mode.
- For larger files you can use Hadoop input formats.
  - If the structure is simple you can split records using `textinputformat.record.delimiter`. You can find a simple example here. The input there is not XML, but it should give you an idea of how to proceed.
  - Otherwise Mahout provides `XmlInputFormat`.
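A minimal sketch of the delimiter trick, under the assumption that every record ends with a literal `</record>` closing tag (the sample data, path, and tag name here are made up):

```scala
import java.nio.file.Files
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.{SparkConf, SparkContext}

val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("delimiter-demo"))

// Hypothetical input: two <record> elements in one file.
val path = Files.createTempFile("records", ".xml")
Files.write(path,
  "<record><id>1</id></record><record><id>2</id></record>".getBytes)

// Tell the Hadoop text input format to split on the closing tag
// instead of on newlines; each value then holds one record.
val conf = new Configuration(sc.hadoopConfiguration)
conf.set("textinputformat.record.delimiter", "</record>")

val records = sc
  .newAPIHadoopFile(path.toString, classOf[TextInputFormat],
    classOf[LongWritable], classOf[Text], conf)
  // convert to String right away: Hadoop reuses the Text objects
  .map { case (_, text) => text.toString.trim }
  .filter(_.nonEmpty)
  .collect()
```

Note that the delimiter itself is consumed, so you may need to append the closing tag back before handing each chunk to an XML parser.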
- Finally, it is possible to read the file using `SparkContext.textFile` and adjust later for records spanning between partitions. Conceptually it means something similar to creating a sliding window, or partitioning records into groups of fixed size:
  - use `mapPartitionsWithIndex` to identify records broken between partitions and collect the broken pieces
  - use a second `mapPartitionsWithIndex` to repair the broken records
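The two-pass repair idea can be sketched on toy data (everything here is an assumption: records are plain lines starting with a `BEGIN` marker rather than XML, and every partition is assumed to contain at least one record start):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("repair-demo"))

// Toy stand-in for SparkContext.textFile: lines where "BEGIN" starts
// a record; with 3 partitions, records are cut at partition boundaries.
val lines = sc.parallelize(Seq(
  "BEGIN", "a", "b",
  "BEGIN", "c",
  "BEGIN", "d", "e"), 3)

// Pass 1: per partition, collect the leading lines that belong to a
// record started in the previous partition.
val heads = lines.mapPartitionsWithIndex { (i, it) =>
  Iterator((i, it.toList.takeWhile(_ != "BEGIN")))
}.collect().toMap

// Pass 2: drop each partition's dangling head and glue the next
// partition's head onto this partition's last record.
val records = lines.mapPartitionsWithIndex { (i, it) =>
  val own   = it.toList.dropWhile(_ != "BEGIN")
  val glued = own ++ heads.getOrElse(i + 1, Nil)
  // Group lines into records: a new record starts at every "BEGIN".
  glued.foldLeft(List.empty[List[String]]) {
    case (acc, "BEGIN")      => List("BEGIN") :: acc
    case (cur :: rest, line) => (line :: cur) :: rest
    case (Nil, _)            => Nil // unreachable: dangling head was dropped
  }.reverseIterator.map(_.reverse.mkString(" "))
}.collect()
```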
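For completeness, the first (small files) option is the least code. A sketch using the JDK's built-in DOM parser for the per-file parsing step (the file names and XML shape are made up):

```scala
import java.io.ByteArrayInputStream
import java.nio.file.Files
import javax.xml.parsers.DocumentBuilderFactory
import org.apache.spark.{SparkConf, SparkContext}

val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("whole-text-files-demo"))

// Hypothetical input: a directory with one small XML document per file.
val dir = Files.createTempDirectory("xml-demo")
Files.write(dir.resolve("a.xml"), "<doc><id>1</id></doc>".getBytes)
Files.write(dir.resolve("b.xml"), "<doc><id>2</id></doc>".getBytes)

// RDD[(String, String)]: (file path, full file content)
val ids = sc.wholeTextFiles(dir.toString)
  .map { case (_, content) =>
    // parse each small document locally, as you would outside Spark
    val doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
      .parse(new ByteArrayInputStream(content.getBytes))
    doc.getElementsByTagName("id").item(0).getTextContent
  }
  .collect()
  .sorted
```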
Edit:

There is also the relatively new `spark-xml` package, which allows you to extract specific records by tag:
```scala
val df = sqlContext.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "foo")
  .load("bar.xml")
```