XML processing in Spark

Yes, it is possible, but the details will differ depending on the approach you take.

  • If files are small, as you’ve mentioned, the simplest solution is to load your data using SparkContext.wholeTextFiles. It loads data as RDD[(String, String)], where the first element is the path and the second is the file content. Then you parse each file individually, as you would in local mode (see the first sketch after this list).
  • For larger files you can use Hadoop input formats.
    • If the structure is simple, you can split records using textinputformat.record.delimiter. You can find a simple example here. The input there is not XML, but it should give you an idea of how to proceed (second sketch below).
    • Otherwise Mahout provides XmlInputFormat (third sketch below).
  • Finally, it is possible to read the file using SparkContext.textFile and adjust later for records spanning between partitions. Conceptually it means something similar to creating a sliding window or partitioning records into groups of a fixed size:

    • use a first mapPartitionsWithIndex to identify records broken between partitions and collect the broken fragments
    • use a second mapPartitionsWithIndex to repair the broken records (last sketch below)
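
For the wholeTextFiles route, a minimal sketch (the path, the record element and the id attribute are made-up placeholders; use whatever local XML parser you like):

import scala.xml.XML

val parsed = sc.wholeTextFiles("/data/xml/*.xml")  // hypothetical path
  .map { case (path, content) =>
    // the whole file fits in memory, so plain scala-xml parsing works
    val root = XML.loadString(content)
    (path, (root \\ "record").map(node => (node \ "@id").text))
  }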
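
For the textinputformat.record.delimiter approach, something along these lines, assuming each record is terminated by a closing </record> tag (file name and tags are illustrative):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

val conf = new Configuration(sc.hadoopConfiguration)
// split the input on the closing tag instead of newlines
conf.set("textinputformat.record.delimiter", "</record>")

val records = sc
  .newAPIHadoopFile("records.xml", classOf[TextInputFormat],
    classOf[LongWritable], classOf[Text], conf)
  .map { case (_, text) => text.toString }
  .filter(_.contains("<record"))  // drop whatever trails the last record
  .map(s => s.substring(s.indexOf("<record")) + "</record>")  // strip any header, restore the stripped delimiter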
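
With Mahout's XmlInputFormat it could look roughly like this; note that the package path has moved between Mahout versions, so check the one in your build (tags are again placeholders):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.mahout.text.wikipedia.XmlInputFormat  // location varies by Mahout version

val conf = new Configuration(sc.hadoopConfiguration)
conf.set(XmlInputFormat.START_TAG_KEY, "<record>")
conf.set(XmlInputFormat.END_TAG_KEY, "</record>")

val xmlRecords = sc
  .newAPIHadoopFile("records.xml", classOf[XmlInputFormat],
    classOf[LongWritable], classOf[Text], conf)
  .map { case (_, text) => text.toString }  // one complete element per record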
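
And a rough sketch of the two-pass repair for the textFile variant, assuming each record starts on its own line with a <record> tag (a placeholder) and that no record spans more than two partitions:

val lines = sc.textFile("records.xml")
def isStart(line: String) = line.trim.startsWith("<record")

// pass 1: per partition, collect the leading lines that belong to a
// record started in the previous partition
val heads = lines
  .mapPartitionsWithIndex { (i, it) =>
    Iterator((i, it.takeWhile(l => !isStart(l)).toList))
  }
  .collectAsMap()

// pass 2: drop each partition's own leading fragment and glue on the
// fragment collected from the next partition
val repaired = lines.mapPartitionsWithIndex { (i, it) =>
  val tail = heads.getOrElse(i + 1, Nil)
  it.dropWhile(l => !isStart(l)) ++ tail.iterator
}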

Edit:

There is also the relatively new spark-xml package, which allows you to extract specific records by tag:

val df = sqlContext.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "foo")
  .load("bar.xml")
