creating spark data structure from multiline record

PySpark since version 1.1 supports Hadoop Input Formats.You can use textinputformat.record.delimiter option to use a custom format delimiter as below

from operator import itemgetter

retrosheet = sc.newAPIHadoopFile(
    '/path/to/retrosheet/file',
    'org.apache.hadoop.mapreduce.lib.input.TextInputFormat',
    'org.apache.hadoop.io.LongWritable',
    'org.apache.hadoop.io.Text',
    conf={'textinputformat.record.delimiter': '\nid,'}
)
(retrosheet
    .filter(itemgetter(1))
    .values()
    .filter(lambda x: x)
    .map(lambda v: (
        v if v.startswith('id') else 'id,{0}'.format(v)).splitlines()))

Since Spark 2.4 you can also read data into DataFrame using text reader

spark.read.option("lineSep", '\nid,').text('/path/to/retrosheet/file')

Leave a Comment