Reading a JSON file in PySpark

First of all, the JSON is invalid: a comma is missing after the header object.

That being said, let's take this JSON instead (kept on a single line, since Spark's JSON reader expects one complete document per line):

{"header":{"platform":"atm","version":"2.0"},"details":[{"abc":"3","def":"4"},{"abc":"5","def":"6"},{"abc":"7","def":"8"}]}

This can be processed by:

>>> df = sqlContext.jsonFile('test.json')
>>> df.first()
Row(details=[Row(abc='3', def='4'), Row(abc='5', def='6'), Row(abc='7', def='8')], header=Row(platform='atm', version='2.0'))
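
If the nested structure is not obvious from first(), you can also inspect the inferred schema directly; the output should look roughly like this:

>>> df.printSchema()
root
 |-- details: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- abc: string (nullable = true)
 |    |    |-- def: string (nullable = true)
 |-- header: struct (nullable = true)
 |    |-- platform: string (nullable = true)
 |    |-- version: string (nullable = true)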

>>> df = df.flatMap(lambda row: row['details'])
>>> df
PythonRDD[38] at RDD at PythonRDD.scala:43

>>> df.collect()
[Row(abc='3', def='4'), Row(abc='5', def='6'), Row(abc='7', def='8')]

>>> df.map(lambda entry: (int(entry['abc']), int(entry['def']))).collect()
[(3, 4), (5, 6), (7, 8)]
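
As an aside, sqlContext.jsonFile and the DataFrame map/flatMap methods are gone in Spark 2.x; there you would use spark.read.json and explode instead. A minimal sketch of the same steps against the newer API, assuming the same test.json, would be:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode

spark = SparkSession.builder.getOrCreate()

# Like jsonFile, spark.read.json expects one JSON document per line
df = spark.read.json('test.json')

# explode() yields one row per element of the details array
details = df.select(explode('details').alias('d'))

# Pull the struct fields out and cast them to integers
pairs = details.select(
    col('d.abc').cast('int').alias('abc'),
    col('d.def').cast('int').alias('def'),
)

print(pairs.collect())
# [Row(abc=3, def=4), Row(abc=5, def=6), Row(abc=7, def=8)]

The plain tuples from the last step can still be recovered by dropping to the RDD API, e.g. pairs.rdd.map(tuple).collect().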

Hope this helps!
