How to add a row id in PySpark dataframes [duplicate]

You can also use a function from the `sql.functions` package. It generates a unique 64-bit id, but the ids are not sequential, since they depend on the number of partitions. I believe it is available in Spark 1.5+.

from pyspark.sql.functions import monotonicallyIncreasingId

# This will return a new DF with all the columns + id
res = df.withColumn("id", monotonicallyIncreasingId())

Edit: 19/1/2017

As commented by @Sean

Use monotonically_increasing_id() instead, from Spark 1.6 onward.
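To see why the generated ids are unique but not sequential: per the Spark docs, the id packs the partition id into the upper 31 bits and the record number within the partition into the lower 33 bits. A minimal pure-Python sketch of that documented bit layout (no Spark required; `monotonic_id` is a hypothetical helper, not a Spark API):

```python
def monotonic_id(partition_id: int, row_in_partition: int) -> int:
    """Sketch of the documented bit layout used by
    monotonically_increasing_id(): upper 31 bits hold the
    partition id, lower 33 bits the row's offset in that partition."""
    return (partition_id << 33) | row_in_partition

# Rows within one partition get consecutive ids...
print(monotonic_id(0, 0), monotonic_id(0, 1))  # 0 1
# ...but the next partition starts at 2**33, leaving a large gap.
print(monotonic_id(1, 0))  # 8589934592
```

So ids are monotonically increasing and unique across the dataframe, but each partition starts at its own offset, which is why you see jumps rather than 0, 1, 2, ...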
