How to add a row id in PySpark dataframes [duplicate]

You can also use a function from the `sql.functions` package. It generates a unique 64-bit id, but the ids are not sequential, since they depend on the number of partitions. I believe it is available in Spark 1.5+.

from pyspark.sql.functions import monotonicallyIncreasingId

# This will return a new DF with all the columns + id
res = df.withColumn("id", monotonicallyIncreasingId())

Edit: 19/1/2017

As commented by @Sean

Use monotonically_increasing_id() instead, from Spark 1.6 onward.
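To see why the generated ids are unique but not sequential: per the Spark docs, the id packs the partition id into the upper 31 bits and the record number within the partition into the lower 33 bits. A minimal pure-Python sketch of that documented bit layout (no Spark required; `monotonic_id` is a hypothetical helper, not a Spark API):

```python
def monotonic_id(partition_id: int, row_in_partition: int) -> int:
    """Sketch of the documented bit layout used by
    monotonically_increasing_id(): upper 31 bits hold the
    partition id, lower 33 bits the row's offset in that partition."""
    return (partition_id << 33) | row_in_partition

# Rows within one partition get consecutive ids...
print(monotonic_id(0, 0), monotonic_id(0, 1))  # 0 1
# ...but the next partition starts at 2**33, leaving a large gap.
print(monotonic_id(1, 0))  # 8589934592
```

So ids are monotonically increasing and unique across the dataframe, but each partition starts at its own offset, which is why you see jumps rather than 0, 1, 2, ...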
