Spark: monotonically_increasing_id not working as expected in a DataFrame?

It works as expected. This function is not intended for generating consecutive values. Instead, it encodes the partition number and the record's index within that partition:

The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive. The current implementation puts the partition ID in the upper 31 bits, and the record number within each partition in the lower 33 bits. The assumption is that the data frame has less than 1 billion partitions, and each partition has less than 8 billion records.

As an example, consider a DataFrame with two partitions, each with 3 records. This expression would return the following IDs:

0, 1, 2, 8589934592 (1L << 33), 8589934593, 8589934594.
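The bit layout described above can be reproduced with plain integer arithmetic. The following is a minimal sketch in pure Python, not Spark's own code; the helper names `encode_id` and `decode_id` are made up for illustration, and only the 31/33-bit split comes from the quoted description.

```python
from typing import Tuple

# Assumption taken from the description above: the lower 33 bits hold the
# record number within the partition, the bits above hold the partition ID.
RECORD_BITS = 33


def encode_id(partition_id: int, record_number: int) -> int:
    """Hypothetical helper: pack a partition ID and record number into one ID."""
    return (partition_id << RECORD_BITS) + record_number


def decode_id(generated_id: int) -> Tuple[int, int]:
    """Hypothetical helper: recover (partition_id, record_number) from an ID."""
    record_mask = (1 << RECORD_BITS) - 1
    return generated_id >> RECORD_BITS, generated_id & record_mask


# Two partitions with three records each, as in the example above.
ids = [encode_id(p, r) for p in range(2) for r in range(3)]
print(ids)
print(decode_id(8589934593))
```

Decoding an ID this way shows why the values jump: the first record of partition 1 starts at 1 << 33 regardless of how many records partition 0 actually holds.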

If you want consecutive numbers, use RDD.zipWithIndex.
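To see why zipWithIndex can produce consecutive numbers where the partition-encoded IDs cannot, here is a rough pure-Python simulation of the idea. It does not call Spark; `zip_with_index` is a hypothetical stand-in that treats a list of lists as the partitions. The key point it illustrates is that each partition's starting offset depends on the sizes of all earlier partitions, which is why Spark has to run an extra job to count them when the RDD has more than one partition.

```python
from itertools import accumulate
from typing import Any, List, Tuple


def zip_with_index(partitions: List[List[Any]]) -> List[List[Tuple[Any, int]]]:
    """Hypothetical stand-in for RDD.zipWithIndex on a list of partitions."""
    # Pass 1: count the records in each partition.
    sizes = [len(part) for part in partitions]
    # Each partition starts where the previous ones end.
    offsets = [0] + list(accumulate(sizes))[:-1]
    # Pass 2: pair every record with its global, consecutive index.
    return [
        [(item, offset + i) for i, item in enumerate(part)]
        for part, offset in zip(partitions, offsets)
    ]


# Two partitions with three records each, as in the example above.
partitions = [["a", "b", "c"], ["d", "e", "f"]]
indexed = zip_with_index(partitions)
print(indexed)
```

In Spark itself, the equivalent step would be calling zipWithIndex on the DataFrame's underlying RDD and converting the result back to a DataFrame.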
