basix
by basix

Generating unique numeric values

There are several ways to generate unique numeric ids in Spark.

Here are some functions you can use:

  1. row_number()
  2. monotonically_increasing_id()
  3. rdd.zipWithIndex()
  4. hash()

They all have their pros and cons. Which one you choose depends on

1) whether it works (whether the ids you get are actually unique) and

2) whether you like the properties of the ids you end up with.

Row Number
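
The sketch below is a plain-Python model of the numbering row_number() produces, not Spark code. It assumes the usual window semantics: rows are numbered from 1 in the window's order, and the count restarts in every window partition. The helper `row_numbers` and the sample rows are hypothetical.

```python
from itertools import groupby

def row_numbers(rows, order_key, partition_key=lambda row: 0):
    """Model of row_number(): number rows from 1 within each window partition."""
    # Sort by partition first, then by the ordering inside each partition.
    ordered = sorted(rows, key=lambda row: (partition_key(row), order_key(row)))
    numbered = []
    for _, group in groupby(ordered, key=partition_key):
        # Numbering restarts at 1 for every partition.
        for number, row in enumerate(group, start=1):
            numbered.append((row, number))
    return numbered

# Hypothetical rows of (region, name).
rows = [("eu", "bob"), ("us", "alice"), ("eu", "carol")]

# No partitioning: one global sequence, so the ids are unique.
global_ids = [n for _, n in row_numbers(rows, order_key=lambda r: r[1])]

# Partitioned by region: numbering restarts per region, so ids repeat.
per_region_ids = [
    n for _, n in row_numbers(rows, order_key=lambda r: r[1], partition_key=lambda r: r[0])
]
```

In this model the ids are only globally unique when the window has no partitioning. In Spark that means all rows are brought into a single partition to be numbered, which can be slow on large data.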

Monotonically increasing id
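
The following plain-Python model shows why these ids are unique and increasing but not consecutive. It assumes the layout described in Spark's documentation for the current implementation: the partition index goes into the upper 31 bits and the record number within the partition into the lower 33 bits. The helper `monotonic_ids` is hypothetical and only mimics that layout.

```python
def monotonic_ids(partition_sizes):
    """Model of monotonically_increasing_id() for partitions of the given sizes."""
    ids = []
    for partition_index, size in enumerate(partition_sizes):
        for record_number in range(size):
            # Partition index in the upper bits, record number in the lower 33 bits.
            ids.append((partition_index << 33) + record_number)
    return ids

# Two partitions with three rows each.
ids = monotonic_ids([3, 3])
# ids == [0, 1, 2, 8589934592, 8589934593, 8589934594]
```

The jump between partitions is the price of not needing any coordination between them. The ids also depend on how the data happens to be partitioned, so they are not stable across runs with a different partitioning.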

RDD zipWithIndex
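
As a rough illustration of how consecutive indexes can be assigned across partitions, here is a plain-Python model, not Spark code. It assumes the ordering described in the RDD documentation: first by partition index, then by item order within each partition. The helper `zip_with_index` is hypothetical.

```python
from itertools import accumulate

def zip_with_index(partitions):
    """Model of rdd.zipWithIndex(): consecutive indexes across all partitions."""
    # Every partition needs to know how many items come before it,
    # so the partition sizes have to be counted first.
    sizes = [len(partition) for partition in partitions]
    offsets = [0] + list(accumulate(sizes))[:-1]
    return [
        [(item, offset + i) for i, item in enumerate(partition)]
        for partition, offset in zip(partitions, offsets)
    ]

# Three partitions of uneven size.
indexed = zip_with_index([["a", "b"], ["c"], ["d", "e"]])
# indexed == [[("a", 0), ("b", 1)], [("c", 2)], [("d", 3), ("e", 4)]]
```

The counting step is why Spark has to run an extra job for zipWithIndex when the RDD has more than one partition. In exchange, the ids are consecutive and start at 0.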

Hash functions
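
Whether hashing "works" comes down to collisions. The sketch below estimates the chance of at least one collision with the standard birthday approximation, assuming an ideal, uniformly distributed hash (real hash functions only approximate this). Spark's hash() returns a 32-bit integer; the helper `collision_probability` is hypothetical.

```python
import math

def collision_probability(n_rows, hash_bits):
    """Birthday approximation: chance that at least two of n_rows hashes collide."""
    space = 2 ** hash_bits
    # 1 - exp(-n(n-1) / (2 * space)), written with expm1 for small probabilities.
    return -math.expm1(-n_rows * (n_rows - 1) / (2 * space))

# 100,000 distinct rows hashed to 32 bits versus 64 bits.
p32 = collision_probability(100_000, 32)  # about 0.69
p64 = collision_probability(100_000, 64)  # below one in a billion
```

Under these assumptions a 32-bit hash is already more likely than not to collide at a hundred thousand rows. A 64-bit hash (for example Spark's xxhash64) makes collisions far less likely, but still does not guarantee uniqueness.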