basix
by basix

There are several ways of generating numeric values in Spark that can serve as unique identifiers. Here are some functions you can use for that:

  1. row_number()
  2. monotonically_increasing_id()
  3. rdd.zipWithIndex()
  4. hash()

They all have their pros and cons. Which one you choose depends on one or both of the points below:

  1. whether it really gives you the uniqueness you hope to get
  2. whether you like the properties of the ids you end up having

One would think that the first point is no point at all, since we have a fair understanding of what uniqueness is and it is easy to test whether we have achieved it or not. But it is slightly more complex than that: it comes down to hash collisions and the size of the table you are operating on. So, let’s start with that right away.
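To get a feel for how table size and hash width interact, here is a rough back-of-the-envelope sketch in plain Python (no Spark needed). It uses the standard birthday-bound approximation and assumes the hash spreads distinct inputs uniformly over its output range, which real hash functions only approximate. The function name collision_probability is made up for this example. The 32-bit case corresponds to the width of the integer that Spark's hash() returns, and the 64-bit case to the width of xxhash64().

```python
import math


def collision_probability(n_rows: int, hash_bits: int) -> float:
    """Approximate chance that at least two of n_rows distinct inputs
    end up with the same hash value.

    Assumption: the hash is uniformly distributed over 2**hash_bits values
    (birthday-bound approximation: 1 - exp(-n(n-1) / (2N))).
    """
    space = 2.0 ** hash_bits
    exponent = -n_rows * (n_rows - 1) / (2.0 * space)
    # -expm1(x) equals 1 - exp(x) but stays accurate for tiny probabilities.
    return -math.expm1(exponent)


# Compare a 32-bit and a 64-bit hash for a few table sizes.
for n in (10_000, 100_000, 1_000_000, 100_000_000):
    p32 = collision_probability(n, 32)
    p64 = collision_probability(n, 64)
    print(f"{n:>11,} rows  32-bit: {p32:.6f}  64-bit: {p64:.3e}")
```

Under that uniformity assumption, a 32-bit hash is already more likely than not to produce at least one collision in a table of about 100,000 rows, while a 64-bit hash stays far below that at the same size. Sampling the data and finding no duplicates therefore does not prove that the ids are unique on the full table.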

Hash functions

Row Number

To be precise, this is the pyspark.sql.functions.row_number window function, which acts on a DataFrame. Window functions come from the SQL world and act on a smaller set of rows (a window of rows) within a bigger set of rows being processed. Let me explain.
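The window idea can be made concrete without a cluster. The sketch below is a plain-Python emulation of what row_number() computes over a window that is partitioned by one column and ordered by another; in PySpark the corresponding expression would be row_number().over(Window.partitionBy("country").orderBy("name")). The helper row_number_over and the sample rows are hypothetical and exist only for illustration.

```python
from itertools import groupby
from operator import itemgetter


def row_number_over(rows, partition_by, order_by):
    """Emulate row_number() over a window: rows are split into windows by
    `partition_by`, sorted by `order_by` inside each window, and numbered
    from 1 within each window."""
    # Sort by partition key first so each window forms one contiguous run,
    # then by the ordering key inside the window.
    ordered = sorted(rows, key=itemgetter(partition_by, order_by))
    result = []
    for _, window in groupby(ordered, key=itemgetter(partition_by)):
        # Numbering restarts at 1 for every window.
        for number, row in enumerate(window, start=1):
            result.append({**row, "row_number": number})
    return result


# Hypothetical sample data.
people = [
    {"country": "NL", "name": "Anna"},
    {"country": "DE", "name": "Bernd"},
    {"country": "NL", "name": "Chris"},
    {"country": "DE", "name": "Dana"},
]

numbered = row_number_over(people, partition_by="country", order_by="name")
for row in numbered:
    print(row)
```

Note that the numbers repeat across windows, so they are unique only within a window. For row_number() to act as a unique identifier for the whole DataFrame, the window has to span all rows (an ordering without a partitioning column), and in that case Spark moves all data into a single partition to do the numbering, which can be slow on large tables.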

Monotonically increasing id

RDD zipWithIndex