basix
by basix
3 min read

Categories

  • docs

There is a bunch of different ways of generating numeric values in spark that can serve the purpose of unique identifiers. Here are some functions to utilize to accomplish that:

  1. row_number()
  2. monotonically_increasing_id()
  3. rdd.zipWithIndex()
  4. hash functions

They all have their pros and cons. What you choose depends on one or both points below:

  1. whether it really gives you the uniqueness you hope to get
  2. whether you like the properties of the ids you end up having

One would think that the first point is actually no point at all, since we kinda have an understanding what uniqueness is and it is easy to test weather we have achieved it or not. But it is slightly more complex than that. It goes back to the hashing collisions and the size of table you are operating on. So, let’s start with that right away.

Hash functions

There are many, many, many hash functions and you could theoretically use any of the if you had the implementation and register it as an udf. We are looking only for functions that will give us and integer as result and for those that are already in the spark library. This changes as new versions of the spark api come out and we are now at version 3.1.2.

Let’s list a few of the hash functions available for free in spark:

  • hash
  • xxhash64
  • crc32
  • sha, sha1 and sha2
  • md5

Now looking at it, I wish the function xxhash64 had been available in spark 2.X.X since that would have been the weapon of choice on the solution I was working on a few years back. The past is the past and I wanna explain with what restriction you have to fight in choosing the right hash function.

Row Number

To be precise, this is the pyspark.sql.functions.row_number1 window function which acts on a dataframe. Window functions are functions comming from the sql world which act on a smaller set of rows (a window of rows) within a bigger set of rows being processed. Let me explain.

If you provide null as partition (in sql you can omit it completely), it will act on a single row as the window and provide you with a nice sequence starting from 1.

SELECT row_number() OVER(ORDER BY NULL) as rownum, * from [db.]tablename 

Monothonically increasing id

Rdds ZipWithIndex