There are a bunch of different ways to generate numeric values in Spark that can serve as unique identifiers. Here are some functions you can use to accomplish that:
- hash functions
- the `row_number` window function
They all have their pros and cons. What you choose depends on one or both of the points below:
- whether it really gives you the uniqueness you hope to get
- whether you like the properties of the ids you end up having
One would think that the first point is no point at all, since we all have an understanding of what uniqueness is, and it is easy to test whether we have achieved it or not. But it is slightly more complex than that: it comes down to hash collisions and the size of the table you are operating on. So, let's start with that right away.
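To make the collision risk concrete, here is a quick back-of-the-envelope calculation in plain Python, using the standard birthday-problem approximation (the bit widths match Spark's 32-bit `hash` and 64-bit `xxhash64`):

```python
import math

def collision_probability(n_rows: int, hash_bits: int) -> float:
    """Approximate probability of at least one collision when hashing
    n_rows distinct keys into a space of 2**hash_bits values,
    using the birthday-problem approximation 1 - exp(-n(n-1) / 2d)."""
    d = 2 ** hash_bits
    return 1.0 - math.exp(-n_rows * (n_rows - 1) / (2.0 * d))

# A 32-bit hash is more likely than not to collide
# on a table of just 100k rows...
print(collision_probability(100_000, 32))        # roughly 0.69

# ...while a 64-bit hash stays comfortably low
# even at a billion rows.
print(collision_probability(1_000_000_000, 64))  # under 3%
```

This is why the size of your table dictates which hash function can plausibly serve as a unique ID.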
There are many, many, many hash functions, and you could theoretically use any of them if you had an implementation and registered it as a UDF. We are looking only for functions that give us an integer as a result and that are already in the Spark library. This changes as new versions of the Spark API come out, and we are now at version
Let's list a few of the hash functions available for free in Spark:
Now, looking at it, I wish the function `xxhash64` had been available in Spark 2.X.X, since it would have been the weapon of choice for the solution I was working on a few years back. But the past is the past, and I want to explain which restrictions you have to contend with when choosing the right hash function.
To be precise, this is the `pyspark.sql.functions.row_number` window function, which acts on a DataFrame. Window functions are functions coming from the SQL world which act on a smaller set of rows (a window of rows) within the bigger set of rows being processed. Let me explain.
If you provide `null` as the partition (in SQL you can omit the PARTITION BY clause completely), the whole dataset is treated as a single window, and `row_number` provides you with a nice sequence starting from 1.
```sql
SELECT row_number() OVER (ORDER BY NULL) AS rownum, * FROM [db.]tablename
```