There are several ways to generate numeric values in Spark that can serve as unique identifiers. Here are some functions you can use for that:
- row_number()
- monotonically_increasing_id()
- rdd.zipWithIndex()
- hash()
They all have their pros and cons. What you choose depends on one or both points below:
- whether it really gives you the uniqueness you hope to get
- whether you like the properties of the ids you end up having
One would think that the first point is no point at all, since we have a reasonable understanding of what uniqueness is, and it is easy to test whether we have achieved it. But it is slightly more complex than that: it comes down to hash collisions and the size of the table you are operating on. So, let’s start with that right away.
Hash functions
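To get a feeling for how table size and hash width interact, here is a minimal sketch in plain Python (no Spark needed) based on the standard birthday approximation. It assumes the hash behaves like a uniform random function, which real hash functions only approximate. The function name collision_probability is made up for this illustration. The 32-bit case corresponds to the integer returned by Spark's hash(), and the 64-bit case to a wider hash such as xxhash64().

```python
import math


def collision_probability(n_rows: int, hash_bits: int) -> float:
    """Approximate chance that at least two of n_rows distinct inputs
    get the same hash value, assuming an ideal uniform hash (an assumption)."""
    buckets = 2.0 ** hash_bits
    # Birthday approximation: p ~= 1 - exp(-n * (n - 1) / (2 * buckets)).
    # expm1 keeps precision when the probability is very small.
    return -math.expm1(-n_rows * (n_rows - 1) / (2.0 * buckets))


if __name__ == "__main__":
    for n in (1_000, 100_000, 10_000_000):
        p32 = collision_probability(n, 32)
        p64 = collision_probability(n, 64)
        print(f"{n:>12,} rows: 32-bit p={p32:.3g}, 64-bit p={p64:.3g}")
```

Under these assumptions, a 32-bit hash is already more likely than not to collide somewhere in a table of about a hundred thousand rows, while a 64-bit hash stays far below that at the same size. This is only an estimate, so it is still worth checking the actual data for duplicates.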
Row Number
To be precise, this is the pyspark.sql.functions.row_number window function, which acts on a DataFrame. Window functions come from the SQL world and act on a smaller set of rows (a window of rows) within a bigger set of rows being processed. Let me explain.