The solution I suggested in “Stratified sampling in Spark” is pretty straightforward to convert from Scala to Python (or even to Java, see “What’s the easiest way to stratify a Spark Dataset?”). Nevertheless, I’ll rewrite it in Python. Let’s start by creating a toy DataFrame:

    from pyspark.sql.functions import lit

    list = [(2147481832, 23355149, 1), (2147481832, 973010692, 1),
            (2147481832, 2134870842, 1), (2147481832, 541023347, 1),
            (2147481832, 1682206630, 1), (2147481832, 1138211459, 1),
            (2147481832, 852202566, 1), (2147481832, 201375938, 1),
            (2147481832, 486538879, 1), (2147481832, 919187908, 1),
            (214748183, 919187908, 1), (214748183, 91187908, 1)]
    df …
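The text above does not name the toy DataFrame's columns, so what follows is only a minimal sketch of one way stratified sampling can look in PySpark, not necessarily the method the original Scala solution uses. It relies on `DataFrame.sampleBy`; the column names `key`, `value`, `flag` and the helpers `stratum_fractions` and `stratified_sample` are hypothetical names chosen for illustration.

```python
# Toy rows matching the tuples above: (key, value, flag).
rows = [(2147481832, 23355149, 1), (2147481832, 973010692, 1),
        (2147481832, 2134870842, 1), (2147481832, 541023347, 1),
        (2147481832, 1682206630, 1), (2147481832, 1138211459, 1),
        (2147481832, 852202566, 1), (2147481832, 201375938, 1),
        (2147481832, 486538879, 1), (2147481832, 919187908, 1),
        (214748183, 919187908, 1), (214748183, 91187908, 1)]


def stratum_fractions(rows, fraction):
    """Map every distinct key (first tuple element) to the same fraction."""
    return {row[0]: fraction for row in rows}


def stratified_sample(rows, fractions, seed=42):
    """Build a DataFrame from `rows` and sample each stratum of `key`.

    Assumption: column names are hypothetical; the source does not give them.
    """
    # Imported here so the pure-Python helper above works without Spark.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[1]").getOrCreate()
    df = spark.createDataFrame(rows, ["key", "value", "flag"])
    # sampleBy samples without replacement, stratum by stratum; keys that
    # are missing from `fractions` are treated as having fraction 0.
    return df.sampleBy("key", fractions=fractions, seed=seed)
```

Calling `stratified_sample(rows, stratum_fractions(rows, 0.5)).show()` would display the sampled rows. Because `sampleBy` draws each row independently, the number of rows kept per key is only approximately the requested fraction, which is worth keeping in mind with strata as small as the two-row key here.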