How to run independent transformations in parallel using PySpark?
Just use threads, and make sure the cluster has enough resources to process both tasks at the same time.

```python
from threading import Thread
import time

def process(rdd, f):
    def delay(x):
        time.sleep(1)
        return f(x)
    return rdd.map(delay).sum()

rdd = sc.parallelize(range(100), int(sc.defaultParallelism / 2))

t1 = Thread(target=process, args=(rdd, lambda x: x * 2))
t2 = Thread(target=process, args=(rdd, lambda …
```
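The snippet above is cut off before the threads are started and joined, and a bare `Thread` also discards `process`'s return value. A minimal, self-contained sketch of the same pattern, using `concurrent.futures.ThreadPoolExecutor` (which collects results for you) and plain Python lists standing in for RDDs so it runs without a Spark cluster:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def process(data, f):
    # Stand-in for rdd.map(delay).sum(): apply f with a small
    # per-element delay to simulate work on the cluster.
    def delay(x):
        time.sleep(0.01)
        return f(x)
    return sum(delay(x) for x in data)

data = list(range(100))

start = time.time()
with ThreadPoolExecutor(max_workers=2) as pool:
    # Submit both independent transformations; they run concurrently
    # because time.sleep (like a blocking Spark action) releases the GIL.
    doubled = pool.submit(process, data, lambda x: x * 2)
    squared = pool.submit(process, data, lambda x: x * x)
    results = (doubled.result(), squared.result())
elapsed = time.time() - start
```

With real RDDs, each `pool.submit` would trigger its own Spark job, and both jobs run concurrently as long as the cluster has free executor slots; the driver-side threads merely issue the actions in parallel.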