how do we choose --nthreads and --nprocs per worker in dask distributed?

It depends on your workload. By default Dask creates a single process with as many threads as you have logical cores on your machine (as determined by multiprocessing.cpu_count()).

dask-worker … --nprocs 1 --nthreads 8  # assuming you have eight cores
dask-worker …                          # this is actually the default setting

Using few processes and many threads … Read more
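For the same choice made programmatically, here is a minimal sketch using dask.distributed's LocalCluster; the one-process/eight-thread split is an assumption for an eight-core machine running GIL-releasing numeric code:

from dask.distributed import Client, LocalCluster

# One process with eight threads suits numeric code that releases the GIL
# (NumPy, pandas); many one-threaded processes suit pure-Python workloads.
cluster = LocalCluster(n_workers=1, threads_per_worker=8)
client = Client(cluster)
print(client)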

how to parallelize many (fuzzy) string comparisons using apply in Pandas?

You can parallelize this with Dask.dataframe.

>>> dmaster = dd.from_pandas(master, npartitions=4)
>>> dmaster['my_value'] = dmaster.original.apply(lambda x: helper(x, slave), name='my_value')
>>> dmaster.compute()
                  original  my_value
0  this is a nice sentence         2
1      this is another one         3
2    stackoverflow is nice         1

Additionally, you should think about the tradeoffs between using threads vs. processes here. Your … Read more
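The helper function and the slave frame above come from the question and are not shown; as a self-contained sketch of the pattern, here is one way it might look, with difflib standing in for whatever fuzzy matcher is actually used, and with meta= (which recent Dask versions expect in place of name=):

import difflib

import pandas as pd
import dask.dataframe as dd

master = pd.DataFrame({"original": ["this is a nice sentence",
                                    "this is another one",
                                    "stackoverflow is nice"]})
slave = pd.DataFrame({"name": ["nice sentence", "another one", "stackoverflow"]})

def helper(sentence, slave_df):
    # Count slave rows that fuzzily match the sentence above a 0.5 ratio;
    # difflib is a stand-in for the question's actual fuzzy matcher.
    return sum(difflib.SequenceMatcher(None, sentence, name).ratio() > 0.5
               for name in slave_df["name"])

dmaster = dd.from_pandas(master, npartitions=4)
dmaster["my_value"] = dmaster["original"].apply(lambda x: helper(x, slave),
                                                meta=("my_value", "int64"))
print(dmaster.compute())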

python dask DataFrame, support for (trivially parallelizable) row apply?

map_partitions

You can apply your function to all of the partitions of your dataframe with the map_partitions function.

df.map_partitions(func, columns=…)

Note that func will be given only part of the dataset at a time, not the entire dataset as with pandas apply (which presumably you wouldn't want if you want parallelism). map / … Read more
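A minimal runnable sketch of the idea; the toy frame and doubling function are assumptions, and recent Dask versions take meta= where older ones took columns=:

import pandas as pd
import dask.dataframe as dd

df = dd.from_pandas(pd.DataFrame({"x": range(10)}), npartitions=2)

def func(partition):
    # partition arrives as a plain pandas DataFrame holding part of the data,
    # so anything written for pandas works here chunk by chunk
    return partition.assign(y=partition["x"] * 2)

result = df.map_partitions(func, meta={"x": "int64", "y": "int64"})
print(result.compute())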

Make Pandas DataFrame apply() use all cores?

The simplest way is to use Dask's map_partitions. You need these imports (you will need to pip install dask):

import pandas as pd
import dask.dataframe as dd
from dask.multiprocessing import get

and the syntax is

data = <your_pandas_dataframe>
ddata = dd.from_pandas(data, npartitions=30)

def myfunc(x, y, z, …):
    return <whatever>

res = ddata.map_partitions(lambda df: df.apply((lambda row: myfunc(*row)), axis=1)).compute(get=get)

… Read more
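On current Dask versions the get= keyword is gone; compute(scheduler="processes") is the replacement. A self-contained sketch under that assumption, with a toy frame and myfunc standing in for your own:

import pandas as pd
import dask.dataframe as dd

def myfunc(x, y, z):
    # toy stand-in for the row-wise function being parallelized
    return x + y * z

if __name__ == "__main__":  # guard needed when the processes scheduler spawns workers
    data = pd.DataFrame({"x": range(100), "y": range(100), "z": range(100)})
    ddata = dd.from_pandas(data, npartitions=30)
    res = ddata.map_partitions(
        lambda df: df.apply(lambda row: myfunc(*row), axis=1),
        meta=("result", "int64"),
    ).compute(scheduler="processes")
    print(res.head())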