Spark parquet partitioning: Large number of files

First, I would really avoid using coalesce: it is often pushed further up the chain of transformations and may destroy the parallelism of your job (I asked about this issue here: Coalesce reduces parallelism of entire stage (spark)). Writing one file per parquet partition is relatively easy (see Spark dataframe write method writing …
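
A minimal PySpark sketch of the idea, assuming a DataFrame with a partition column named `key` (a hypothetical example, not the original poster's data): repartitioning by the partition column before `partitionBy` routes all rows for a given key to one task, so each output directory gets a single file, without the parallelism loss that `coalesce` can cause upstream.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitioned-write").getOrCreate()

# Hypothetical example data with one partition column, 'key'.
df = spark.createDataFrame([(1, "a"), (2, "a"), (3, "b")], ["id", "key"])

# repartition("key") shuffles so every row with the same key lands in
# the same task; combined with partitionBy("key"), each partition
# directory is then written by exactly one task -> one file each.
# Unlike coalesce, this shuffle does not collapse the parallelism of
# the stages that come before the write.
(df.repartition("key")
   .write
   .partitionBy("key")
   .mode("overwrite")
   .parquet("/tmp/out"))
```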

Working with big data in python and numpy, not enough ram, how to save partial results on disc?

Using numpy.memmap you create arrays directly mapped into a file: import numpy a = numpy.memmap('test.mymemmap', dtype='float32', mode='w+', shape=(200000,1000)) # here you will see a 762MB file created in your working directory You can treat it as a conventional array: a += 1000. It is possible even to assign more arrays to the same file, controlling …
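
Expanded into a runnable sketch (the file name and shape are the example values from the excerpt; the second mapping at a byte offset is a hypothetical illustration of how several arrays can share one file):

```python
import numpy as np

# Disk-backed array: the data lives in the file, not in RAM
# (creates an ~762 MB file: 200000 * 1000 * 4 bytes).
a = np.memmap("test.mymemmap", dtype="float32", mode="w+",
              shape=(200000, 1000))

a += 1000.0   # behaves like an ordinary ndarray
a.flush()     # push dirty pages back to the file on disk

# A second array mapped into the SAME file at a byte offset
# (here: the start of a's second row), so both views share storage.
b = np.memmap("test.mymemmap", dtype="float32", mode="r+",
              offset=1000 * 4, shape=(1000,))
assert b[0] == a[1, 0]   # same bytes on disk
```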

Calculating and saving space in PostgreSQL

“Column Tetris.” Actually, you can do something, but it needs a deeper understanding. The keyword is alignment padding: every data type has specific alignment requirements, and you can minimize the space lost to padding between columns by ordering them favorably. The following (extreme) example would waste a lot of physical disk space: CREATE TABLE t ( e int2 …
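
A hedged SQL sketch of the effect (table and column names are made up, and exact byte counts depend on the platform's MAXALIGN): an int2 followed by an int8 forces 6 bytes of padding so the int8 can start on an 8-byte boundary; ordering columns large-to-small avoids that.

```sql
-- Padding-heavy ordering: each int2 is followed by 6 wasted bytes.
CREATE TABLE t_bad (
  e int2,   -- 2 bytes + 6 bytes padding before the next column
  a int8,   -- must start on an 8-byte boundary
  f int2,   -- 2 bytes + 6 bytes padding again
  b int8
);

-- Same columns ordered large-to-small: no padding between them.
CREATE TABLE t_good (
  a int8,
  b int8,
  e int2,
  f int2
);

INSERT INTO t_bad  VALUES (1, 1, 1, 1);
INSERT INTO t_good VALUES (1, 1, 1, 1);

-- Compare the on-disk size of one whole row from each table:
SELECT pg_column_size(t_bad.*)  FROM t_bad;   -- larger
SELECT pg_column_size(t_good.*) FROM t_good;  -- smaller
```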