Spark parquet partitioning: Large number of files

First, I would really avoid using coalesce: it is often pushed further up the chain of transformations and may destroy the parallelism of your job (I asked about this issue here: Coalesce reduces parallelism of entire stage (spark)). Writing one file per parquet partition is relatively easy (see Spark dataframe write method writing …
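
A minimal PySpark sketch of the idea, assuming a DataFrame with a partition column named `key` (a hypothetical example, not the original poster's data): repartitioning by the partition column before `partitionBy` routes all rows for a given key to one task, so each output directory gets a single file, without the parallelism loss that `coalesce` can cause upstream.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitioned-write").getOrCreate()

# Hypothetical example data with one partition column, 'key'.
df = spark.createDataFrame([(1, "a"), (2, "a"), (3, "b")], ["id", "key"])

# repartition("key") shuffles so every row with the same key lands in
# the same task; combined with partitionBy("key"), each partition
# directory is then written by exactly one task -> one file each.
# Unlike coalesce, this shuffle does not collapse the parallelism of
# the stages that come before the write.
(df.repartition("key")
   .write
   .partitionBy("key")
   .mode("overwrite")
   .parquet("/tmp/out"))
```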

Working with big data in python and numpy, not enough ram, how to save partial results on disc?

Using numpy.memmap you create arrays directly mapped into a file: import numpy a = numpy.memmap('test.mymemmap', dtype='float32', mode='w+', shape=(200000,1000)) # here you will see a 762MB file created in your working directory You can treat it as a conventional array: a += 1000. It is possible even to assign more arrays to the same file, controlling …
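
Expanded into a runnable sketch (the file name and shape are the example values from the excerpt; the second mapping at a byte offset is a hypothetical illustration of how several arrays can share one file):

```python
import numpy as np

# Disk-backed array: the data lives in the file, not in RAM
# (creates an ~762 MB file: 200000 * 1000 * 4 bytes).
a = np.memmap("test.mymemmap", dtype="float32", mode="w+",
              shape=(200000, 1000))

a += 1000.0   # behaves like an ordinary ndarray
a.flush()     # push dirty pages back to the file on disk

# A second array mapped into the SAME file at a byte offset
# (here: the start of a's second row), so both views share storage.
b = np.memmap("test.mymemmap", dtype="float32", mode="r+",
              offset=1000 * 4, shape=(1000,))
assert b[0] == a[1, 0]   # same bytes on disk
```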

Calculating and saving space in PostgreSQL

“Column Tetris.” Actually, you can do something, but it needs a deeper understanding. The keyword is alignment padding: every data type has specific alignment requirements, and you can minimize the space lost to padding between columns by ordering them favorably. The following (extreme) example would waste a lot of physical disk space: CREATE TABLE t ( e int2 …
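
A hedged SQL sketch of the effect (table and column names are made up, and exact byte counts depend on the platform's MAXALIGN): an int2 followed by an int8 forces 6 bytes of padding so the int8 can start on an 8-byte boundary; ordering columns large-to-small avoids that.

```sql
-- Padding-heavy ordering: each int2 is followed by 6 wasted bytes.
CREATE TABLE t_bad (
  e int2,   -- 2 bytes + 6 bytes padding before the next column
  a int8,   -- must start on an 8-byte boundary
  f int2,   -- 2 bytes + 6 bytes padding again
  b int8
);

-- Same columns ordered large-to-small: no padding between them.
CREATE TABLE t_good (
  a int8,
  b int8,
  e int2,
  f int2
);

INSERT INTO t_bad  VALUES (1, 1, 1, 1);
INSERT INTO t_good VALUES (1, 1, 1, 1);

-- Compare the on-disk size of one whole row from each table:
SELECT pg_column_size(t_bad.*)  FROM t_bad;   -- larger
SELECT pg_column_size(t_good.*) FROM t_good;  -- smaller
```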