python – Using pandas structures with large csv (iterate and chunksize)

Solution, if you need to create one big DataFrame and process all the data at once (which is possible, but not recommended):

Then use concat to combine all the chunks into a df, because the type of the output of the function:

df = pd.read_csv('Check1_900.csv', sep='\t', iterator=True, chunksize=1000)

isn't a DataFrame, but a pandas.io.parsers.TextFileReader.

tp = pd.read_csv('Check1_900.csv', sep='\t', iterator=True, chunksize=1000)
print(tp)
#<pandas.io.parsers.TextFileReader object at 0x00000000150E0048>
df = pd.concat(tp, ignore_index=True)
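If you do not actually need everything in memory at once (the "not recommended" case above), a better pattern is to work chunk by chunk and keep only small intermediate results. A minimal sketch, assuming a hypothetical column named 'value' in the file:

import pandas as pd

total = 0
for chunk in pd.read_csv('Check1_900.csv', sep='\t', chunksize=1000):
    # reduce each chunk to a small result instead of keeping all rows in memory
    total += chunk['value'].sum()
print(total)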

I think it is necessary to add the parameter ignore_index to the concat function, to avoid duplicate index values.
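A minimal sketch with hand-made frames showing why ignore_index matters; each small frame starts its index at 0, so a plain concat keeps the duplicate labels:

import pandas as pd

chunk1 = pd.DataFrame({'a': [1, 2]})   # index 0, 1
chunk2 = pd.DataFrame({'a': [3, 4]})   # index 0, 1 again
print(pd.concat([chunk1, chunk2]).index.tolist())                      # [0, 1, 0, 1]
print(pd.concat([chunk1, chunk2], ignore_index=True).index.tolist())   # [0, 1, 2, 3]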

EDIT:

But if you want to work with large data, e.g. aggregating, it is much better to use dask, because it provides advanced parallelism.
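A minimal dask sketch, assuming the same file and the same hypothetical 'value' column; dask.dataframe reads the csv lazily and only runs the (parallelised) work when .compute() is called:

import dask.dataframe as dd

ddf = dd.read_csv('Check1_900.csv', sep='\t')
result = ddf['value'].mean().compute()   # work is split across partitions
print(result)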
