Fastest way to parse large CSV files in Pandas

As @chrisb said, pandas’ read_csv is probably faster than csv.reader, numpy.genfromtxt, or numpy.loadtxt. I don’t think you will find anything better for parsing the CSV (note that read_csv is not a ‘pure Python’ solution, since its CSV parser is implemented in C). But if you have to load or query the data often, a solution would be to parse …
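
As a rough illustration of the read_csv call discussed above, here is a minimal sketch. The in-memory CSV text and the column names `id` and `value` are hypothetical stand-ins for a real file; `dtype` and `chunksize` are existing read_csv parameters, but whether they help depends on your data.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV standing in for a large file on disk.
csv_text = "id,value\n1,0.5\n2,1.5\n3,2.5\n"

# Declaring dtypes up front spares the parser from inferring them.
df = pd.read_csv(io.StringIO(csv_text), dtype={"id": "int64", "value": "float64"})

# For files that do not fit in memory, chunksize yields DataFrames piece by piece.
total = 0.0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    total += chunk["value"].sum()
```

With a real file, the `io.StringIO(...)` argument would simply be replaced by the file path.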

Confusion about pandas copy of slice of dataframe warning

    izmir = pd.read_excel(filepath)
    izmir_lim = izmir[['Gender', 'Age', 'MC_OLD_M>=60', 'MC_OLD_F>=60',
                       'MC_OLD_M>18', 'MC_OLD_F>18', 'MC_OLD_18>M>5',
                       'MC_OLD_18>F>5', 'MC_OLD_M_Child<5', 'MC_OLD_F_Child<5',
                       'MC_OLD_M>0<=1', 'MC_OLD_F>0<=1', 'Date to Delivery',
                       'Date to insert', 'Date of Entery']]

izmir_lim is a view/copy of izmir. You subsequently attempt to assign to it, and that is what triggers the warning. Use this instead:

    izmir_lim = izmir[['Gender', 'Age', 'MC_OLD_M>=60', 'MC_OLD_F>=60',
                       'MC_OLD_M>18', 'MC_OLD_F>18', 'MC_OLD_18>M>5',
                       'MC_OLD_18>F>5', 'MC_OLD_M_Child<5', 'MC_OLD_F_Child<5',
                       'MC_OLD_M>0<=1', 'MC_OLD_F>0<=1', 'Date to Delivery',
                       'Date to insert', 'Date of Entery']].copy()

Whenever you ‘create’ …
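
The effect of the explicit copy can be seen in a small sketch. The three-column frame below is a hypothetical stand-in for the Excel data, not the asker's actual file.

```python
import pandas as pd

# Hypothetical stand-in for the data loaded from Excel.
izmir = pd.DataFrame({"Gender": ["F", "M"], "Age": [34, 61], "Other": [1, 2]})

# .copy() makes the column subset an independent DataFrame.
izmir_lim = izmir[["Gender", "Age"]].copy()

# Assigning to the copy leaves the original frame untouched.
izmir_lim["Age"] = izmir_lim["Age"] + 1
```

After this runs, `izmir["Age"]` still holds the original values, while `izmir_lim["Age"]` holds the incremented ones.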

pandas combine two strings ignore nan values

Call fillna with an empty string as the fill value, then sum with axis=1:

    In [3]: df = pd.DataFrame({'a': ['asd', np.nan, 'asdsa'], 'b': ['asdas', 'asdas', np.nan]})
            df
    Out[3]:
           a      b
    0    asd  asdas
    1    NaN  asdas
    2  asdsa    NaN

    In [7]: df['a+b'] = df.fillna('').sum(axis=1)
            df
    Out[7]:
           a      b       a+b
    0    asd  asdas  asdasdas
    1    NaN  asdas     asdas
    2 …
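
For two columns, a variant of the same idea is to fill each column and concatenate with `+` instead of summing across the row. This is an alternative sketch, not the approach quoted above, and it reuses the example frame from the answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": ["asd", np.nan, "asdsa"], "b": ["asdas", "asdas", np.nan]})

# Replace missing values with empty strings, then concatenate elementwise.
df["a+b"] = df["a"].fillna("") + df["b"].fillna("")
```

Filling first matters: adding a string to NaN would otherwise produce NaN for that row.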

“Too many indexers” with DataFrame.loc

The reason this doesn’t work is tied to the need to specify the axis of indexing (mentioned in the pandas documentation on slicing a MultiIndex). An alternative solution to your problem is to simply do this:

    df.loc(axis=0)[:, :, 'C1', :]

Pandas sometimes gets confused when indexes are similar or contain similar values. If you were to have a column named ‘C1’ …
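
A small sketch of the axis-qualified lookup, assuming a hypothetical four-level row index (the level names and labels below are invented for illustration). The `pd.IndexSlice` form is shown as an equivalent spelling that names both axes explicitly.

```python
import pandas as pd

# Hypothetical four-level row index; from_product keeps it sorted,
# which slicing on a MultiIndex relies on.
index = pd.MultiIndex.from_product(
    [["A0", "A1"], ["B0", "B1"], ["C0", "C1"], ["D0", "D1"]],
    names=["a", "b", "c", "d"],
)
df = pd.DataFrame({"value": range(len(index))}, index=index)

# Telling .loc that every indexer applies to the rows removes the ambiguity.
subset = df.loc(axis=0)[:, :, "C1", :]

# Same selection with both axes spelled out: row slicers first, then all columns.
same = df.loc[pd.IndexSlice[:, :, "C1", :], :]
```

Both forms select every row whose third level equals 'C1'.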

Replace whole string if it contains substring in pandas

You can use str.contains to mask the rows that contain ‘ball’ and then overwrite them with the new value:

    In [71]: df.loc[df['sport'].str.contains('ball'), 'sport'] = 'ball sport'
             df
    Out[71]:
        name       sport
    0    Bob      tennis
    1   Jane  ball sport
    2  Alice  ball sport

To make it case-insensitive, pass case=False:

    df.loc[df['sport'].str.contains('ball', case=False), 'sport'] = 'ball sport'
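
If the column can contain missing values, the mask itself may contain missing entries, which `.loc` does not accept as a boolean indexer. A minimal sketch of one way to handle that, assuming hypothetical sample data, is to pass `na=False` so missing rows count as non-matches:

```python
import numpy as np
import pandas as pd

# Hypothetical data with a mixed-case match and a missing value.
df = pd.DataFrame(
    {
        "name": ["Bob", "Jane", "Alice", "Dan"],
        "sport": ["tennis", "Football", "basketball", np.nan],
    }
)

# na=False treats missing entries as "does not contain".
mask = df["sport"].str.contains("ball", case=False, na=False)
df.loc[mask, "sport"] = "ball sport"
```

Rows that match are overwritten; the missing value in the last row is left as it was.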