Fastest way to parse large CSV files in Pandas

As @chrisb said, pandas’ read_csv is probably faster than csv.reader/numpy.genfromtxt/loadtxt. I don’t think you will find anything better for parsing the CSV (note that read_csv is not a ‘pure Python’ solution, since the CSV parser is implemented in C). But if you have to load/query the data often, a solution would be to parse …
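
As a rough illustration of the read_csv call discussed above, here is a minimal sketch. The in-memory CSV text and the column names id and value are made up for the example; engine="c" only spells out the C parser, and passing dtype up front is one common way to spare pandas the type inference on a large file.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for a large CSV file on disk.
csv_text = "id,value\n1,0.5\n2,1.5\n3,2.5\n"

# engine="c" selects the C parser explicitly; declaring dtypes up front
# avoids type inference on the columns.
df = pd.read_csv(
    io.StringIO(csv_text),
    engine="c",
    dtype={"id": "int64", "value": "float64"},
)

print(df.dtypes)
print(df)
```

For a real file you would pass the path instead of the StringIO object.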

Confusion about pandas copy of slice of dataframe warning

    izmir = pd.read_excel(filepath)
    izmir_lim = izmir[['Gender', 'Age', 'MC_OLD_M>=60', 'MC_OLD_F>=60',
                       'MC_OLD_M>18', 'MC_OLD_F>18', 'MC_OLD_18>M>5',
                       'MC_OLD_18>F>5', 'MC_OLD_M_Child<5', 'MC_OLD_F_Child<5',
                       'MC_OLD_M>0<=1', 'MC_OLD_F>0<=1', 'Date to Delivery',
                       'Date to insert', 'Date of Entery']]

izmir_lim is a view/copy of izmir. You subsequently attempt to assign to it. This is what triggers the warning. Use this instead:

    izmir_lim = izmir[['Gender', 'Age', 'MC_OLD_M>=60', 'MC_OLD_F>=60',
                       'MC_OLD_M>18', 'MC_OLD_F>18', 'MC_OLD_18>M>5',
                       'MC_OLD_18>F>5', 'MC_OLD_M_Child<5', 'MC_OLD_F_Child<5',
                       'MC_OLD_M>0<=1', 'MC_OLD_F>0<=1', 'Date to Delivery',
                       'Date to insert', 'Date of Entery']].copy()

Whenever you ‘create’ …
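
The effect of .copy() can be checked on a small frame. This is only a sketch: the three-column DataFrame below is a made-up stand-in for the Excel data, and the Senior column is a hypothetical assignment, not something from the question.

```python
import pandas as pd

# Hypothetical stand-in for the data loaded from Excel.
izmir = pd.DataFrame({"Gender": ["F", "M"], "Age": [34, 61], "Other": [1, 2]})

# .copy() makes izmir_lim an independent DataFrame rather than a
# view/copy of izmir.
izmir_lim = izmir[["Gender", "Age"]].copy()

# Assigning a new column now modifies only the copy.
izmir_lim["Senior"] = izmir_lim["Age"] >= 60

print(list(izmir.columns))
print(list(izmir_lim.columns))
```

The original frame keeps its columns, and the assignment lands only on izmir_lim.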

pandas combine two strings ignore nan values

Call fillna and pass an empty str as the fill value and then sum with param axis=1:

    In [3]: df = pd.DataFrame({'a':['asd',np.nan,'asdsa'], 'b':['asdas','asdas',np.nan]})
            df
    Out[3]:
           a      b
    0    asd  asdas
    1    NaN  asdas
    2  asdsa    NaN

    In [7]: df['a+b'] = df.fillna('').sum(axis=1)
            df
    Out[7]:
           a      b       a+b
    0    asd  asdas  asdasdas
    1    NaN  asdas     asdas
    2 …
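
If you would rather not rely on sum over string columns, a variant of the same idea is to fill each column and concatenate with +. Note this swaps the row-wise sum for plain string addition; the data is the frame from the answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": ["asd", np.nan, "asdsa"],
                   "b": ["asdas", "asdas", np.nan]})

# Replace missing values with an empty string in each column, then
# concatenate element-wise so a NaN contributes nothing to the result.
df["a+b"] = df["a"].fillna("") + df["b"].fillna("")

print(df)
```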

“Too many indexers” with DataFrame.loc

The reason this doesn’t work is tied to the need to specify the axis of indexing (mentioned in http://pandas.pydata.org/pandas-docs/stable/advanced.html). An alternative solution to your problem is to simply do this:

    df.loc(axis=0)[:, :, 'C1', :]

Pandas gets confused sometimes when indexes are similar or contain similar values. If you were to have a column named ‘C1’ …
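
To try the axis argument yourself, here is a small sketch on a made-up four-level MultiIndex. The level names and labels, and the value column, are assumptions for the example. The index is built with from_product from sorted labels, so it is already sorted, which this kind of slicing needs.

```python
import pandas as pd

# Hypothetical four-level row index; the labels are invented for the example.
index = pd.MultiIndex.from_product(
    [["A0", "A1"], ["B0", "B1"], ["C0", "C1"], ["D0", "D1"]],
    names=["a", "b", "c", "d"],
)
df = pd.DataFrame({"value": range(len(index))}, index=index)

# axis=0 tells .loc that every element of the key addresses a level of
# the row index, so none of them is read as a column indexer.
result = df.loc(axis=0)[:, :, "C1", :]

print(result)
```

Every row in result has 'C1' in the third level, and all other levels are left unrestricted.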

Replace whole string if it contains substring in pandas

You can use str.contains to mask the rows that contain 'ball' and then overwrite with the new value:

    In [71]: df.loc[df['sport'].str.contains('ball'), 'sport'] = 'ball sport'
             df
    Out[71]:
        name       sport
    0    Bob      tennis
    1   Jane  ball sport
    2  Alice  ball sport

To make it case-insensitive, pass case=False:

    df.loc[df['sport'].str.contains('ball', case=False), 'sport'] = 'ball sport'
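
Here is a self-contained version you can run. The original sport values are not shown in the answer, so 'football' and 'Basketball' below are assumed inputs. na=False is an extra guard, not part of the answer, so that missing values are treated as non-matches in the mask.

```python
import pandas as pd

# Hypothetical input; only the names and 'tennis' appear in the answer.
df = pd.DataFrame({
    "name": ["Bob", "Jane", "Alice"],
    "sport": ["tennis", "football", "Basketball"],
})

# Boolean mask of rows whose sport contains 'ball', ignoring case.
mask = df["sport"].str.contains("ball", case=False, na=False)

# Overwrite the whole string in the matching rows.
df.loc[mask, "sport"] = "ball sport"

print(df)
```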