pandas - w3toppers.com

Convert a Spark DataFrame to pandas DF

The following should work.

Sample DataFrame:

    some_df = sc.parallelize([
        ("A", "no"),
        ("B", "yes"),
        ("B", "yes"),
        ("B", "no")]
    ).toDF(["user_id", "phone_number"])

Converting the DataFrame to a pandas DataFrame:

    pandas_df = some_df.toPandas()

Splitting multiple columns into rows in pandas dataframe

You can first split the columns, create a Series with stack, and remove whitespace with strip:

    s1 = df.value.str.split(',', expand=True).stack().str.strip().reset_index(level=1, drop=True)
    s2 = df.date.str.split(',', expand=True).stack().str.strip().reset_index(level=1, drop=True)

Then concat both Series to df1:

    df1 = pd.concat([s1,s2], axis=1, keys=['value','date'])

Remove the old columns value and date and join:

    print (df.drop(['value','date'], axis=1).join(df1).reset_index(drop=True))

      ticker account value      date
    0     aa  assets   100  20121231
    …
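Since the excerpt is cut off, here is a small self-contained sketch of the same split, stack, and join steps. The sample data is hypothetical, and it assumes every row holds the same number of comma-separated items in value and date, so that no missing values appear after the split.

```python
import pandas as pd

# Hypothetical sample data: each row packs two values and two dates
# into comma-separated strings.
df = pd.DataFrame({
    "ticker": ["aa", "bb"],
    "account": ["assets", "liabilities"],
    "value": ["100, 200", "50, 60"],
    "date": ["20121231, 20131231", "20141231, 20151231"],
})

# Split each column, stack the pieces into one long Series, strip
# whitespace, and drop the inner index level so that only the original
# row label remains.
s1 = df["value"].str.split(",", expand=True).stack().str.strip().reset_index(level=1, drop=True)
s2 = df["date"].str.split(",", expand=True).stack().str.strip().reset_index(level=1, drop=True)

# Put both Series side by side, then join them back onto the remaining columns.
df1 = pd.concat([s1, s2], axis=1, keys=["value", "date"])
result = df.drop(["value", "date"], axis=1).join(df1).reset_index(drop=True)

print(result)
```

Each original row is repeated once per split item, so the two input rows become four output rows.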

Pandas: convert date in month to the 1st day of next month

You can use pd.offsets.MonthBegin():

    In [261]: d = pd.to_datetime(['2011-09-30', '2012-02-28'])

    In [262]: d
    Out[262]: DatetimeIndex(['2011-09-30', '2012-02-28'], dtype='datetime64[ns]', freq=None)

    In [263]: d + pd.offsets.MonthBegin(1)
    Out[263]: DatetimeIndex(['2011-10-01', '2012-03-01'], dtype='datetime64[ns]', freq=None)

You'll find a lot of examples in the official pandas docs.

Suppress output of object when plotting in IPython

Just put a semicolon (;) after the last statement in the cell. This is an IPython/Jupyter feature; it does not work in the plain Python interpreter.

    plt.hist(…);

Why does it take ages to install Pandas on Alpine Linux?

Debian-based images only need pip to install packages in the .whl format:

    Downloading pandas-0.22.0-cp36-cp36m-manylinux1_x86_64.whl (26.2MB)
    Downloading numpy-1.14.1-cp36-cp36m-manylinux1_x86_64.whl (12.2MB)

The WHL format was developed as a quicker and more reliable method of installing Python software than rebuilding from source code every time. WHL files only have to be moved to the correct location on the target …

Merge two data frames based on common column values in Pandas

You can use pd.merge:

    import pandas as pd

    pd.merge(df1, df2, on="movie_title")

Only rows whose keys are found in both data frames are kept. If you want to keep all rows from the left data frame and only add values from df2 where a matching key is available, you can use how="left":

    pd.merge(df1, …
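To make the difference between the default inner merge and how="left" concrete, here is a minimal sketch. The two data frames and their contents are made up for illustration; only the movie_title key comes from the answer.

```python
import pandas as pd

# Hypothetical data: "B" exists only in df1, "D" exists only in df2.
df1 = pd.DataFrame({"movie_title": ["A", "B", "C"], "year": [2001, 2002, 2003]})
df2 = pd.DataFrame({"movie_title": ["A", "C", "D"], "rating": [7.0, 8.0, 9.0]})

# Default (inner) merge: only titles present in both frames survive.
inner = pd.merge(df1, df2, on="movie_title")

# Left merge: every row of df1 is kept; rating is NaN where df2 has no match.
left = pd.merge(df1, df2, on="movie_title", how="left")

print(inner)
print(left)
```

The inner result has two rows (A and C), while the left result keeps all three rows of df1, with a missing rating for B.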

Standard implementation of vectorize_sequences

Solution with MultiLabelBinarizer

Assuming sequences is an array of integers with maximum possible value up to dimension-1, we can use MultiLabelBinarizer from sklearn.preprocessing to replicate the behaviour of the function vectorize_sequences:

    from sklearn.preprocessing import MultiLabelBinarizer

    mlb = MultiLabelBinarizer(classes=range(dimension))
    mlb.fit_transform(sequences)

Solution with NumPy broadcasting

Assuming sequences is an array of integers with maximum possible value up to dimension-1 …
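The NumPy solution is cut off in this excerpt, so the following is only a sketch of one way broadcasting could produce the same multi-hot encoding, not necessarily the original author's code. It assumes sequences is a rectangular 2-D integer array; the function name vectorize_sequences_broadcast is made up for this example.

```python
import numpy as np

def vectorize_sequences_broadcast(sequences, dimension):
    """Multi-hot encode integer sequences via broadcasting.

    Assumes `sequences` is a rectangular 2-D array of integers in
    the range [0, dimension - 1].
    """
    sequences = np.asarray(sequences)
    # Shape (n, length, 1) compared with shape (dimension,) broadcasts to
    # (n, length, dimension): True where a sequence element equals a class index.
    one_hot = sequences[:, :, None] == np.arange(dimension)
    # A class is set for a row if any element of that row matched it.
    return one_hot.any(axis=1).astype(float)

sequences = np.array([[0, 2], [1, 1], [3, 0]])
encoded = vectorize_sequences_broadcast(sequences, dimension=4)
print(encoded)
```

The intermediate boolean array has n × length × dimension elements, so for long sequences or a large dimension a plain loop over rows may use far less memory.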

pandas dataframe group and sort by weekday

You can use an ordered categorical first (pandas before 0.21.0):

    cats = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
    df['Day of Week'] = df['Day of Week'].astype('category', categories=cats, ordered=True)

In pandas 0.21.0+ use:

    from pandas.api.types import CategoricalDtype

    cat_type = CategoricalDtype(categories=cats, ordered=True)
    df['Day of Week'] = df['Day of Week'].astype(cat_type)

Or reindex:

    df_weekday = df.groupby(['Day of Week']).sum().reindex(cats)
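As a rough end-to-end illustration of the CategoricalDtype approach, the sketch below groups a small made-up data frame by weekday. The sales column and its values are hypothetical, and observed=True is added here so that only weekdays present in the data show up in the result.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

cats = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

# Hypothetical data, deliberately not in weekday order.
df = pd.DataFrame({
    "Day of Week": ["Wednesday", "Monday", "Sunday", "Monday"],
    "sales": [3, 1, 5, 2],
})

# Convert to an ordered categorical so sorting follows the calendar
# order instead of alphabetical order.
cat_type = CategoricalDtype(categories=cats, ordered=True)
df["Day of Week"] = df["Day of Week"].astype(cat_type)

# groupby sorts its keys by category order.
by_day = df.groupby("Day of Week", observed=True)["sales"].sum()
print(by_day)
```

Without the categorical conversion, the groups would come out alphabetically (Monday, Sunday, Wednesday).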

Pandas: fill in NaN values using a dictionary that references another column

You can map the dict values inside fillna:

    df.B = df.B.fillna(df.A.map(dict))
    print(df)

       A  B
    0  a  2
    1  b  5
    2  c  4
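The input data is not shown in the excerpt, so here is a minimal sketch with made-up values. The dictionary is named lookup here (an assumption for this example) to avoid shadowing Python's built-in dict.

```python
import numpy as np
import pandas as pd

# Hypothetical data: B is missing for the row where A == "b".
df = pd.DataFrame({"A": ["a", "b", "c"], "B": [2, np.nan, 4]})

# Hypothetical lookup table keyed by the values in column A.
lookup = {"a": 1, "b": 5, "c": 9}

# map() builds a Series of fallback values aligned with df's index;
# fillna() uses them only where B is NaN.
df["B"] = df["B"].fillna(df["A"].map(lookup))
print(df)
```

Existing values in B are left untouched; only the missing entry for "b" is filled from the lookup.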

Kalman filter with varying timesteps

For a Kalman filter it is useful to represent the input data with a constant time step. Your sensors send data at irregular intervals, so you can define the smallest significant time step for your system and discretize the time axis with this step. For example, one of your sensors sends data approximately every 0.2 seconds and …
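The discretization step described above could be sketched as follows. This covers only the mapping of irregular timestamps onto a uniform grid, not the filter itself. The function name discretize, the choice to mark empty slots with None, and the rule that a later measurement overwrites an earlier one in the same slot are all assumptions made for this example.

```python
def discretize(measurements, dt):
    """Map (timestamp, value) pairs onto a uniform time grid with step dt.

    Returns a list of (grid_time, value_or_None) pairs. None marks a step
    for which no measurement arrived.
    """
    if not measurements:
        return []
    t0 = min(t for t, _ in measurements)
    slots = {}
    for t, value in sorted(measurements):
        k = round((t - t0) / dt)  # index of the nearest grid point
        slots[k] = value          # a later measurement in the same slot wins
    last = max(slots)
    return [(t0 + k * dt, slots.get(k)) for k in range(last + 1)]

# Hypothetical readings arriving roughly, but not exactly, every 0.2 seconds.
readings = [(0.0, 1.0), (0.21, 1.2), (0.58, 1.5)]
grid = discretize(readings, dt=0.2)
for t, value in grid:
    print(f"{t:.1f}", value)
```

One common way to use such a grid is to run the filter's prediction step at every grid point and apply the update step only where a value is present.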