Why does it take ages to install Pandas on Alpine Linux

Debian-based images can use pip to install pre-built packages in the .whl (wheel) format:

```
Downloading pandas-0.22.0-cp36-cp36m-manylinux1_x86_64.whl (26.2MB)
Downloading numpy-1.14.1-cp36-cp36m-manylinux1_x86_64.whl (12.2MB)
```

The WHL format was developed as a quicker and more reliable method of installing Python software than rebuilding from source code every time. WHL files only have to be moved to the correct location on the target … Read more
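The key is the platform tag baked into the wheel filename: `manylinux1` wheels are built against glibc, which Alpine (a musl libc distribution) does not provide, so pip cannot use them and falls back to compiling pandas and NumPy from source. A minimal sketch of how that tag can be read out of a wheel filename (the parsing helper here is illustrative, not a pip API):

```python
def wheel_tags(filename):
    """Split a simple wheel filename into (name, version, python, abi, platform).

    Sketch only: real wheel names may also carry an optional build tag;
    the five-part form covers the filenames quoted above.
    """
    stem = filename[:-len(".whl")]
    name, version, python, abi, platform = stem.split("-")
    return {"name": name, "version": version, "python": python,
            "abi": abi, "platform": platform}

tags = wheel_tags("pandas-0.22.0-cp36-cp36m-manylinux1_x86_64.whl")
# The platform tag is glibc-specific, so pip on Alpine skips this wheel.
print(tags["platform"])  # manylinux1_x86_64
```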

Standard implementation of vectorize_sequences

Solution with MultiLabelBinarizer

Assuming sequences is an array of integers with maximum possible value up to dimension-1, we can use MultiLabelBinarizer from sklearn.preprocessing to replicate the behaviour of the function vectorize_sequences:

```python
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer(classes=range(dimension))
mlb.fit_transform(sequences)
```

Solution with NumPy broadcasting

Assuming sequences is an array of integers with maximum possible value up to dimension-1 … Read more
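For reference, a self-contained sketch of the standard loop-based implementation these alternatives replicate, with made-up sample data (the `sequences` and `dimension` values below are illustrative):

```python
import numpy as np

def vectorize_sequences(sequences, dimension):
    """Multi-hot encode: one row per sequence, 1.0 in every listed column."""
    results = np.zeros((len(sequences), dimension))
    for i, seq in enumerate(sequences):
        # Fancy indexing sets all columns named in seq at once.
        results[i, seq] = 1.0
    return results

X = vectorize_sequences([[0, 2], [1, 1, 3]], dimension=4)
print(X)
# [[1. 0. 1. 0.]
#  [0. 1. 0. 1.]]
```

Note that duplicate indices (like the repeated 1 above) collapse to a single 1.0, which is also how the MultiLabelBinarizer version behaves.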

pandas dataframe group and sort by weekday

You can use an ordered categorical first:

```python
cats = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
df['Day of Week'] = df['Day of Week'].astype('category', categories=cats, ordered=True)
```

In pandas 0.21.0+ use:

```python
from pandas.api.types import CategoricalDtype

cat_type = CategoricalDtype(categories=cats, ordered=True)
df['Day of Week'] = df['Day of Week'].astype(cat_type)
```

Or reindex:

```python
df_weekday = df.groupby(['Day of Week']).sum().reindex(cats)
```
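Put together as a runnable sketch (the sample DataFrame and `Sales` column are made up stand-ins for the question's data; this uses the modern `CategoricalDtype` route):

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical sample data in place of the question's DataFrame.
df = pd.DataFrame({'Day of Week': ['Friday', 'Monday', 'Wednesday', 'Monday'],
                   'Sales': [30, 10, 20, 5]})

cats = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
df['Day of Week'] = df['Day of Week'].astype(CategoricalDtype(categories=cats, ordered=True))

# Grouping on an ordered categorical yields groups in weekday order,
# including empty categories when observed=False.
out = df.groupby('Day of Week', observed=False)['Sales'].sum()
print(out)
```

Days with no rows (Tuesday, Thursday, the weekend) show up with a sum of 0 rather than being dropped, which is often what you want for a weekday report.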

collect() or toPandas() on a large DataFrame in pyspark/EMR

TL;DR I believe you’re seriously underestimating the memory requirements. Even assuming the data is fully cached, the storage info will show only a fraction of the peak memory required to bring the data back to the driver. First of all, Spark SQL uses compressed columnar storage for caching; depending on the data distribution and compression algorithm, the in-memory size can … Read more

using time zone in pandas to_datetime

You can use tz_localize to set the timezone to UTC/+00:00, and then tz_convert to convert to the timezone you want:

```python
start = pd.to_datetime('2015-02-24')
rng = pd.date_range(start, periods=10)
df = pd.DataFrame({'Date': rng, 'a': range(10)})
df.Date = df.Date.dt.tz_localize('UTC').dt.tz_convert('Asia/Kolkata')
print(df)
```

```
                       Date  a
0 2015-02-24 05:30:00+05:30  0
1 2015-02-25 05:30:00+05:30  1
2 2015-02-26 05:30:00+05:30  2
3 2015-02-27 05:30:00+05:30  3
```

… Read more
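An equivalent one-step variant, sketched with made-up dates: `pd.to_datetime` accepts `utc=True`, which parses naive strings directly as UTC, after which a single `tz_convert` suffices:

```python
import pandas as pd

# Parse as UTC in one step instead of tz_localize('UTC') after the fact.
s = pd.to_datetime(pd.Series(['2015-02-24', '2015-02-25']), utc=True)
s = s.dt.tz_convert('Asia/Kolkata')
print(s.iloc[0])  # 2015-02-24 05:30:00+05:30
```

This behaves the same as the localize-then-convert chain above for naive timestamps; `utc=True` is mainly useful when the input mixes offsets.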