collect() or toPandas() on a large DataFrame in pyspark/EMR

TL;DR: I believe you’re seriously underestimating the memory requirements. Even assuming the data is fully cached, the storage info shows only a fraction of the peak memory required to bring the data back to the driver. First of all, Spark SQL uses compressed columnar storage for caching. Depending on the data distribution and compression algorithm, the in-memory size can …

using time zone in pandas to_datetime

You can use tz_localize to set the timezone to UTC/+0000, and then tz_convert to convert to the timezone you want:

    start = pd.to_datetime('2015-02-24')
    rng = pd.date_range(start, periods=10)
    df = pd.DataFrame({'Date': rng, 'a': range(10)})
    df.Date = df.Date.dt.tz_localize('UTC').dt.tz_convert('Asia/Kolkata')
    print(df)

                           Date  a
    0 2015-02-24 05:30:00+05:30  0
    1 2015-02-25 05:30:00+05:30  1
    2 2015-02-26 05:30:00+05:30  2
    3 2015-02-27 05:30:00+05:30  3
    …
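To make the difference between the two calls concrete, here is a minimal sketch using the same date and zone as above. The variable names (as_utc, as_kolkata) are only illustrative. The point is that tz_localize attaches a zone to naive timestamps without moving the clock, while tz_convert keeps the same instant and only changes how it is displayed.

```python
import pandas as pd

# Naive timestamps: no timezone attached yet.
rng = pd.date_range("2015-02-24", periods=3)
df = pd.DataFrame({"Date": rng, "a": range(3)})

# tz_localize: declare that the naive values are UTC (wall-clock time unchanged).
as_utc = df["Date"].dt.tz_localize("UTC")

# tz_convert: same instants, shown in another zone (clock shifts by +05:30).
as_kolkata = as_utc.dt.tz_convert("Asia/Kolkata")

print(as_utc.iloc[0])
print(as_kolkata.iloc[0])
```

Both series compare equal element by element, because they describe the same moments in time; only the representation differs.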

Jupyter notebook display two pandas tables side by side

I ended up writing a function that can do this [update: added titles based on suggestions (thanks @Antony_Hatchkins et al.)]:

    from IPython.display import display_html
    from itertools import chain, cycle

    def display_side_by_side(*args, titles=cycle([''])):
        html_str = ''
        for df, title in zip(args, chain(titles, cycle(['</br>']))):
            html_str += '<th style="text-align:center"><td style="vertical-align:top">'
            html_str += f'<h2 style="text-align: center;">{title}</h2>'
            html_str += df.to_html().replace('table', 'table style="display:inline"')
            html_str += '</td></th>'
        display_html(html_str, raw=True)

Example usage:

    df1 = pd.DataFrame(np.arange(12).reshape((3,4)), columns=['A','B','C','D',])
    df2 = …
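If you want to check the generated markup outside a notebook, a variation on the same idea is to build and return the HTML string instead of displaying it. This is only a sketch: side_by_side_html is a hypothetical helper, not part of the answer above, it wraps each table in a div rather than the th/td pair, and the second frame here is made-up sample data.

```python
import numpy as np
import pandas as pd

def side_by_side_html(*frames, titles=()):
    """Hypothetical helper: return combined HTML for several DataFrames."""
    parts = []
    for i, df in enumerate(frames):
        # Use an empty title when fewer titles than frames are given.
        title = titles[i] if i < len(titles) else ""
        # Make each table inline so the tables sit next to each other.
        table = df.to_html().replace("<table", '<table style="display:inline"')
        parts.append(
            '<div style="display:inline-block; vertical-align:top">'
            f'<h2 style="text-align:center">{title}</h2>{table}</div>'
        )
    return "".join(parts)

df1 = pd.DataFrame(np.arange(12).reshape((3, 4)), columns=["A", "B", "C", "D"])
# Made-up second frame, for illustration only.
df2 = pd.DataFrame(np.arange(16).reshape((4, 4)), columns=["A", "B", "C", "D"])

html = side_by_side_html(df1, df2, titles=["First", "Second"])
```

In a notebook the string can then be passed to display_html(html, raw=True) from IPython.display, as in the function above.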

(pandas) Create new column based on first element in groupby object

You need transform with first:

    print(df.groupby('Person')['Color'].transform('first'))
    0      blue
    1     green
    2    orange
    3      blue
    4     green
    5    orange
    Name: Color, dtype: object

    df['First_Col'] = df.groupby('Person')['Color'].transform('first')
    print(df)
        Color Person First_Col
    0    blue    bob      blue
    1   green    jim     green
    2  orange    joe    orange
    3  yellow    bob      blue
    4    pink    jim     green
    5  purple    joe    orange
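For a self-contained version, the sketch below rebuilds the frame from the printed output above (the constructor call is my reconstruction, not from the answer) and contrasts first() with transform('first'). The former returns one row per group; the latter broadcasts each group's first value back to every original row, which is why it can be assigned as a new column.

```python
import pandas as pd

# Frame reconstructed from the printed output above.
df = pd.DataFrame({
    "Color": ["blue", "green", "orange", "yellow", "pink", "purple"],
    "Person": ["bob", "jim", "joe", "bob", "jim", "joe"],
})

# Aggregation: one value per Person, indexed by Person.
firsts = df.groupby("Person")["Color"].first()

# Transform: same length and index as df, so it aligns row by row.
df["First_Col"] = df.groupby("Person")["Color"].transform("first")

print(firsts)
print(df)
```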

Trouble installing Pandas on new MacBook Air M1

Maybe it is too late, but the only solution that worked for me was installing from source, if you do not want to use Rosetta 2 or Miniconda:

    python3 -m pip install virtualenv
    virtualenv -p python3.8 venv
    source venv/bin/activate
    pip install --upgrade pip
    pip install numpy cython
    git clone --depth 1 https://github.com/pandas-dev/pandas.git
    cd pandas
    python3 setup.py install

After rename column get keyerror

You aren’t expected to alter the values attribute. Try df.columns.values = ['a', 'b', 'c'] and you get:

    ---------------------------------------------------------------------------
    AttributeError                            Traceback (most recent call last)
    <ipython-input-61-e7e440adc404> in <module>()
    ----> 1 df.columns.values = ['a', 'b', 'c']

    AttributeError: can't set attribute

That’s because pandas detects that you are trying to set the attribute and stops you. However, it …
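For comparison, here is a minimal sketch of two supported ways to rename columns that leave the values attribute alone. The frame and the label names are made up for illustration, and this is not necessarily what the rest of the answer goes on to recommend.

```python
import pandas as pd

# Made-up frame for illustration.
df = pd.DataFrame([[1, 2, 3]], columns=["x", "y", "z"])

# Option 1: assign a complete new list of labels to df.columns.
df.columns = ["a", "b", "c"]

# Option 2: rename selected columns; this returns a new DataFrame
# and leaves df itself unchanged.
renamed = df.rename(columns={"a": "alpha"})

print(df["a"])           # lookup by the new label works
print(renamed.columns)
```

Both go through the public API, so the column index and its internal lookup stay consistent and label-based access keeps working after the rename.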