Difference between df[x], df[[x]], df['x'], df[['x']] and df.x

df[x] — index a column using variable x. Returns pd.Series.
df[[x]] — index/slice a single-column DataFrame using variable x. Returns pd.DataFrame.
df['x'] — index the column named 'x'. Returns pd.Series.
df[['x']] — index/slice a single-column DataFrame having only one column, named 'x'. Returns pd.DataFrame.
df.x — dot accessor notation, equivalent to df['x'] (there are, however, … Read more
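A minimal sketch of the returned types (the DataFrame and column names here are invented for illustration):

import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]})
col = 'x'  # a variable holding a column name

print(type(df[col]))    # pandas Series
print(type(df[[col]]))  # single-column pandas DataFrame
print(type(df['x']))    # Series, same as df[col]
print(type(df[['x']]))  # single-column DataFrame
print(type(df.x))       # Series, equivalent to df['x']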

How to understand the closed and label arguments in the pandas resample method?

Short answer: if you use closed='left' and loffset='2T' then you'll get what you expected:

series.resample('3T', label='left', closed='left', loffset='2T').sum()
2000-01-01 00:02:00     3
2000-01-01 00:05:00    12
2000-01-01 00:08:00    21

Long answer (or: why the results you got were correct, given the arguments you used): this may not be clear from the documentation, but open and closed in … Read more
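To see how closed changes the binning, here is a small sketch with a toy one-value-per-minute series (assumed, not taken from the original question); note that loffset has since been deprecated in newer pandas in favour of the offset argument:

import pandas as pd

index = pd.date_range('2000-01-01', periods=9, freq='T')
series = pd.Series(range(9), index=index)

# closed='left': bins include their left edge, [00:00, 00:03), [00:03, 00:06), ...
print(series.resample('3T', label='left', closed='left').sum())

# closed='right': bins include their right edge, (23:57, 00:00], (00:00, 00:03], ...
print(series.resample('3T', label='right', closed='right').sum())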

Difference between === null and isNull in Spark DataFrame

First and foremost, don't use null in your Scala code unless you really have to for compatibility reasons. Regarding your question, it is plain SQL: col("c1") === null is interpreted as c1 = NULL and, because NULL marks undefined values, the result is undefined for any value, including NULL itself.

spark.sql("SELECT NULL = NULL").show
+-------------+
|(NULL … Read more
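The same distinction can be reproduced in PySpark (a sketch with an invented DataFrame; == None plays the role of Scala's === null):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([('a',), (None,)], ['c1'])

# col('c1') == None compiles to SQL `c1 = NULL`, which is NULL for every row,
# so the filter keeps nothing:
df.filter(col('c1') == None).count()   # 0

# isNull() performs a proper IS NULL test and finds the missing value:
df.filter(col('c1').isNull()).count()  # 1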

Plot pandas dates in matplotlib

If you use a list containing the column name(s) instead of a string, data.set_index will work. The following should show the dates on the x-axis:

#! /usr/bin/env python
import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_fwf('myfile.log', header=None, names=['time', 'amount'], widths=[27, 5])
data.time = pd.to_datetime(data['time'], format='%Y-%m-%d %H:%M:%S.%f')
data.set_index(['time'], inplace=True)
data.plot()
# OR
plt.plot(data.index, data.amount)

When is it appropriate to use df.value_counts() vs df.groupby('…').count()?

There is a difference. value_counts returns: "The resulting object will be in descending order so that the first element is the most frequently-occurring element." count does not sort that way; it sorts the output by the index (created by the column in groupby('col')). df.groupby('colA').count() aggregates all columns of df with the function count, so it counts values excluding NaNs. So if … Read more
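A short sketch of the difference (toy data, assumed for illustration):

import pandas as pd

df = pd.DataFrame({'colA': ['x', 'y', 'x', 'x'], 'colB': [1, 2, 3, None]})

# value_counts: frequencies, sorted in descending order
print(df['colA'].value_counts())
# x    3
# y    1

# groupby(...).count(): non-NaN counts per column, sorted by the group index
print(df.groupby('colA').count())
#       colB
# colA
# x        2
# y        1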

Confusion about pandas copy of slice of DataFrame warning

izmir = pd.read_excel(filepath)
izmir_lim = izmir[['Gender','Age','MC_OLD_M>=60','MC_OLD_F>=60',
                   'MC_OLD_M>18','MC_OLD_F>18','MC_OLD_18>M>5',
                   'MC_OLD_18>F>5','MC_OLD_M_Child<5','MC_OLD_F_Child<5',
                   'MC_OLD_M>0<=1','MC_OLD_F>0<=1','Date to Delivery',
                   'Date to insert','Date of Entery']]

izmir_lim is a view/copy of izmir. You subsequently attempt to assign to it; this is what is throwing the warning. Use this instead:

izmir_lim = izmir[['Gender','Age','MC_OLD_M>=60','MC_OLD_F>=60',
                   'MC_OLD_M>18','MC_OLD_F>18','MC_OLD_18>M>5',
                   'MC_OLD_18>F>5','MC_OLD_M_Child<5','MC_OLD_F_Child<5',
                   'MC_OLD_M>0<=1','MC_OLD_F>0<=1','Date to Delivery',
                   'Date to insert','Date of Entery']].copy()

Whenever you 'create' … Read more
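The pattern generalizes; a minimal sketch (column names invented) of when the warning fires and how .copy() avoids it:

import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

sub = df[['a']]          # may be a view or a copy of df
sub['a'] = 0             # may raise SettingWithCopyWarning

safe = df[['a']].copy()  # an explicit copy is unambiguous
safe['a'] = 0            # no warning; df is untouched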

Filter Spark DataFrame on string contains

You can use contains (this works with an arbitrary sequence):

df.filter($"foo".contains("bar"))

like (SQL LIKE with SQL simple regular expressions, with _ matching an arbitrary character and % matching an arbitrary sequence):

df.filter($"foo".like("bar"))

or rlike (like with Java regular expressions):

df.filter($"foo".rlike("bar"))

depending on your requirements. LIKE and RLIKE should work with SQL expressions as well.
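For reference, a PySpark sketch of the same three filters (toy DataFrame assumed):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([('foobar',), ('baz',)], ['foo'])

df.filter(col('foo').contains('bar')).show()  # arbitrary substring
df.filter(col('foo').like('%bar%')).show()    # SQL LIKE pattern
df.filter(col('foo').rlike('bar')).show()     # Java regular expression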