case_when function from R to Python

You want to use np.select:

    conditions = [
        (df["age"].lt(10)),
        (df["age"].ge(10) & df["age"].lt(20)),
        (df["age"].ge(20) & df["age"].lt(30)),
        (df["age"].ge(30) & df["age"].lt(50)),
        (df["age"].ge(50)),
    ]
    choices = ["baby", "kid", "young", "mature", "grandpa"]

    df["elderly"] = np.select(conditions, choices)

    # Results in:
    #     name  age  preTestScore  postTestScore  elderly
    # 0  Jason   42             4             25   mature
    # 1  Molly   52            24             94  grandpa
    # …
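For a version that runs on its own, here is a minimal sketch using a small made-up DataFrame. The column name `group`, the three age bands, and the sample rows are assumptions for illustration only. It also passes an explicit string `default`, since recent NumPy versions may raise a TypeError when string choices are combined with the implicit default of 0.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, not taken from the original answer.
df = pd.DataFrame({"name": ["Ann", "Bob", "Cy"], "age": [8, 25, 61]})

# Conditions are checked in order; the first one that is True wins.
conditions = [
    df["age"].lt(10),
    df["age"].ge(10) & df["age"].lt(30),
    df["age"].ge(30),
]
choices = ["child", "young", "older"]

# A string default keeps the result dtype consistent with the choices.
df["group"] = np.select(conditions, choices, default="unknown")
```

Rows that match none of the conditions receive the `default` value, which plays the role of the final catch-all branch in R's case_when.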

How to merge multiple dataframes

Below is the cleanest, most comprehensible way of merging multiple dataframes when complex queries aren't involved. Simply merge with DATE as the index, using the OUTER method (to get all the data).

    import pandas as pd
    from functools import reduce

    df1 = pd.read_table('file1.csv', sep=',')
    df2 = pd.read_table('file2.csv', sep=',')
    df3 = pd.read_table('file3.csv', sep=',')

Now, …
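The excerpt stops before the merge itself, so the following is only a sketch of how an outer merge on DATE could be chained with reduce. The in-memory frames, the value columns `a`, `b`, `c`, and the variable names `frames` and `merged` are assumptions standing in for the CSV files above.

```python
from functools import reduce

import pandas as pd

# Hypothetical stand-ins for the three CSV files.
df1 = pd.DataFrame({"DATE": ["2020-01-01", "2020-01-02"], "a": [1, 2]})
df2 = pd.DataFrame({"DATE": ["2020-01-02", "2020-01-03"], "b": [3, 4]})
df3 = pd.DataFrame({"DATE": ["2020-01-01", "2020-01-03"], "c": [5, 6]})

frames = [df1, df2, df3]

# Fold the list pairwise: each step outer-merges the running result
# with the next frame on the shared DATE column.
merged = reduce(
    lambda left, right: pd.merge(left, right, on="DATE", how="outer"),
    frames,
)
```

With how="outer", every DATE that appears in any frame is kept, and values missing from a given frame show up as NaN.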

Why does one hot encoding improve machine learning performance? [closed]

Many learning algorithms either learn a single weight per feature, or they use distances between samples. The former is the case for linear models such as logistic regression, which are easy to explain. Suppose you have a dataset having only a single categorical feature “nationality”, with values “UK”, “French” and “US”. Assume, without loss of …
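To make the "nationality" example concrete, here is a small sketch of one-hot encoding with pandas. The four sample rows and the variable name `encoded` are assumptions for illustration. Each category becomes its own 0/1 column, so a linear model can learn a separate weight per nationality instead of one weight for an arbitrary integer code.

```python
import pandas as pd

# Hypothetical sample with the three values named in the text.
df = pd.DataFrame({"nationality": ["UK", "French", "US", "UK"]})

# One indicator column per category; dtype=int gives 0/1 values.
encoded = pd.get_dummies(df["nationality"], dtype=int)
```

Every row of `encoded` contains exactly one 1, in the column matching that row's nationality.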

Peak signal detection in realtime timeseries data

Robust peak detection algorithm (using z-scores)

I came up with an algorithm that works very well for these types of datasets. It is based on the principle of dispersion: if a new datapoint lies a given number x of standard deviations (its z-score) away from some moving mean, the algorithm signals. The algorithm is …
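The dispersion principle described above can be sketched in a few lines. This is a simplified illustration, not the full algorithm from the answer (the excerpt is cut off before its details): the function name `zscore_signals`, the parameters `lag` and `threshold`, and the sample data are all assumptions.

```python
from statistics import mean, pstdev


def zscore_signals(values, lag=5, threshold=3.0):
    """Return +1/-1 where a point deviates strongly from the moving mean, else 0."""
    signals = [0] * len(values)
    for i in range(lag, len(values)):
        window = values[i - lag:i]      # the `lag` points before the current one
        avg = mean(window)              # moving mean
        std = pstdev(window)            # moving standard deviation
        deviation = values[i] - avg
        # Signal when the point is more than `threshold` standard deviations away.
        if std > 0 and abs(deviation) > threshold * std:
            signals[i] = 1 if deviation > 0 else -1
    return signals


# Hypothetical series with a single spike at index 6.
data = [1.0, 1.1, 0.9, 1.0, 1.2, 1.0, 5.0, 1.1, 0.9, 1.0]
signals = zscore_signals(data, lag=5, threshold=3.0)
```

In this sketch the spike itself enters the following windows and inflates their mean and standard deviation, which is one reason a more robust variant would limit how much a signalled point influences the moving statistics.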

How to sort a dataFrame in python pandas by two or more columns?

As of the 0.17.0 release, the sort method was deprecated in favor of sort_values, and sort was removed completely in the 0.20.0 release. The arguments (and results) remain the same:

    df.sort_values(['a', 'b'], ascending=[True, False])

In versions before 0.17.0, you can use the ascending argument of sort:

    df.sort(['a', 'b'], ascending=[True, False])

For example:

    In [11]: df1 = pd.DataFrame(np.random.randint(1, 5, (10, 2)), columns=['a', 'b'])

…
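Because the example above uses random data, here is a small deterministic sketch of the same sort_values call. The four sample rows and the variable name `result` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical fixed data so the ordering is easy to follow.
df = pd.DataFrame({"a": [2, 1, 2, 1], "b": [1, 3, 4, 2]})

# Sort by 'a' ascending, then break ties by 'b' descending.
result = df.sort_values(["a", "b"], ascending=[True, False])
```

sort_values returns a new DataFrame and leaves `df` unchanged; the original row labels are kept, so they can be used to trace where each row came from.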