data-analysis - w3toppers.com

case_when function from R to Python

You want to use np.select:

    conditions = [
        (df["age"].lt(10)),
        (df["age"].ge(10) & df["age"].lt(20)),
        (df["age"].ge(20) & df["age"].lt(30)),
        (df["age"].ge(30) & df["age"].lt(50)),
        (df["age"].ge(50)),
    ]
    choices = ["baby", "kid", "young", "mature", "grandpa"]
    df["elderly"] = np.select(conditions, choices)

    # Results in:
    #     name  age  preTestScore  postTestScore  elderly
    # 0  Jason   42             4             25   mature
    # 1  Molly   52            24             94  grandpa

… Read more
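A minimal, self-contained sketch of the same np.select pattern. The sample rows (including the hypothetical "Tina") and the "unknown" default are assumptions for illustration. Rows that match no condition receive np.select's default, which is 0 unless overridden; a string default keeps the output consistent with the string choices, and recent NumPy versions may reject mixing the integer default with string choices.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; only the "age" column matters for the mapping.
df = pd.DataFrame({"name": ["Jason", "Molly", "Tina"], "age": [42, 52, 7]})

# One boolean Series per bucket, checked in order; the first match wins.
conditions = [
    df["age"].lt(10),
    df["age"].ge(10) & df["age"].lt(20),
    df["age"].ge(20) & df["age"].lt(30),
    df["age"].ge(30) & df["age"].lt(50),
    df["age"].ge(50),
]
choices = ["baby", "kid", "young", "mature", "grandpa"]

# "unknown" is an assumed fallback label for rows matching no condition.
df["elderly"] = np.select(conditions, choices, default="unknown")
```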

Plot pandas dataframe containing NaNs

The reason you're not seeing anything is that the default plot style is a line only. The line is interrupted at NaNs, so only runs of consecutive non-NaN values are drawn, and your data contains no such runs. You need to change the plotting style, which depends on what you want to see. … Read more
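Two possible style changes, sketched under assumptions: the series below is made-up data in which no two non-NaN values are adjacent, and the Agg backend is chosen only so the snippet runs without a display. Which option is appropriate depends on what you want to see.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no display is needed

import numpy as np
import pandas as pd

# Hypothetical data: every value is separated from the next by a NaN,
# so the default line style would draw nothing.
s = pd.Series([1.0, np.nan, 3.0, np.nan, 2.0])

# Option 1: markers only, so each non-NaN value appears as a point.
ax = s.plot(style="o")

# Option 2: drop the NaNs first, so the remaining points are joined by a line.
s.dropna().plot(ax=ax, style="-")
```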

How do I lag columns in MySQL?

Here is a solution that returns what you want in MySQL:

    SET @a := 0;
    SET @b := 2;

    SELECT r.id, r.value, r.value/r2.value AS 'lag'
    FROM (SELECT if(@a, @a:=@a+1, @a:=1) as rownum, id, value FROM results) AS r
    LEFT JOIN (SELECT if(@b, @b:=@b+1, @b:=1) as rownum, id, value FROM results) AS r2
        ON r.rownum = r2.rownum

MySQL … Read more
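For comparison only, here is a rough pandas analogue of what the query appears to compute. It is a different technique (Series.shift rather than MySQL user variables), not the answer's method. As read here, the second counter starts two ahead of the first, so each row is divided by the value two rows earlier. The `results` frame and its values are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the `results` table.
results = pd.DataFrame(
    {"id": [1, 2, 3, 4, 5], "value": [10.0, 20.0, 30.0, 40.0, 50.0]}
)

# shift(2) moves each value down two rows, so row n is divided by row n-2.
# The first two rows have no earlier partner and become NaN, much like the
# NULLs a LEFT JOIN would leave.
results["lag"] = results["value"] / results["value"].shift(2)
```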

How to merge multiple dataframes

Below is the cleanest, most comprehensible way to merge multiple dataframes when complex queries aren't involved. Simply merge on DATE as the index, using the OUTER method (to get all the data).

    import pandas as pd
    from functools import reduce

    df1 = pd.read_table('file1.csv', sep=',')
    df2 = pd.read_table('file2.csv', sep=',')
    df3 = pd.read_table('file3.csv', sep=',')

Now, … Read more
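The excerpt stops before the merge step, so the following is only one plausible sketch of how reduce and an outer merge on DATE fit together. It is not necessarily the answer's exact code. The three small frames and their column names are hypothetical stand-ins for the CSV files.

```python
from functools import reduce

import pandas as pd

# Hypothetical frames standing in for the three CSV files; each has a DATE column.
df1 = pd.DataFrame({"DATE": ["2020-01-01", "2020-01-02"], "a": [1, 2]})
df2 = pd.DataFrame({"DATE": ["2020-01-02", "2020-01-03"], "b": [3, 4]})
df3 = pd.DataFrame({"DATE": ["2020-01-01", "2020-01-03"], "c": [5, 6]})

frames = [df1, df2, df3]

# Fold the list pairwise: (df1 merged with df2) merged with df3.
# how="outer" keeps every DATE that appears in any frame.
merged = reduce(
    lambda left, right: pd.merge(left, right, on="DATE", how="outer"),
    frames,
)
```

Dates missing from a given frame show up as NaN in that frame's columns.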

Why does one hot encoding improve machine learning performance? [closed]

Many learning algorithms either learn a single weight per feature, or they use distances between samples. The former is the case for linear models such as logistic regression, which are easy to explain. Suppose you have a dataset having only a single categorical feature “nationality”, with values “UK”, “French” and “US”. Assume, without loss of … Read more
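As a small illustration of the encoding itself, the sketch below one-hot encodes a "nationality" column with pandas. The four sample rows are made up. Each category gets its own indicator column, so a model that learns one weight per feature can assign a separate weight to each nationality.

```python
import pandas as pd

# Hypothetical rows using the three categories from the example.
df = pd.DataFrame({"nationality": ["UK", "French", "US", "UK"]})

# One indicator column per category; exactly one is set in each row.
one_hot = pd.get_dummies(df["nationality"])
```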

Fitting polynomial model to data in R

To get a third order polynomial in x (x^3), you can do

    lm(y ~ x + I(x^2) + I(x^3))

or

    lm(y ~ poly(x, 3, raw=TRUE))

You could fit a 10th order polynomial and get a near-perfect fit, but should you? EDIT: poly(x, 3) is probably a better choice (see @hadley below).
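For readers working in Python, a rough analogue of the raw cubic fit can be sketched with NumPy's polyfit. This swaps in a different library and is not the R method above. The data are synthetic and noise-free, an assumption made so that the fit recovers the generating coefficients.

```python
import numpy as np

# Synthetic, noise-free data generated from a known cubic.
x = np.linspace(-2.0, 2.0, 21)
y = 1.0 + 2.0 * x - 0.5 * x**2 + 0.25 * x**3

# Least-squares fit of a degree-3 polynomial.
# Coefficients come back with the highest power first: x^3, x^2, x, constant.
coeffs = np.polyfit(x, y, deg=3)
```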

How do I sum values in a column that match a given condition using pandas?

The essential idea here is to select the data you want to sum, and then sum them. This selection of data can be done in several different ways, a few of which are shown below. Boolean indexing: arguably the most common way to select the values is to use Boolean indexing. With this method, you … Read more
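A minimal sketch of the Boolean indexing approach, assuming a made-up frame with columns "a" and "b" and the condition a == 1:

```python
import pandas as pd

# Hypothetical data: sum column "b" over the rows where column "a" equals 1.
df = pd.DataFrame({"a": [1, 2, 1, 3], "b": [10, 20, 30, 40]})

# Step 1: build a boolean mask selecting the rows of interest.
mask = df["a"] == 1

# Step 2: select column "b" for those rows and sum it.
total = df.loc[mask, "b"].sum()
```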

Peak signal detection in realtime timeseries data

Robust peak detection algorithm (using z-scores): I came up with an algorithm that works very well for these types of datasets. It is based on the principle of dispersion: if a new datapoint lies a given number of standard deviations (its z-score) away from some moving mean, the algorithm signals. The algorithm is … Read more
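The dispersion principle can be sketched as follows. This is a deliberately simplified illustration, not the full algorithm from the answer, which is cut off in the excerpt. The function name, the `lag` and `threshold` parameters, and the sample data are all assumptions. One known limitation of this simplification is that a detected peak enters later windows unchanged and inflates their standard deviation.

```python
import numpy as np


def zscore_signals(y, lag, threshold):
    """Flag points far from the mean of the previous `lag` points.

    Returns an integer array: +1 where a point lies more than `threshold`
    standard deviations above the moving mean, -1 where it lies that far
    below, and 0 otherwise. The first `lag` points are never flagged.
    """
    y = np.asarray(y, dtype=float)
    signals = np.zeros(len(y), dtype=int)
    for i in range(lag, len(y)):
        window = y[i - lag:i]  # the `lag` points preceding y[i]
        mean, std = window.mean(), window.std()
        if abs(y[i] - mean) > threshold * std:
            signals[i] = 1 if y[i] > mean else -1
    return signals


# Hypothetical series with a single obvious spike at index 6.
data = [1, 1.1, 0.9, 1, 1.2, 1, 5, 1, 0.9, 1.1]
signals = zscore_signals(data, lag=5, threshold=3.5)
```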

Python: pandas merge multiple dataframes

How to sort a dataFrame in python pandas by two or more columns?

As of the 0.17.0 release, the sort method was deprecated in favor of sort_values. sort was completely removed in the 0.20.0 release. The arguments (and results) remain the same:

    df.sort_values(['a', 'b'], ascending=[True, False])

In versions before 0.17.0, you can use the ascending argument of sort:

    df.sort(['a', 'b'], ascending=[True, False])

For example:

    In [11]: df1 = pd.DataFrame(np.random.randint(1, 5, (10,2)), columns=['a','b'])

… Read more
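A small deterministic sketch of sort_values with mixed sort directions, using a made-up frame in place of the random one so the result is reproducible:

```python
import pandas as pd

# Hypothetical data with repeated values in "a" so the tie-break on "b" is visible.
df = pd.DataFrame({"a": [2, 1, 2, 1], "b": [1, 3, 4, 2]})

# Sort by "a" ascending, then by "b" descending within each value of "a".
sorted_df = df.sort_values(["a", "b"], ascending=[True, False])
```

The original row labels are preserved, so the index of sorted_df shows how the rows were reordered.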