Detect and exclude outliers in a pandas DataFrame

If your DataFrame has several columns and you want to remove every row that has an outlier in at least one column, the following expression does that in one step.

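import numpy as np
import pandas as pd
from scipy import stats

df = pd.DataFrame(np.random.randn(100, 3))

# keep only the rows where every column lies within 3 standard deviations of its mean
df[(np.abs(stats.zscore(df)) < 3).all(axis=1)]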

Description:

For each column, it first computes the Z-score of each value in the
column, relative to the column mean and standard deviation.
It then takes the absolute Z-score because the direction does not
matter, only whether the value is below the threshold.
all(axis=1) ensures that for each row, all columns satisfy the
constraint.
Finally, the resulting boolean mask is used to index the DataFrame.
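
The steps described above can be unpacked into separate statements, which may make the one-liner easier to follow. This is only an illustrative sketch: the seeded generator, the planted value 25.0, and the intermediate names (z, manual, within, keep, filtered) are assumptions added here and are not part of the original answer.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical sample data: 100 rows, 3 columns, plus one planted outlier.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((100, 3)))
df.iloc[0, 1] = 25.0

# Step 1: column-wise Z-scores (scipy's default is axis=0, ddof=0).
z = np.asarray(stats.zscore(df))

# The same quantity written out by hand, using the population standard deviation.
manual = (df - df.mean()) / df.std(ddof=0)

# Step 2: absolute value, then compare against the threshold of 3.
within = np.abs(z) < 3

# Step 3: a row is kept only if every column passes.
keep = within.all(axis=1)

# Step 4: boolean indexing keeps the rows where the mask is True.
filtered = df[keep]
```

Note that stats.zscore uses ddof=0 by default, whereas pandas' std() defaults to ddof=1, so the hand-written version has to pass ddof=0 to match.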

Filter other columns based on a single column

Pass a single column to zscore, df[0] for example, and remove .all(axis=1), because the comparison then already yields one boolean per row.

df[(np.abs(stats.zscore(df[0])) < 3)]
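
To see how the single-column filter differs from the all-columns one, here is a small sketch in which one row has an extreme value in column 0 and another row has one in column 2. The data, the planted values, and the names mask and by_col0 are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical sample data with two planted outliers in different columns.
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.standard_normal((100, 3)))
df.iloc[0, 0] = 25.0  # extreme value in column 0
df.iloc[1, 2] = 25.0  # extreme value in column 2

# Only column 0 is tested, so the mask has one boolean per row already.
mask = np.asarray(np.abs(stats.zscore(df[0])) < 3)

# All columns of the surviving rows are returned, not just column 0.
by_col0 = df[mask]
```

Row 0 is dropped because of its value in column 0, while row 1 stays even though its value in column 2 is extreme, since that column is not part of the test.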
