Detect and exclude outliers in a pandas DataFrame

If your DataFrame has multiple columns and you want to remove every row that has an outlier in at least one column, the following expression does it in one step. Here an outlier is a value whose absolute Z-score is 3 or more.

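import numpy as np
import pandas as pd
from scipy import stats

df = pd.DataFrame(np.random.randn(100, 3))

# Keep only rows whose absolute Z-score is below 3 in every column.
df[(np.abs(stats.zscore(df)) < 3).all(axis=1)]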

Description:

  • For each column, it first computes the Z-score of each value in the
    column, relative to the column mean and standard deviation.
  • It then takes the absolute Z-score because the direction does not
    matter, only whether the magnitude is below the threshold (3 here).
  • all(axis=1) ensures that for each row, all columns satisfy the
    constraint.
  • Finally, the result of this condition is used to index the dataframe.
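
The steps above can be sketched one at a time as follows. This is a minimal illustration, not the only way to do it: the column names "a", "b", "c", the random seed, and the planted value 50.0 are assumptions added so the filter has something visible to remove. The Z-scores are wrapped in np.asarray as a defensive choice so the mask is a plain NumPy array.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical data: 100 rows, 3 columns of standard-normal noise.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((100, 3)), columns=["a", "b", "c"])

# Plant one obvious outlier in row 0 (an assumption for illustration).
df.loc[0, "a"] = 50.0

# Step 1 and 2: per-column Z-scores, then their absolute values.
z = np.abs(np.asarray(stats.zscore(df)))

# Step 3: a row is kept only if every column is below the threshold.
keep = (z < 3).all(axis=1)

# Step 4: use the boolean mask to index the DataFrame.
filtered = df[keep]

print(len(df), "rows before,", len(filtered), "rows after")
```

Note that stats.zscore uses the population standard deviation (ddof=0) by default, whereas pandas' DataFrame.std defaults to ddof=1, so a hand-rolled (df - df.mean()) / df.std() gives slightly different scores unless ddof=0 is passed.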

Filter other columns based on a single column

  • Compute the Z-score for one column only, df[0] for example, and remove .all(axis=1). Rows are kept or dropped as a whole, so the other columns are filtered along with it.

df[np.abs(stats.zscore(df[0])) < 3]
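
A small sketch of the single-column case, assuming a hypothetical DataFrame with columns "price" and "qty" (neither name comes from the original). Only "price" is tested for outliers, but the whole row is dropped when it fails.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical data: 19 ordinary prices and one extreme value in the last row.
df = pd.DataFrame({
    "price": [10.0] * 19 + [100.0],
    "qty": list(range(20)),
})

# Z-scores of the "price" column only; the result is one-dimensional,
# so no .all(axis=1) is needed.
z_price = np.abs(np.asarray(stats.zscore(df["price"])))

# Rows with an outlying price are removed, together with their "qty" values.
filtered = df[z_price < 3]

print(filtered.tail())
```

With a threshold of 3, very small samples can never trigger the filter: using the default ddof=0, the largest possible absolute Z-score among n values is sqrt(n - 1), so at least 11 rows are needed before any value can reach 3.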
