How to qcut with non-unique bin edges?

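The problem is that pandas.qcut places the bin edges at quantiles of the data so that each bin holds roughly the same number of records, but all records with the same value must stay in the same bin (this behaviour is consistent with the statistical definition of a quantile). When one value is repeated often enough, two or more quantile edges land on that same value, and qcut raises "ValueError: Bin edges must be unique".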
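A minimal sketch of how the error comes about, assuming a small made-up Series s in which the value 0 makes up 40% of the records (the data and the variable names are illustrative, not from the original question):

```python
import pandas as pd

# Hypothetical sample data: four of the ten records share the value 0.
s = pd.Series([0, 0, 0, 0, 1, 2, 3, 4, 5, 6])

# The quartile edges qcut would use; the first two both fall on 0.
edges = s.quantile([0, 0.25, 0.5, 0.75, 1.0]).tolist()
print(edges)

# With duplicated edges, qcut refuses to build the bins.
error_message = None
try:
    pd.qcut(s, 4)
except ValueError as err:
    error_message = str(err)
print(error_message)
```

Printing the edges first is a quick way to check whether a given number of quantiles will collide on your own data before calling qcut.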

The solutions are:

1 – Use pandas >= 0.20.0, which added the option duplicates='raise'|'drop' to control whether qcut raises on duplicated edges or drops them. The default is 'raise'. With 'drop' you get fewer bins than specified, and some bins hold more elements than others.

2 – Decrease the number of quantiles. Fewer quantiles means more elements per quantile, which makes it less likely that two edges land on the same value.

3 – Rank your data with DataFrame.rank(method='first'). The ranking assigns a unique value (the rank) to each element while keeping the order of the elements. Identical values are ranked in the order they appear in the array, which is what method='first' does, so tied values can end up in different bins.

Example (qcut expects 1-D input, so df here stands for a single column or Series):

pd.qcut(df, nbins)  # raises "ValueError: Bin edges must be unique"

Then use this instead:

pd.qcut(df.rank(method='first'), nbins)

4 – Specify custom quantiles, e.g. [0, .50, .75, 1.], to get an unequal number of items per bin.

5 – Use pandas.cut instead. It chooses bins that are evenly spaced over the range of the values themselves, whereas pandas.qcut chooses the bins so that each one holds the same number of records.
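The five options can be compared side by side in a rough sketch like the one below. It assumes a small made-up Series s with many repeated zeros and nbins = 4. The data and all variable names are illustrative only.

```python
import pandas as pd

# Hypothetical sample data: four of the ten records share the value 0,
# so pd.qcut(s, 4) on its own raises "Bin edges must be unique".
s = pd.Series([0, 0, 0, 0, 1, 2, 3, 4, 5, 6])
nbins = 4

# Option 1: drop duplicated edges (pandas >= 0.20.0); yields fewer bins.
dropped = pd.qcut(s, nbins, duplicates="drop")

# Option 2: ask for fewer quantiles so the edges no longer collide.
fewer = pd.qcut(s, 2)

# Option 3: rank first so every record gets a unique value, then bin the ranks.
# labels=False returns integer bin codes instead of interval labels.
ranked = pd.qcut(s.rank(method="first"), nbins, labels=False)

# Option 4: pass explicit quantiles instead of a bin count.
custom = pd.qcut(s, [0, 0.50, 0.75, 1.0])

# Option 5: equal-width bins over the value range instead of equal-count bins.
equal_width = pd.cut(s, nbins)

for name, result in [("drop", dropped), ("fewer", fewer), ("ranked", ranked),
                     ("custom", custom), ("cut", equal_width)]:
    print(name)
    print(result.value_counts(sort=False))
```

Note the trade-off: ranking is the only option here that keeps both the requested number of bins and near-equal bin sizes, but it does so by splitting tied values across neighbouring bins.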
