How to plot empirical cdf (ecdf)

If you like linspace and prefer one-liners, you can do:

plt.plot(np.sort(a), np.linspace(0, 1, len(a), endpoint=False))

Given my tastes, I almost always do:

# a is the data array
x = np.sort(a)
y = np.arange(len(x))/float(len(x))
plt.plot(x, y)

This works for me even when there are more than O(1e6) data values.
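For reuse, the same recipe can be wrapped in a small helper. This is only a sketch: the name `ecdf` and the random sample are my own placeholders, not anything from a library.

```python
import numpy as np

def ecdf(a):
    """Return the sorted data and the heights y = [0, 1/N, ..., (N-1)/N]."""
    x = np.sort(np.asarray(a))
    y = np.arange(len(x)) / len(x)
    return x, y

# Illustrative data only; any 1-D array works.
rng = np.random.default_rng(0)
sample = rng.normal(size=1000)
x, y = ecdf(sample)
# Plot with: plt.plot(x, y)
```

The helper returns the arrays rather than plotting, so the caller can pass them to plt.plot or plt.step as preferred.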
If you really need to downsample, I’d set

x = np.sort(a)[::down_sampling_step]
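A sketch of the downsampled version, assuming for simplicity that the sample size is a multiple of down_sampling_step (otherwise the heights are off by a small rounding amount). The sample size and step below are arbitrary illustrative values. With the y = [0, 1/N, ...] convention, recomputing y from the shortened x gives the same heights as the full-resolution curve at the retained points:

```python
import numpy as np

down_sampling_step = 100
rng = np.random.default_rng(0)
a = rng.normal(size=100_000)  # illustrative data

# Full-resolution curve, for comparison.
x_full = np.sort(a)
y_full = np.arange(len(x_full)) / len(x_full)

# Downsampled curve: keep every down_sampling_step-th sorted value
# and rebuild y from the shortened x exactly as before.
x = x_full[::down_sampling_step]
y = np.arange(len(x)) / len(x)

# The heights agree with the full curve at the retained points.
heights_match = np.allclose(y, y_full[::down_sampling_step])
```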

Edit, in response to a comment/edit asking why I use endpoint=False, or y as defined above. Some technical details follow.

The empirical CDF is usually formally defined as

CDF(x) = "number of samples <= x"/"number of samples"

In order to exactly match this formal definition, you would need to use y = np.arange(1,len(x)+1)/float(len(x)), so that we get y = [1/N, 2/N ... 1]. This is an unbiased estimator that converges to the true CDF in the limit of infinite samples (Wikipedia ref.).
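To make the formal definition concrete, here is a minimal sketch that evaluates it directly with np.searchsorted and checks it against y = [1/N, 2/N ... 1] at the sample points. The function name `ecdf_at` and the four-point data set are made up for illustration.

```python
import numpy as np

def ecdf_at(a, t):
    """Formal ECDF: fraction of samples that are <= t."""
    x = np.sort(np.asarray(a))
    # side="right" counts the elements of x that are <= t.
    return np.searchsorted(x, t, side="right") / len(x)

data = np.array([3.0, 1.0, 2.0, 5.0])
x = np.sort(data)
y_formal = np.arange(1, len(x) + 1) / len(x)  # [1/N, 2/N, ..., 1]

# At the sorted sample points the two agree.
agrees = np.allclose(ecdf_at(data, x), y_formal)
```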

I tend to use y = [0, 1/N, 2/N ... (N-1)/N] since:

(a) it is easier to code/more idiomatic,

(b) it is still formally justified, since one can always exchange CDF(x) with 1-CDF(x) in the convergence proof, and

(c) it works with the (easy) downsampling method described above.

In some particular cases, it is useful to define

y = (np.arange(len(x))+0.5)/len(x)

which is intermediate between these two conventions. In effect, it says “there is a 1/(2N) chance of a value less than the lowest one I’ve seen in my sample, and a 1/(2N) chance of a value greater than the largest one I’ve seen so far.”
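A quick numeric comparison of the three conventions, using an arbitrary N = 4 for illustration (the variable names are mine):

```python
import numpy as np

n = 4  # illustrative sample size

y_low = np.arange(n) / n            # [0, 1/N, ..., (N-1)/N]
y_high = np.arange(1, n + 1) / n    # [1/N, 2/N, ..., 1]
y_mid = (np.arange(n) + 0.5) / n    # midpoint convention

# Probability mass left below the smallest and above the largest sample.
below_min = y_mid[0]
above_max = 1 - y_mid[-1]
```

Here y_mid is the elementwise average of the other two, and both below_min and above_max come out as 1/(2N).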

Note that the choice of convention interacts with the where parameter of plt.step, if it seems more useful to display the CDF as a piecewise constant function. In order to exactly match the formal definition mentioned above, one would need to use where='pre' with the suggested y=[0, 1/N ... 1-1/N] convention, or where='post' with the y=[1/N, 2/N ... 1] convention, but not the other way around.
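A sketch of the two matching pairings, drawn on a matplotlib Axes that the caller passes in. The function name `plot_ecdf_steps` and the labels are my own choices; step, set_xlabel, set_ylabel and legend are standard Axes methods.

```python
import numpy as np

def plot_ecdf_steps(ax, a):
    """Draw the ECDF of `a` on a matplotlib Axes with both matching pairings."""
    x = np.sort(np.asarray(a))
    n = len(x)
    # where="post": y[i] is held on [x[i], x[i+1]), so use y = [1/N, ..., 1].
    ax.step(x, np.arange(1, n + 1) / n, where="post", label="[1/N..1], post")
    # where="pre": y[i] is held on (x[i-1], x[i]], so use y = [0, ..., (N-1)/N].
    ax.step(x, np.arange(n) / n, where="pre", linestyle="--",
            label="[0..(N-1)/N], pre")
    ax.set_xlabel("x")
    ax.set_ylabel("ECDF")
    ax.legend()

# Usage (requires matplotlib):
#   import matplotlib.pyplot as plt
#   fig, ax = plt.subplots()
#   plot_ecdf_steps(ax, a)
#   plt.show()
```

The two curves coincide between sample points, and differ only in the value assigned exactly at each sample point.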

However, for large samples and reasonable distributions, the convention given in the main body of the answer is easy to write, is an asymptotically unbiased estimator of the true CDF (its heights differ from the formal definition by at most 1/N), and works with the downsampling methodology.
