%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as ss
import seaborn
scipy Kolmogorov–Smirnov test¶
A short note on how to use the scipy Kolmogorov–Smirnov test, because quite frankly the documentation is not compelling and I always forget how the scipy.stats module handles distribution arguments.
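As a quick refresher on how scipy.stats passes distribution arguments (the values here are illustrative, not from the note): shape parameters go first, positionally, and you can either pass them on every call or "freeze" them once.

```python
import scipy.stats as ss

# Shape parameters of a beta distribution (illustrative values).
a, b = 0.5, 20

# Direct call: the evaluation point comes first, then the shape parameters.
p_direct = ss.beta.cdf(0.1, a, b)

# Frozen distribution: fix the parameters once and reuse the object.
frozen = ss.beta(a, b)
p_frozen = frozen.cdf(0.1)

print(p_direct, p_frozen)  # the two forms give identical values
```

The frozen form is handy when the same parameterised distribution is needed in several places, e.g. as the reference distribution of a KS test.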
Set up some test data¶
Choose a beta distribution, generate rvs from it and save them as d1, then make a second set of data, d2, containing the first set plus 50 draws from a normal distribution: this means d2 is not beta-distributed. We plot the Gaussian KDEs to show graphically how large the effect is.
xvals = np.linspace(0, 0.5, 500)
args = [0.5, 20]
d1 = ss.beta.rvs(*args, size=1000)
noise = ss.norm.rvs(0.4, 0.1, size=50)
d2 = np.append(d1[:len(d1) - len(noise)], noise)

fig, (ax1, ax2) = plt.subplots(nrows=2)
ax1.plot(xvals, ss.gaussian_kde(d1).evaluate(xvals), "-k", label="d1")
ax1.legend()
ax2.plot(xvals, ss.gaussian_kde(d2).evaluate(xvals), "-k", label="d2")
ax2.legend()
plt.show()
Now compute the KS test¶
The trick here is to give scipy.stats.kstest the data, then a callable cumulative distribution function. There is a clever way to just pass in the distribution name as a string, but I was unconvinced that the arguments were going to the proper place, so I prefer this method.
beta_cdf = lambda x: ss.beta.cdf(x, *args)
print(ss.kstest(d1, beta_cdf))
print(ss.kstest(d2, beta_cdf))
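For completeness, the string form mentioned above does route the shape parameters correctly: kstest accepts a distribution name together with an `args` tuple. A self-contained sketch (regenerating the data with a seed so it runs on its own) showing the two forms agree:

```python
import numpy as np
import scipy.stats as ss

rng = np.random.default_rng(0)
a, b = 0.5, 20
d1 = ss.beta.rvs(a, b, size=1000, random_state=rng)

# Form 1: a callable CDF with the parameters baked in.
res_callable = ss.kstest(d1, lambda x: ss.beta.cdf(x, a, b))

# Form 2: the distribution name as a string, parameters via `args`.
res_string = ss.kstest(d1, "beta", args=(a, b))

print(res_callable.pvalue, res_string.pvalue)
```

Both calls evaluate the same CDF internally, so the statistics and p-values match.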
Clearly this works as expected: for d1 we get a fairly large $p$-value, suggesting we should not reject the null hypothesis (the data is beta-distributed with the given $a$ and $b$), while for d2 we get a low $p$-value, suggesting we should reject the null hypothesis (which we know to be false, since that is how the data were built).
It remains unclear how to choose the $p$ threshold.
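One common, if arbitrary, convention is to fix a significance level $\alpha$ in advance (often 0.05) and reject the null when $p < \alpha$; this bounds the rate of false rejections when the null is actually true. A minimal self-contained sketch of that decision rule (the seed and values here are illustrative):

```python
import numpy as np
import scipy.stats as ss

rng = np.random.default_rng(1)
a, b = 0.5, 20
d1 = ss.beta.rvs(a, b, size=1000, random_state=rng)

# Conventional (but arbitrary) significance level: reject when p < alpha.
alpha = 0.05
stat, pvalue = ss.kstest(d1, "beta", args=(a, b))
print(f"KS statistic={stat:.3f}, p={pvalue:.3f}, reject null: {pvalue < alpha}")
```

Since d1 really is drawn from the stated beta distribution, this should fail to reject about 95% of the time; the 5% of false rejections is exactly the error rate $\alpha$ controls.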