```
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as ss
import seaborn
```

# Using the `scipy`

Kolmogorov–Smirnov test¶

A short note on how to use the `scipy`

Kolmogorov–Smirnov test because quite frankly the documentation was not compelling and I always forget how the `scipy.stats`

module handles distributions arguments.

## Set up some test data¶

Using a `beta`

distribution generate `rvs`

from it and save them as `d1`

, then make a second set of data containing the fist set and 50 drawd from a different distribution: this measn `d2`

not `beta`

. We plot the Gaussian KDE to show how large the effect is graphically

```
xvals = np.linspace(0, 0.5, 500)
args = [0.5, 20]
d1 = ss.beta.rvs(args[0], args[1], size=1000)
noise = ss.norm.rvs(0.4, 0.1, size=50)
d2 = np.append(d1[:len(d1)-len(noise)], noise)
fig, (ax1, ax2) = plt.subplots(nrows = 2)
ax1.plot(xvals, ss.gaussian_kde(d1).evaluate(xvals), "-k",
label="d1")
ax1.legend()
ax2.plot(xvals, ss.gaussian_kde(d2).evaluate(xvals), "-k",
label="d2")
ax2.legend()
plt.show()
```

## Now compute KS test¶

The trick here is to give the `scipy.stats.kstest`

the data, then a callable function of the cumulative distribution function. There is some clever way to just pass in a string of the name, but I was unconvinved that the argumements were going to the proper place, so I prefer this method

```
beta_cdf = lambda x: ss.beta.cdf(x, args[0], args[1])
ss.kstest(d1, beta_cdf)
```

```
ss.kstest(d2, beta_cdf)
```

## Conclusion¶

Clearly this works as expected: for `d1`

we get a fairly large $p$-value suggesting we should accept the null hypothesis (the data *is* beta-distributed with the given $a$ and $b$, while for $d2$ we get a low $p$-value suggesting we might discard the null hypothesis (which we know to be true since that was the way the data was built).

It remains unclear how to choose the $p$ threshold.