Photo by Nerfee Mirandilla on Unsplash

Motivation

Hypothesis testing is used in many applications, and the methodology seems quite straightforward. Oftentimes, though, we overlook the underlying assumptions and need to ask: are we comparing apples to oranges?


The question also arises when data scientists decide to discard observations with missing features. Imagine we have features f1, f2, …, fn and a binary target variable y. Assuming many observations have missing information for one or more features, we decide to drop these rows.

By doing so we might have altered the distribution of a feature fk.

Does discarding observations change the distribution of the features?

Is this change significant?

In this article, we present some assumptions of the t-test and show how the Kolmogorov-Smirnov (KS) test can validate or discredit those assumptions. That said, it is crucial to state early on that the t-test and the KS test are testing different things. For each step we will present the theory and implement the code in Python 3.

The t-test assumes that both situations produce normally distributed data and that the two differ only in their average outcome. Consequently, if we apply the t-test to data drawn from a non-normal distribution, we are probably increasing the risk of errors.

Small Datasets With the Same Mean

Consider two randomly generated samples. Both are generated from normal distributions with the same mean; however, visual inspection makes it clear that the samples are different.
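The article's own sample-generating code is not reproduced here. As a minimal sketch, assuming NumPy, a sample size of 20, a shared mean of 0, and standard deviations of 1 and 5 (all illustrative choices, not the original values), two such samples could be generated like this:

```python
import numpy as np

# Illustrative, assumed parameters: the article's original values are not shown.
rng = np.random.default_rng(seed=42)
n = 20

# Both samples share the same true mean but have very different spreads.
sample_a = rng.normal(loc=0.0, scale=1.0, size=n)
sample_b = rng.normal(loc=0.0, scale=5.0, size=n)

# Compare simple summary statistics of the two samples.
print("sample_a: mean =", sample_a.mean(), "std =", sample_a.std(ddof=1))
print("sample_b: mean =", sample_b.mean(), "std =", sample_b.std(ddof=1))
```

Plotting a histogram of each sample would make the difference in spread visible even though the true means coincide.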

A t-test might not pick up on this difference, which could lead us to conclude that both samples are identical.
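A hedged sketch of such a t-test with SciPy follows. The samples, the seed, and the choice of Welch's variant (`equal_var=False`) are assumptions made for illustration, not necessarily what the original code used. Because both samples share the same true mean, the test fails to reject the null hypothesis in the large majority of runs.

```python
import numpy as np
from scipy import stats

# Assumed illustrative samples: same true mean, different spread.
rng = np.random.default_rng(seed=42)
sample_a = rng.normal(loc=0.0, scale=1.0, size=20)
sample_b = rng.normal(loc=0.0, scale=5.0, size=20)

# Two-sample t-test; equal_var=False selects Welch's variant,
# which does not assume that both samples have the same variance.
t_stat, p_value = stats.ttest_ind(sample_a, sample_b, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")

alpha = 0.05  # conventional significance level, chosen here as an assumption
if p_value < alpha:
    print("Reject the null hypothesis of equal means.")
else:
    print("Cannot reject the null hypothesis of equal means.")
```

Note that the t-test only compares means, so a large p-value says nothing about whether the two samples share the same spread or shape.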

We therefore cannot reject the null hypothesis of identical average scores.

Say we generate two datasets that differ in mean, but a non-normal distribution masks the difference. If we knew in advance that the data was not normally distributed, we would not be using the t-test to begin with. With this idea in mind, we introduce a method to check whether our observations come from a reference probability distribution.
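Before moving on, the masked-difference scenario can be sketched as follows. This is only an illustration under stated assumptions: the heavy-tailed Student t distribution with 2 degrees of freedom, the shift of 0.5, and the sample size are hypothetical choices, since the original data-generating code is not shown. Heavy tails inflate the sample variance, so the t-test can fail to detect a genuine shift in location.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=7)
n = 50
shift = 0.5  # assumed true difference in means

# Student t with 2 degrees of freedom: the mean exists (0), but the variance is infinite.
sample_a = rng.standard_t(df=2, size=n)
sample_b = rng.standard_t(df=2, size=n) + shift

# The t-test is applied even though its normality assumption is violated.
t_stat, p_value = stats.ttest_ind(sample_a, sample_b, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```

Rerunning the sketch with different seeds shows how erratic the result can be when the normality assumption does not hold.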

The KS test can be used to compare a sample with a reference probability distribution, or to compare two samples. Suppose we have observations x1, x2, …, xn that we think come from a distribution P.

The standard normal distribution, for example, is known to have a mean of 0 and a standard deviation of 1. More specifically, we will use the Empirical Distribution Function (EDF): an estimate of the cumulative distribution function (CDF) that generated the points in the sample.
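To make the EDF concrete: at a point x, it is the fraction of observations that are less than or equal to x. A minimal sketch, assuming NumPy (the helper name `edf` and the toy observations are hypothetical):

```python
import numpy as np

def edf(sample, x):
    """Empirical distribution function: fraction of observations <= x."""
    sorted_sample = np.sort(np.asarray(sample, dtype=float))
    # side="right" counts how many sorted values are <= each query point.
    counts = np.searchsorted(sorted_sample, x, side="right")
    return counts / sorted_sample.size

observations = [3.0, 1.0, 2.0, 2.0]
query_points = [0.5, 1.0, 2.0, 2.5, 3.0]

# EDF values at the query points: 0.0, 0.25, 0.75, 0.75, 1.0
print(edf(observations, query_points))
```

The EDF is a step function that jumps by 1/n at each observation (by more when values are tied), rising from 0 to 1.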

The usefulness of the CDF is that it uniquely characterizes a probability distribution.

Test if Sample Belongs to Distribution

In the first example, let the null hypothesis be that our samples come from a normal distribution N(0, 1).

We want to compare the empirical distribution function of the observed data with the cumulative distribution function associated with the null hypothesis.
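This comparison can be sketched in two ways: by computing the largest vertical gap between the EDF and the N(0, 1) CDF by hand, and by calling `scipy.stats.kstest`. The sample below is an assumption for illustration (100 draws from a standard normal distribution with a fixed seed); it is not the article's data.

```python
import numpy as np
from scipy import stats

# Assumed illustrative sample; under the null hypothesis it comes from N(0, 1).
rng = np.random.default_rng(seed=0)
sample = rng.normal(loc=0.0, scale=1.0, size=100)

# Manual KS statistic: largest gap between the EDF and the reference CDF.
x = np.sort(sample)
n = x.size
cdf_values = stats.norm.cdf(x)                          # CDF under the null hypothesis
d_plus = np.max(np.arange(1, n + 1) / n - cdf_values)   # EDF above the CDF
d_minus = np.max(cdf_values - np.arange(0, n) / n)      # EDF below the CDF
d_manual = max(d_plus, d_minus)

# The same test with SciPy; "norm" defaults to mean 0 and standard deviation 1.
result = stats.kstest(sample, "norm")
print(f"manual D = {d_manual:.4f}")
print(f"scipy  D = {result.statistic:.4f}, p = {result.pvalue:.4f}")
```

A small p-value would be evidence against the null hypothesis that the sample comes from N(0, 1); a large one means we cannot reject it.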

