That ain't dancing Sally!

Same difference.

Congratulations! Your simulation code from the previous post has impressed all and sundry and you've been asked to teach introductory statistics to first year students at the prestigious Vandelay University.

You've got 2 stats classes, one with a group of 65 students, and another with a group of 35 students. We assume that the students have been randomly allocated to each group, and that they share the same final exam.

The difference between the groups is the teaching technique employed. With group1 you teach a standard stats course, while with group2 you sometimes use interpretive dance instead of equations to explain statistical concepts.


Let's jump right in.

from math import *
import numpy as np

# suppress warning messages.
import warnings
warnings.filterwarnings('ignore')

# import the scipy module which contains scientific and stats functions.
import scipy.stats as stats

# usual plotting stuff.
import matplotlib.pyplot as plt
from matplotlib import cm
%matplotlib inline

# set the matplotlib style 
plt.style.use("seaborn-darkgrid")

from IPython.core.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

Here are the marks (out of 100) in the final exam for these two groups.

group1 = np.array([57, 65, 54, 53, 81, 75, 61, 63, 68, 73, 71, 45, 63, 62, 74, 69, 37,
       55, 78, 55, 54, 65, 95, 48, 53, 61, 63, 65, 60, 74, 52, 50, 33, 39,
       100, 66, 46, 61, 99, 63, 34, 57, 43, 49, 45, 63, 64, 51, 63, 74, 60,
       62, 59, 71, 74, 49, 42, 60, 68, 52, 69, 59, 60, 67, 69])

group2 = np.array([40, 50, 25, 43, 37, 36, 62, 58, 58, 62, 45, 44, 40, 43, 53, 51, 38,
       59, 55, 62, 71, 40, 60, 49, 61, 53, 74, 49, 73, 51, 59, 52, 98, 57,
       52])
print(group1.size, group2.size)
65 35

When you compare the marks between the two groups you notice that, on average, the marks in group2 are lower than those in group1.

diff = group1.mean() - group2.mean()
print(diff)
7.934065934065934

We can visualise the distribution of marks in both groups using a box plot.

fig = plt.figure(figsize=(8, 6))
plt.boxplot((group1, group2))
plt.xlabel("group", fontsize=20)
plt.xticks(fontsize=14)
plt.ylabel("mark", fontsize=20)
plt.yticks(fontsize=14)
plt.show()

[figure: box plot of final exam marks for the two groups]

Is this difference in means statistically significant (i.e., should you drop the interpretive dance routine)?

Statistics should be able to provide us with an answer. Thankfully, some of my best friends are statisticians. I found one of those magnificent statistical creatures, and I asked them about differences between means of two groups. They mumbled something about beer, students, and tea test or something… I couldn’t tell what they were saying so I went to scipy instead and it turns out there’s an app for that.

The t-test is a statistical test that can tell us whether the difference in means between our two groups is statistically significant.

First, let us observe that the two groups have similar variances (this allows us to run a particular flavour of the test).

group1.var(), group2.var()
(179.0248520710059, 175.55102040816323)

Close enough for us. Let’s run the test.

t, p = stats.ttest_ind(group1, group2, equal_var=True)

print(f'Probability that a difference at least as extreme as {diff:0.2f} is due to chance (t test): {p*100:.2f}%')
Probability that a difference at least as extreme as 7.93 is due to chance (t test): 0.60%
But what does it all mean?

To the simulation!

We are trying to test whether there is a genuine (statistically significant) difference between the two groups.

One way that we can test this is to estimate how likely we are to observe a difference between the means of the two groups of at least 7.93, if we assume that there’s no difference in marks between the two groups (null hypothesis).

We can accomplish that by pooling all the values together and randomly shuffling (assigning) 65 of these values to group1 and the rest to group2.

Using this shuffling scheme, we will get an average difference between the two groups around zero, and the spread of the values we get will tell us how extreme (i.e., unlikely) a value of 7.93 or larger would be under the null hypothesis.

Let’s randomly shuffle the data across the two groups and compute the difference in means 100,000 times. (this may take a moment)

N = 100000

np.random.seed(12345)

# Let's pool the marks together
a = np.concatenate((group1, group2))

i = group1.size

L = []
for _ in range(N):
    # shuffle the data using random permutation (most useless code comment ever!)
    shuffle = np.random.permutation(a)
    
    # split the shuffled data into 2 groups
    grp1, grp2 = shuffle[:i], shuffle[i:]
    
    # compute the difference in means
    L.append(np.mean(grp1) - np.mean(grp2))
    
L = np.array(L)
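
(An aside: if the loop feels slow, the same computation can be vectorized. A minimal sketch, assuming we're happy to hold one random sort key per mark per trial in memory; argsorting random keys yields an independent permutation per row.)

rng = np.random.default_rng(12345)
keys = rng.random((N, a.size))       # one random key per mark per trial
perms = a[np.argsort(keys, axis=1)]  # each row is an independently shuffled copy of the pooled marks
L_vec = perms[:, :i].mean(axis=1) - perms[:, i:].mean(axis=1)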

Let’s plot a histogram of the results.

plt.figure(figsize=(12, 6))
plt.hist(L, bins=50, density=False, edgecolor='w')
plt.title('Difference between group means', fontsize=18)
plt.axvline(x=diff, ymin=0, ymax=1, color='r')
plt.annotate("Observed\ndifference\n($7.93$)", xy=(8.5, 5000), color='r', fontsize=14)
plt.xlabel("difference in means", fontsize=20)
plt.xticks(fontsize=14)
plt.ylabel("count", fontsize=20)
plt.yticks(fontsize=14)
plt.show()

[figure: histogram of the 100,000 simulated differences in means, with the observed difference (7.93) marked in red]

On the histogram, we see that the observed difference is quite far from the mode of the distribution.

In other words, it appears that a difference of 7.93 or more (in magnitude) does not occur very often. Let’s quantify this.

Let's compute the proportion of simulated trials where the (absolute value of the) difference exceeds the observed difference.

pSim = np.mean((np.abs(L) > diff))
print(f'Probability that the difference at least as extreme as {diff:0.2f} is due to chance (Simulation): {pSim*100:.2f}%')
Probability that the difference at least as extreme as 7.93 is due to chance (Simulation): 0.58%
This is not too bad considering the t-test gave us 0.60%.

This result means that if we assume that the two groups are sampled from the same population of students, the probability of observing a difference in means of at least 7.93 between the groups just by random chance is only around 0.6%.

It is quite typical for the threshold for statistical significance to be set at 5%. Therefore, in this case, we’d conclude that the difference between the two groups is statistically significant. In other words, the teaching method has an impact on the marks. You might want to put that leotard away, stop the gyrations cause that ain’t dancing Sally!

Happy birthday to you!


The Birthday Problem is a classic problem in probability theory. We will use it to illustrate how to use simulation to get an estimate of the probability of some event occurring.

Imagine that you are at a party with $N$ people (including yourself). Let's assume that we have 365 days in a year and that the birthdays of the people attending the party are uniformly distributed, meaning that each of the 365 days is equally probable as a birthday (if you were born on February 29, you'll need to go to another party).

How large does $N$ need to be for the probability of having at least 2 people share a birthday to be at least 50%? What about 90%?

(Note that this is different from asking for the probability of anyone sharing a birthday with a given person)

We can easily work out the edge cases. We know that if we have 1 person at the party, the probability of 2 or more people sharing a birthday is 0, and if we have 366 or more people, that probability will be 1 owing to the pigeonhole principle.

It also makes intuitive sense that the probability of shared birthdays increases monotonically as the number of guests increases; therefore, we know that we will reach the 50% threshold somewhere between 2 and 366 guests. Not a great guesstimate by any means, but nonetheless, it gives us a starting point.

This is the kind of problem that can be solved using probability theory, however, let's see if we can get to the answer using simulation.

Let's break the problem down into pieces.

from math import *
import numpy as np

# suppress warning messages.
import warnings
warnings.filterwarnings('ignore')

# import the scipy module which contains scientific and stats functions.
import scipy as sp

import matplotlib.pyplot as plt
from matplotlib import cm
%matplotlib inline

# set the matplotlib style 
plt.style.use("seaborn-darkgrid")

from IPython.core.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

First, we define a numpy array called days that contains the day numbers for the year (1 to 365).

days = np.arange(1, 366)
days
array([  1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,  13,
        14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,  26,
        27,  28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,  39,
        40,  41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,  52,
        53,  54,  55,  56,  57,  58,  59,  60,  61,  62,  63,  64,  65,
        66,  67,  68,  69,  70,  71,  72,  73,  74,  75,  76,  77,  78,
        79,  80,  81,  82,  83,  84,  85,  86,  87,  88,  89,  90,  91,
        92,  93,  94,  95,  96,  97,  98,  99, 100, 101, 102, 103, 104,
       105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117,
       118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130,
       131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143,
       144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156,
       157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169,
       170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182,
       183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195,
       196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208,
       209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221,
       222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234,
       235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247,
       248, 249, 250, 251, 252, 253, 254, 255, 256, 257, 258, 259, 260,
       261, 262, 263, 264, 265, 266, 267, 268, 269, 270, 271, 272, 273,
       274, 275, 276, 277, 278, 279, 280, 281, 282, 283, 284, 285, 286,
       287, 288, 289, 290, 291, 292, 293, 294, 295, 296, 297, 298, 299,
       300, 301, 302, 303, 304, 305, 306, 307, 308, 309, 310, 311, 312,
       313, 314, 315, 316, 317, 318, 319, 320, 321, 322, 323, 324, 325,
       326, 327, 328, 329, 330, 331, 332, 333, 334, 335, 336, 337, 338,
       339, 340, 341, 342, 343, 344, 345, 346, 347, 348, 349, 350, 351,
       352, 353, 354, 355, 356, 357, 358, 359, 360, 361, 362, 363, 364,
       365])

Assuming there are $N=50$ people at the party, let’s create an array guests_bday that will contain $N$ day numbers randomly drawn (with replacement) from the array days.

We’ll use the random seed 42 to ensure that our results are reproducible.

N = 50
np.random.seed(42)

guests_bday = np.random.choice(days, N)
guests_bday
array([103, 349, 271, 107,  72, 189,  21, 103, 122, 215, 331,  88, 100,
       360, 152, 131, 150, 309, 258, 344, 294, 192, 277, 161, 314,  22,
       253, 236, 345,  49,  59, 170, 188, 271, 190, 175,  51, 364,  55,
       244, 320, 131, 307, 135,  21, 329, 167, 274,  89, 316])

The number of unique birthdays in guests_bday is given by

num_unique = np.unique(guests_bday).size
num_unique
46

In this example, we have 46 unique birthdays for 50 guests.

Now that we know how to compute the number of unique birthdays for a particular sample of $N$ guests, let’s write a function shared_birthdays that takes integers $N$ and n_trials as parameters, and returns the proportion of trials where two or more people share a birthday.

def shared_birthdays(N, n_trials=1000, seed=42):
    np.random.seed(seed)
    return np.mean([np.unique(np.random.choice(days, size=N)).size < N for _ in range(n_trials)])
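
As a quick sanity check before we scale up (the classic result, which we'll recover below, is that 23 guests are enough to pass 50%):

shared_birthdays(23)  # should land near the exact value of ~0.507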

Rather than manually changing the number of people at the party, let’s consider all possibilities between 0 and 100 at once (Hang on! A party with 0 people? “It happens”, the poor wretch replied, quickly looking away to wipe a tear with a désinvolture that belied his inner turmoil…).

number_of_people = range(101)
sim = np.array([shared_birthdays(N) for N in number_of_people])

We can now find the threshold above which the probability of a shared birthday exceeds 50%.

idx = np.where(sim > 0.5)[0][0]
number_of_people[idx]
23

According to our computation, if we have 23 or more people at the party, the probability of at least 2 people sharing a birthday is larger than 50%.

Similarly, to reach the 90% threshold…

idx = np.where(sim > 0.9)[0][0]
number_of_people[idx]
41

… we need at least 41 guests.

Let’s plot the result of the simulation

fig = plt.figure(figsize=(12, 6))
plt.plot(number_of_people, sim, 'g')
plt.axhline(y=0.5, c='r', alpha=0.5, linestyle='--')
plt.axhline(y=0.9, c='r', alpha=0.5, linestyle='--')
plt.xticks(number_of_people[::5], fontsize=14)
plt.yticks(np.arange(0, 1.05, 0.1), fontsize=14)
plt.ylim(0, 1.01)
plt.annotate("50%", xy = (93, 0.45), fontsize=14)
plt.annotate("90%", xy = (93, 0.85), fontsize=14)
plt.xlabel('number of guests', fontsize=16)
plt.ylabel('probability', fontsize=16)
plt.title('Probability of at least 2 people sharing a birthday', fontsize=18)
plt.show()

[figure: probability of at least 2 people sharing a birthday vs number of guests, with the 50% and 90% thresholds marked]

So how can we check these results against the exact probability? When trying to determine the probability of a particular event occurring, it's often useful to consider the probability of the event __not__ occurring. Since these two events are complementary (exactly one of them must happen), their probabilities must sum to 1.

In our case, given N < 366 guests, the probability of no two guests sharing a birthday can be computed in the following way.

Assume that no two guests share a birthday. Let's consider a guest, whom we'll label guest1. This guest will take one "birthday slot" out of 365. We then choose another guest, guest2, whose birthday must fall on one of the 364 unoccupied slots left. We then choose another guest, guest3, whose birthday must fall on one of the 363 unoccupied slots left. And on and on, till we choose guestN, whose birthday must fall on one of the (365-N+1) unoccupied slots left (recall that N < 366).

We can summarise this by saying that there are $$365\times364\times\cdots\times(365-N+1)=\frac{365!}{(365-N)!}$$ possible combinations of guests and birthdays that don't lead to a shared birthday.

Since we're interested in probabilities (proportions), not in the number of configurations, we need to divide this number by the total number of birthday configurations for $N$ guests, which is just $365^N$ (each guest can have any birthday independently of the others). However, recall that this is the probability of no shared birthdays. To get the result we want we need to subtract this probability from 1. Hence, the exact (under the assumptions of the problem) probability of at least 2 people among $N$ guests sharing a birthday is:$$1 - \frac{365!}{365^N(365-N)!}$$

Let's compute this probability.

def exact_probability_shared_bday(N):
    return 1 - factorial(365)/factorial(365-N)/365**N

exact = np.array([exact_probability_shared_bday(N) for N in number_of_people])

idx = np.where(exact > 0.5)[0][0]
print(f"We need {number_of_people[idx]} guests to reach the 50% threshold")
idx = np.where(exact > 0.9)[0][0]
print(f"We need {number_of_people[idx]} guests to reach the 90% threshold")
We need 23 guests to reach the 50% threshold
We need 41 guests to reach the 90% threshold
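
A side note: factorial(365) is an integer with several hundred digits. Python's arbitrary-precision integers cope just fine, but if we wanted to avoid forming such a monster, an equivalent running-product sketch (the same formula, just rearranged) would be:

def exact_probability_shared_bday_stable(N):
    # same quantity via a running product: 1 - prod over j of (365 - j)/365
    prob_no_share = 1.0
    for j in range(N):
        prob_no_share *= (365 - j) / 365
    return 1 - prob_no_share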

We get the same results as with the simulation!

Let’s plot the two results on the same graph.

fig = plt.figure(figsize=(12, 6))
plt.plot(number_of_people, sim, 'g', lw=2, alpha=0.8, label='simulation')
plt.plot(number_of_people, exact, 'k--', lw=1, label='exact')
plt.axhline(y=0.5, c='r', alpha=0.5, linestyle='--')
plt.axhline(y=0.9, c='r', alpha=0.5, linestyle='--')
plt.xticks(number_of_people[::5], fontsize=14)
plt.yticks(np.arange(0, 1.05, 0.1), fontsize=14)
plt.ylim(0, 1.01)
plt.annotate("50%", xy = (93, 0.45), fontsize=14)
plt.annotate("90%", xy = (93, 0.85), fontsize=14)
plt.xlabel('number of guests', fontsize=16)
plt.ylabel('probability', fontsize=16)
plt.title('Probability of at least 2 people sharing a birthday', fontsize=18)
plt.legend(loc=4, fontsize=16)
plt.show()

[figure: simulated and exact probabilities of at least 2 people sharing a birthday vs number of guests]

Can we generalise our simulation beyond pairs?

The first thing to notice is that our initial strategy isn’t going to work anymore.

np.unique(np.random.choice(days, size=N)).size < N basically checks whether we have more guests than unique birthdays. This is an easy way to check whether at least two guests share a birthday, but it will fail to produce the correct answer if we’re looking to find out whether at least M guests share a birthday, with M > 2.

Instead of looking at the difference between the number of guests and the number of unique birthdays, let’s count how many guests share each birthday directly. We can do that with numpy’s bincount function.

While we’re at it, we might as well look at variations of this problem: the probability that the maximum number of people sharing a birthday is M, or the probability of having at most M people share a birthday.

First, let’s illustrate how bincount works.

np.random.seed(42)
a = np.random.randint(0, 10, 20)
a
array([6, 3, 7, 4, 6, 9, 2, 6, 7, 4, 3, 7, 7, 2, 5, 4, 1, 7, 5, 1])
np.bincount(a)
array([0, 2, 2, 2, 3, 2, 3, 5, 0, 1])

This gives us the number of occurrences of each value from 0 to 9 in a.

We can make things more explicit in the following way:

for value, count in zip(np.arange(a.max()+1), np.bincount(a)):
    print(f"The value {value} appears {count} times in a.")
The value 0 appears 0 times in a.
The value 1 appears 2 times in a.
The value 2 appears 2 times in a.
The value 3 appears 2 times in a.
The value 4 appears 3 times in a.
The value 5 appears 2 times in a.
The value 6 appears 3 times in a.
The value 7 appears 5 times in a.
The value 8 appears 0 times in a.
The value 9 appears 1 times in a.
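
To see concretely why the earlier unique-count trick can't tell a trio from two pairs, while bincount can (an illustrative aside):

two_pairs = np.array([1, 1, 2, 2])  # maximum sharing: 2
one_trio = np.array([1, 1, 1, 2])   # maximum sharing: 3
print(np.unique(two_pairs).size, np.unique(one_trio).size)        # 2 2 -> indistinguishable
print(np.bincount(two_pairs).max(), np.bincount(one_trio).max())  # 2 3 -> the trio stands out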

Let’s use bincount to count the largest number of people sharing a birthday in each of our trials.

def at_least_M_shared_birthdays(N, M=2, n_trials=1000, seed=42):
    np.random.seed(seed)
    return np.mean([np.bincount(np.random.choice(days, size=N)).max() >= M for _ in range(n_trials)])

def at_most_M_shared_birthdays(N, M=2, n_trials=1000, seed=42):
    np.random.seed(seed)
    # the only difference with the previous function is >= becomes <=
    return np.mean([np.bincount(np.random.choice(days, size=N)).max() <= M for _ in range(n_trials)])

def a_maximum_of_M_shared_birthdays(N, M=2, n_trials=1000, seed=42):
    np.random.seed(seed)
    # the only difference with the first function is >= becomes ==
    return np.mean([np.bincount(np.random.choice(days, size=N)).max() == M for _ in range(n_trials)])

# put the three functions in a dictionary to make it easier to choose between them later.
shared_bday = {}
shared_bday['at least']      = at_least_M_shared_birthdays
shared_bday['at most']       = at_most_M_shared_birthdays
shared_bday['a maximum of']  = a_maximum_of_M_shared_birthdays

We create a utility function to run the simulation and plot the results.

Recall that M is the number of people sharing a birthday.

def plot_shared_birthday_prob(M=2, max_num_guests=100, how='at least'):

    number_of_guests = np.arange(1, max_num_guests+1)
    
    method = shared_bday[how] 

    sim = np.array([method(N, M=M) for N in number_of_guests]) 
    
    # let's get the 0.5 and 0.9 thresholds if they are reached.
    try:
        threshold50 = number_of_guests[np.where(sim > 0.5)[0][0]] 
    except IndexError:
        threshold50 = None
        
    try:
        threshold90 = number_of_guests[np.where(sim > 0.9)[0][0]] 
    except IndexError:
        threshold90 = None   
    
    step = ceil(max_num_guests / 10)
    
    fig = plt.figure(figsize=(12, 6))
    plt.plot(number_of_guests, sim, 'g')
    plt.axhline(y=0.5, c='r', alpha=0.5, linestyle='--')
    plt.axhline(y=0.9, c='r', alpha=0.5, linestyle='--')
    plt.xticks(np.array(number_of_guests)[::step]-1, fontsize=14)
    plt.xlim(1, max_num_guests+1)
    plt.yticks(np.arange(0, 1.05, 0.1), fontsize=14)
    plt.ylim(0, 1.01)
    plt.xlabel('number of guests', fontsize=16)
    plt.ylabel('probability', fontsize=16)
    plt.title(f'Probability of {how} {M} people sharing a birthday', fontsize=18)

    if threshold50 is not None and how == "at least":
        print(f"50% threshold for {M} shared birthdays reached for {threshold50} guests.")
    if threshold90 is not None and how == "at least":
        print(f"90% threshold for {M} shared birthdays reached for {threshold90} guests.")
    
    return fig

As a sanity check, let’s redo the computation for pairs (M=2).

fig = plot_shared_birthday_prob(M=2, max_num_guests=100)
50% threshold for 2 shared birthdays reached for 23 guests.
90% threshold for 2 shared birthdays reached for 41 guests.

[figure: probability of at least 2 people sharing a birthday vs number of guests]

Looking good!

With this new code we can now ask a different question.

What is the probability that we have at most 2 people sharing a birthday?

fig = plot_shared_birthday_prob(M=2, max_num_guests=100, how='at most')

[figure: probability of at most 2 people sharing a birthday vs number of guests]

This is a monotonically decreasing function of the number of guests (aside from the random noise), which makes sense.

This isn’t the most interesting of results so we’ll skip it henceforth.

Let’s look at the probability of the maximum number of shared birthdays being 2.

fig = plot_shared_birthday_prob(M=2, max_num_guests=100, how='a maximum of')

[figure: probability of a maximum of 2 people sharing a birthday vs number of guests]

That’s interesting. The probability starts off like the probability of observing at least 2 people sharing a birthday, but it never reaches the 90% threshold. Instead, after around 45 guests the probability starts decreasing. This of course makes sense: as the number of guests increases, we reach a point where having more than 2 people share a birthday is no longer unlikely. When the number of guests is larger than 87, the probability of having only singletons (unique birthdays) or pairs dips below 50%. Let’s keep this in mind…

What about trios?

fig = plot_shared_birthday_prob(M=3, max_num_guests=250)
50% threshold for 3 shared birthdays reached for 87 guests.
90% threshold for 3 shared birthdays reached for 132 guests.

[figure: probability of at least 3 people sharing a birthday vs number of guests]

Notice that the probability of having 3 or more people sharing a birthday reaches the 50% mark at 87 guests.

This is the same value for which the probability of having no more than 2 people share a birthday drops below 50%.

This is not a coincidence as these two events are mutually exclusive. You can either have at most 2 people sharing a birthday ($M \leqslant 2$) or more than 2 people sharing a birthday ($M>2$ which is equivalent to $M \geqslant 3$ since we don’t usually chop up the guests into pieces).
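
Since our simulation functions all use the same default seed, the two estimates are computed on identical trials, so we can verify this complementarity numerically (a quick sanity check using the functions defined above):

# complementary events evaluated on identical trials must sum to exactly 1
print(at_most_M_shared_birthdays(87, M=2) + at_least_M_shared_birthdays(87, M=3))  # 1.0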

What about having a maximum of 3 people share a birthday?

fig = plot_shared_birthday_prob(M=3, max_num_guests=250, how='a maximum of')

[figure: probability of a maximum of 3 people sharing a birthday vs number of guests]

Similarly to the pairs, the trios see their probability wane after about 130 guests, and it dips below 50% at about 188 guests. We can now guess what happens: as the number of guests reaches 188, having 4 or more people share a birthday becomes more likely than not. Which takes us to…

Quartets

fig = plot_shared_birthday_prob(M=4, max_num_guests=500)
50% threshold for 4 shared birthdays reached for 188 guests.
90% threshold for 4 shared birthdays reached for 259 guests.

[figure: probability of at least 4 people sharing a birthday vs number of guests]

fig = plot_shared_birthday_prob(M=4, max_num_guests=500, how='a maximum of')

[figure: probability of a maximum of 4 people sharing a birthday vs number of guests]

Once again we get that now familiar shape (notice that the shape becomes more symmetrical as M increases).

The decrease in probability heralds the arrival of…

Quintets

fig = plot_shared_birthday_prob(M=5, max_num_guests=500)
50% threshold for 5 shared birthdays reached for 314 guests.
90% threshold for 5 shared birthdays reached for 413 guests.

[figure: probability of at least 5 people sharing a birthday vs number of guests]

fig = plot_shared_birthday_prob(M=5, max_num_guests=1000, how='a maximum of')

[figure: probability of a maximum of 5 people sharing a birthday vs number of guests]

Sextets

(This is starting to take some time to run…)

fig = plot_shared_birthday_prob(M=6, max_num_guests=1000)
50% threshold for 6 shared birthdays reached for 457 guests.
90% threshold for 6 shared birthdays reached for 586 guests.

[figure: probability of at least 6 people sharing a birthday vs number of guests]

fig = plot_shared_birthday_prob(M=6, max_num_guests=1000, how='a maximum of')

[figure: probability of a maximum of 6 people sharing a birthday vs number of guests]

We are barely reaching the 50% threshold now.

And one last one for the road.

Septets

fig = plot_shared_birthday_prob(M=7, max_num_guests=1000)
50% threshold for 7 shared birthdays reached for 616 guests.
90% threshold for 7 shared birthdays reached for 768 guests.

[figure: probability of at least 7 people sharing a birthday vs number of guests]

fig = plot_shared_birthday_prob(M=7, max_num_guests=1000, how='a maximum of')

[figure: probability of a maximum of 7 people sharing a birthday vs number of guests]

It looks like, were we to smooth out this plot, the probability would never reach 50%.

There you have it!

Hopefully, this little exploration of the birthday problem illustrates how useful simulation can be. In the case of the generalised version of the problem beyond pairs, we were able to simulate the problem easily by making a small modification to our original code.

Perhaps one question remains. How can we be sure that our simulation gives a reasonable (however we choose to define it) approximation of the answer?

Well, for pairs we compared our code to the exact result. Can we do the same for M > 2?

It turns out that there exists an exact answer to this problem for the general case, however it is much more involved than our little derivation for M = 2.

For more information I’ll refer you to this paper by Fisher, Funk, and Sams.

We can nonetheless compare our simulation to the authors’ exact result for the 50% threshold for the values of M they used:

| M              | 2  | 3  | 4   | 5   | 6   | 7   |
|----------------|----|----|-----|-----|-----|-----|
| exact          | 23 | 88 | 187 | 313 | N/A | N/A |
| our simulation | 23 | 87 | 188 | 314 | 457 | 616 |

We’re in the right ballpark. We could probably improve our results a bit by using more than 1000 trials.

It might also be interesting to record the variance across the trials during our simulation.

This, as they say, will be left as an exercise for the reader, and the next time you’re at a birthday party, if there are more than 23 people present, be sure to sing “Haaaaappy birthday to youse!”, the pairs, trios, quartets, etc… will be grateful (the person organising the birthday party, not so much, oddly enough. There’s no pleasing some people anyway).

Do you feel lucky?


A coin flipped 100 times comes up heads 60 times. How likely is it that the coin is fair?

This is a basic example of statistical inference, trying to infer an estimate of some quantity (here the probability for heads) from some data (observation of 100 coin flips). This type of problem is invariably described in every first year stats course. To solve it, we first need to represent the underlying process generating our observations in a mathematical form. For coin flips, the binomial distribution is the natural choice. Let's affectionately call our random variable representing the number of heads $X$.

The probability of observing k heads in a run of N coin flips is given by the probability mass function (pmf):

$$P(X=k) = \left(\begin{array}{c}N \\ k \end{array} \right)p^k\, (1-p)^{N-k} = \dfrac{N!}{k!(N-k)!}\,p^k\, (1-p)^{N-k}$$

But where does this formula come from? It's actually quite simple. If we denote by p the probability of heads (outcome H) for one coin flip, the probability of tails (outcome T) is $1-p$ because we only have 2 mutually exclusive outcomes (we assume the coin doesn't land sideways).

If we flip the coin 2 times, since each coin flip is independent, we multiply the probabilities of the outcomes of each flip. For instance, the probability of observing HH (two heads) is $p^2$. The probability associated with TT is $(1-p)^2$, while the one associated with HT is $p(1-p)$, etc...

So if we flip the coin $N$ times, what is the probability of observing $k$ heads? Well, it's the probability of observing any configuration with $k$ heads and $N-k$ tails, which is $p^k(1-p)^{N-k}$ times the number of such configurations. To count the configurations we simply need to formalise the fact that, for this problem, the ordering doesn't matter. As far as we are concerned, HTH, HHT, and THH are equivalent, they all represent 2 heads in 3 coin flips. Instead of thinking about one coin flipped $N$ times, let's consider the equivalent problem of $N$ identical coins flipped once each.

How many ways can we arrange the $N$ coins? We've got $N$ options when picking the first coin. Once that's done, we've got $N-1$ ways of choosing the second coin, and so on, all the way to the last coin for which there's only one choice left to make, namely that very coin... This means that there are $N\times(N-1)\times\cdots\times2\times1$ ways of arranging the $N$ coins, which is the factorial of $N$ or $N!$.

Now suppose that $k\leq N$ of these coins have landed heads up. This means of course that $N-k$ coins show tails. If we were to reorder the $k$ coins among themselves, this would not change our result. We'd still have $k$ heads out of $N$ flips. Same with the $N-k$ tails. Therefore, when it comes to counting the number of configurations for which we have $k$ heads, we should divide the total number of ways to arrange $N$ coins by the total number of ways we can rearrange the $k$ heads among themselves, and the $N-k$ tails among themselves. This ends up being the binomial coefficient called "N choose k":

$$\left(\begin{array}{c}N \\ k \end{array}\right) = \frac{N!}{k!(N-k)!}.$$

Hence the formula above.
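
A quick numerical check of that counting argument (a small aside): for 2 heads in 3 flips the binomial coefficient gives 3, matching the HHT, HTH, THH configurations listed above.

from math import factorial
factorial(3) // (factorial(2) * factorial(1))  # "3 choose 2" -> 3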

One minor detail is that we don't know the value of $p$, the probability that a coin flip lands heads up. For a fair coin, we of course expect that $p=0.5$. Using the formula above we can also compute the probability of 60 heads out of 100 coin flips for a fair coin.

This will be our starting point in figuring out how to compute an estimate of $p$.

Let’s import all the modules we need.

# usual stuff
from math import *
import numpy as np

# suppress warning messages.
import warnings
warnings.filterwarnings('ignore')

# import the scipy module which contains scientific and stats functions.
import scipy as sp
import scipy.stats  # make sure the stats submodule is loaded (sp.stats is used below)

# import matplotlib and redirect plots to jupyter notebook. (cm is for color maps)
import matplotlib.pyplot as plt
from matplotlib import cm
%matplotlib inline

# sets the default figure size 
import matplotlib as mpl
mpl.rcParams['figure.figsize'] = (10, 6)

# sets some default plotting styles.
import seaborn as sns
sns.set_style("darkgrid")

from IPython.core.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

Let’s write a Python function that computes the formula above.

N = 100
k = 60
p = 0.5

def choose(N, k):
    # integer division (//) keeps the result exact, even for large factorials
    return factorial(N) // (factorial(k) * factorial(N-k))

def binom(N, k, p):
    return choose(N, k) * p**k * (1-p)**(N-k)

binom(N, k, p)
0.010843866711637987

So the pmf tells us that the probability of getting 60 heads out of 100 coin flips is a hair above 1%, if the coin is fair.

Note that this does not tell us if the coin is fair. So let’s determine how likely we are to observe 60 heads or more in a sequence of 100 flips from a fair coin.

There’s typically 2 ways that this problem is solved in introductory stats courses.

Method 1: Binomial Distribution

One-tailed binomial test.

Under the hypothesis that the coin is fair, the probability of observing at least 60 heads in a run of 100 coin flips can be computed using the one-tailed binomial test. One-tailed means we only care about differences in one direction, either higher or lower than the result expected under the hypothesis that the coin is fair. Here we only care about the probability of observing 60 or more heads. The scipy module has a suitable implementation of the two-tailed binomial test that we can divide by 2 to get the one-tailed result (note that we can divide the two-tailed result by 2 only because we are testing the hypothesis that the coin is fair, and for p=0.5 the distribution is symmetric).

sp.stats.binom_test(k, n=N, p=0.5)/2
0.02844396682049039

That's around 2.8%. This number is nothing more than the sum of the probabilities for all the values of k between 60 and 100.

$$\sum_{k = 60}^{100}\left(\begin{array}{c}N\\k \end{array} \right)p^k\, (1-p)^{N-k}$$

Here's a check using our binom function if you don't believe me.

sum(binom(N, k, p) for k in range(60, 101))
0.028443966820490392

Note:

The sum version has the advantage of stating explicitly what the computation is doing. We'll use this approach henceforth whenever we need to perform a binomial test.
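
As an aside, scipy can also produce this tail sum directly through the binomial distribution's survival function (assuming sp.stats is available; sf(k-1) gives $P(X \geqslant k)$):

sp.stats.binom.sf(k - 1, N, p)  # ~0.02844, same as the explicit sum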

Method 2: Normal approximation to the binomial distribution

One-tailed Z-test

When we have a large enough number of coin flips, we can approximate the discrete binomial distribution by a suitably defined continuous normal (gaussian) distribution with mean 50 ($=Np$) and standard deviation $\sqrt{Np(1-p)}$, courtesy of the famous central limit theorem.

This allows us to use the beloved Z test of introductory stats courses to determine the probability of observing at least 60 heads, under the hypothesis that the coin is fair. The Z test is just a way to compute how likely it is to observe a value of the random variable that is beyond a given distance from its assumed mean, using units of the standard deviation.

Once again, scipy can help us by supplying the cumulative distribution function (cdf) for our normal distribution. The cdf is a function that tells us the probability of getting a value lower than or equal to the value we are considering; it approaches 1 as its argument grows. By subtracting the cdf at a given value z_0 from 1, we get the probability of observing a result at least as large as z_0.

The probability we want is:

N = 100
k = 60
p = 0.5

1 - sp.stats.norm.cdf(k-0.5, loc=50, scale = sqrt(N*p*(1-p)))  # notice the -0.5?
0.028716559816001852

We could have computed this directly by first computing the Z score for k, which is nothing more than the observed count of heads minus its average for a fair coin (50), divided by its standard deviation.

mean = 50
sd = sqrt(N*p*(1-p)) # standard deviation (the mean is 50 for a fair coin)
zscore = (k-0.5 - mean)/sd  # notice the -0.5?

1 - sp.stats.norm.cdf(zscore)
0.028716559816001852

Note:

Why did we subtract 0.5 from k? It's called a continuity correction. We need it to compensate for the fact that we are substituting a continuous distribution for a discrete one.
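
To see what the correction buys us, here's a quick illustrative comparison (using the sd computed above; the exact binomial value is 0.02844):

# normal approximation without and with the continuity correction
print(1 - sp.stats.norm.cdf(k, loc=50, scale=sd))        # ~0.02275, noticeably too small
print(1 - sp.stats.norm.cdf(k - 0.5, loc=50, scale=sd))  # ~0.02872, much closer to the exact value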

Method 3: Simulation

This is all nice and peachy, but deep down, do we really understand everything we’ve done above? Coin flips are a fairly trivial concept, so our reasoning should be straightforward; however, what would we have done if we hadn’t been provided with the pmf of the binomial distribution? Since we derived it from “first principles” above, it wouldn’t have been much of a problem. However, could we have worked out the normal approximation to the binomial distribution (a tad trickier perhaps, with that continuity-correction business)? There’s got to be a better way! Not really, but there’s a different way: simulation.

The question is how likely is it for a fair coin to land 60 or more times heads up in a sequence of 100 flips?

One way to find out is to flip a fair coin 100 times and record the number of heads. Then we flip the coin another 100 times and record the number of heads. Then we flip the coin another 100 times and record the number of heads. Then we flip the coin another 100 times and record the number of heads. Then we flip the coin another 100 times and record the number of heads. Then we flip the coin another 100 times and record the number of heads. Then we flip the coin another 100 times and record the number of heads. Then we flip the coin another 100 times and record the number of heads. Then we flip the coin another 100 times and record the number of heads… You get the gist.

If we repeat this a large enough number of times, we should get an estimate of how likely it is for a fair coin to land heads up 60 or more times in a run of 100 flips simply by computing the proportion of trials where we got 60 or more heads.

Let the computer flip some coins for us!

numpy has a very useful function called random.randint that picks integers uniformly at random between a lower bound (inclusive) and an upper bound (exclusive).

For instance, let 1 correspond to heads and 0 to tails, we can simulate 10 flips of a fair coin with:

np.random.randint(0, 2, size=10)
array([1, 1, 1, 1, 0, 0, 1, 0, 0, 1])

Note that we can simulate multiple trials of 10 flips by passing a tuple to the size parameter of the function.

Here’s for instance a series of 3 trials with 10 flips each.

np.random.randint(0, 2, size=(3, 10))
array([[1, 0, 1, 1, 0, 0, 0, 1, 0, 0],
       [1, 0, 1, 0, 1, 0, 0, 0, 0, 0],
       [1, 1, 1, 0, 0, 1, 0, 1, 0, 0]])

Let’s flip our fair coin 200 million times.

We’ll simulate 2 million trials of 100 coin flips. Since we get 1 for heads and 0 for tails, the number of heads in a trial is simply the sum of all the values observed in that trial.

We set the random seed so that we all get the same result, not that it matters that much here.

np.random.seed(0)

N = 100  # number of flips per trial
repeats = 2000000   # number of trials

# we flip the "coin" N times and compute the sum of 1s (heads).
# the whole thing is repeated "repeats" times.
coin_flips = np.random.randint(0, 2, size=(repeats, N)).sum(axis=1)

coin_flips.shape
(2000000,)

Let’s compute the normal approximation to the binomial distribution for our case.

(this is so that we can plot it alongside our histogram)

p = 0.5  # probability of heads for a fair coin
x = np.linspace(coin_flips.min(), coin_flips.max(), 100)   # 100 evenly spaced values between the min and max counts
ndist = sp.stats.norm.pdf(x, loc=p * N, scale = sqrt(N*p*(1-p))) # we use scipy to compute the normal approximation

Let’s plot the histogram of our simulation, as well as the normal approximation.

The red vertical line marks the observed result (number of heads).

plt.hist(coin_flips, bins=50, density=True, align='mid', edgecolor='w') # plot the histogram
plt.axvline(x=k, ymin=0, ymax=1, color='r', label="observed # heads")  # plot a vertical line at x = k
plt.xlabel('k', fontsize=16) # label for the x axis
plt.ylabel('$P(k)$', fontsize=16) # label for the y axis
plt.title('{} coin flips repeated {:,d} times'.format(N, repeats), fontsize=16) # title of the figure
plt.plot(x, ndist, 'c', lw=3, label="normal approx.") # plot normal approximation curve
plt.legend(fontsize=14, loc=0)  # add legend
plt.show() # display the result

[figure: histogram of heads counts over 2,000,000 trials, with the normal approximation and the observed count (60) marked]

Let’s extract the proportion of our 100 coin flip trials which have 60 or more heads from our simulation.

np.mean(coin_flips >= k)
0.0285285

Note:

Computing the mean is the same as adding all the values for which the condition is true, which gives us the number of trials where we got at least k heads, and dividing by the number of trials, which gives us the proportion we're after.

np.sum(coin_flips >= k) / repeats
0.0285285

Note:

In the context of statistical inference (the frequentist kind), this probability is called a p-value. It gives us the probability of observing a result as or more extreme than the one observed ($X\geqslant 60$) if the null hypothesis ($p=0.5$) is true. It also sometimes gives statistics a bad name...

Not a bad estimate of the probability of having $X\geqslant 60$ under the (null) hypothesis that the coin is fair.

So it looks like our simulation works for 60 or more heads. It would be nice to check that it works for other values.

We could either change the value of k above, and re-run our code, or we could copy and paste the code and change the value of k, but neither approach is good programming practice. Whenever we realise that we’d like to run the same piece of code multiple times, for different values of parameters, we should wrap our code in a function.

Let’s do that.

def compute_pvalue(k=50):
    print("p-value",
          "\nOur simulation:         ", np.mean(coin_flips >= k),
          "\nScipy (Binomial test):  ", sum(binom(N, kvalue, 0.5) for kvalue in range(k, 101)),
          "\nScipy (Normal Approx):  ", 1 - sp.stats.norm.cdf(k-0.5, loc=50, scale = sqrt(N*p*(1-p))))
    
    plt.hist(coin_flips, bins=50, density=True, edgecolor='w')
    plt.plot(x, ndist, 'c', lw=3)
    
    plt.axvline(x=k, ymin=0, ymax=1, color='r')
    
    plt.xlabel('k', fontsize=16)
    plt.ylabel('$P(k)$', fontsize=16)
    plt.title('{} coin flips repeated {:,d} times'.format(N, repeats), fontsize=16)

    plt.show()

compute_pvalue(60)
p-value 
Our simulation:          0.0285285 
Scipy (Binomial test):   0.028443966820490392 
Scipy (Normal Approx):   0.028716559816001852

[figure: histogram of heads counts with the normal approximation; vertical line at k = 60]

We've considered the probability that the number of heads is at least k in a run of 100 coin flips.

Let's write a version of the code above that will compute and visualise the probability that the number of heads is exactly k in a run of 100 coin flips, under the hypothesis that the coin is fair.

def compute_pvalue_B(k=50):
    print("p-value",
          "\nOur simulation:         ", np.mean(coin_flips == k),
          "\nScipy (Binomial test):  ", binom(N, k, 0.5),
          "\nScipy (Normal Approx):  ", sp.stats.norm.cdf(k+0.5, loc=50, scale = sqrt(N*p*(1-p))) - sp.stats.norm.cdf(k-0.5, loc=50, scale = sqrt(N*p*(1-p))))
    
    plt.hist(coin_flips, bins=50, edgecolor='w', density=True)
    plt.plot(x, ndist, 'c', lw=3)
    
    plt.axvline(x=k, ymin=0, ymax=1, color='r')
    
    plt.xlabel('k', fontsize=16)
    plt.ylabel('$P(k)$', fontsize=16)
    plt.title('{} coin flips repeated {:,d} times'.format(N, repeats), fontsize=16)

    plt.show()
    
compute_pvalue_B(60)
p-value 
Our simulation:          0.0108845 
Scipy (Binomial test):   0.010843866711637987 
Scipy (Normal Approx):   0.010852139253185289

[figure: histogram of heads counts with the normal approximation; vertical line at k = 60]

Let's now write a version of the code above that will compute and visualise the probability that the number of heads is between two values k_1 and k_2 in a run of 100 coin flips, under the hypothesis that the coin is fair (assume that $0\leqslant k_1 \leqslant 50$ and $50 \leqslant k_2 \leqslant 100$).

def compute_pvalue(k1=40, k2=50):
    k1, k2 = min(k1, k2), max(k1, k2)
    print("p-value",
          "\nOur simulation:         ", np.mean(np.logical_and(coin_flips >= k1, coin_flips <= k2)),
          "\nScipy (Binomial test):  ", np.sum([binom(N, k, 0.5) for k in range(k1, k2+1)]),
          "\nScipy (Normal Approx):  ", sp.stats.norm.cdf(k2+0.5, loc=50, scale = sqrt(N*p*(1-p)))-
                                        sp.stats.norm.cdf(k1-0.5, loc=50, scale = sqrt(N*p*(1-p))) )
    
    plt.hist(coin_flips, bins=50, density=True, edgecolor='w')
    plt.plot(x, ndist, 'c', lw=3)
    
    plt.axvline(x=k1, ymin=0, ymax=1, color='g')
    plt.axvline(x=k2, ymin=0, ymax=1, color='r')
    
    plt.xlabel('k', fontsize=16)
    plt.ylabel('$P(k)$', fontsize=16)
    plt.title('{} coin flips repeated {:,d} times'.format(N, repeats), fontsize=16)

    plt.show()
    
compute_pvalue(k1=45, k2=60)
p-value 
Our simulation:          0.847064 
Scipy (Binomial test):   0.8467733878542303 
Scipy (Normal Approx):   0.8464695184908008

[figure: histogram of heads counts with the normal approximation; vertical lines at k1 = 45 and k2 = 60]


Dice roll

Aside from its usefulness in statistical inference, simulation is a nice way to explore problems which involve counting occurrences of a certain event. Since this is the basic concept behind probability theory, simulation can be a convenient way to tackle probabilistic problems.

Let's look at a simple example.

Assume that we have a regular, fair, six-sided die with sides numbered 1 to 6.

What is the probability of getting:

  • 6 when rolling the die once?
  • no 6 when rolling the die once?

What about the probability of getting:

  • no 6 when rolling the die twice?
  • a single 6 when rolling the die twice?
  • two 6s when rolling the die twice?

We can of course enumerate (count) all the ways you can obtain the required outcome, and normalise (divide) this by all the possible outcomes, but it’s not a great general strategy. A better way is to work out the exact mathematical formula for the probability we’re interested in. We’ll do that shortly, but first let’s just simulate the process of rolling dice.

Let's create a function roll_dice that takes two integers n_rolls and n_trials as arguments, and returns an array with the number of 6s obtained in each of the n_trials trials when we roll n_rolls dice at the same time.

For instance, roll_dice(2, 10) would return an array of 10 values representing the number of 6s observed when rolling 2 dice in each of the 10 trials.

We'll fix the random seed for reproducibility.

def roll_dice(n_rolls, n_trials):
    np.random.seed(42)
    die_rolls = (np.random.randint(1, 7, size = (n_trials, n_rolls)) == 6).sum(axis=1)
    return die_rolls
roll_dice(2, 10)
array([0, 0, 0, 0, 0, 0, 1, 0, 2, 0])

Note that the values in the array represent the number of 6s observed in each of the 10 trials.

The function below will visualise the result of roll_dice as a histogram.

It uses the roll_dice function to compute the counts of 6s across n_trials trials by default.

It will also print the proportion of trials observed for each possible count of 6s.

def plot_histogram(n_rolls = 3, n_trials = 100000):
    np.random.seed(12345)
    rolls = roll_dice(n_rolls=n_rolls, n_trials=n_trials)
    values = plt.hist(rolls, 
                      bins=np.arange(n_rolls+2), 
                      density=True, 
                      align='left', 
                      alpha = 0.6, 
                      edgecolor='k'
                    )
    plt.xticks(np.arange(n_rolls+1), fontsize=13)
    plt.yticks(np.linspace(0, 1, 11), fontsize=13)
    plt.title(f'Number of 6s in {n_rolls} rolls over {n_trials:,} trials',
              fontsize=14)
    print("\n".join([f"Number of 6s: {i:3d} | Proportion: {values[0][i]:.6f}" 
                     for i in range(n_rolls+1)]))
    plt.show()
    return
plot_histogram(n_rolls=2)
Number of 6s:   0 | Proportion: 0.695630
Number of 6s:   1 | Proportion: 0.276400
Number of 6s:   2 | Proportion: 0.027970

[figure: histogram of the number of 6s in 2 rolls over 100,000 trials]
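
For reference, the exact values for two rolls are $(5/6)^2 = 25/36 \approx 0.694$ for no 6, $2\times(1/6)(5/6) = 10/36 \approx 0.278$ for a single 6, and $(1/6)^2 = 1/36 \approx 0.028$ for two 6s, all in good agreement with the simulated proportions above.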

plot_histogram(n_rolls=3)
Number of 6s:   0 | Proportion: 0.580000
Number of 6s:   1 | Proportion: 0.345960
Number of 6s:   2 | Proportion: 0.069400
Number of 6s:   3 | Proportion: 0.004640

[figure: histogram of the number of 6s in 3 rolls over 100,000 trials]

Assuming that we have regular, fair, six-sided dice, what is the probability of getting exactly five 6s when we roll 10 dice at the same time (or the same die 10 times in a row)?

Hints:

The choose function we introduced earlier gives us the number of ways you can pick k interchangeable items from a collection of N elements.

Using the choose function we introduced earlier, we can write down the probability of getting exactly five 6s in 10 rolls. In 10 rolls, we have $6^{10}$ possible outcomes. If we observe five 6s, this means that we also have 5 values other than 6. These 5 values come in $5^5$ combinations (5 ways to choose something other than a 6 for each of the 5 dice). And there are "10 choose 5" ways to choose which 5 dice don't show a 6. Hence:

choose(10, 5) * 5**5 / 6 ** 10
0.013023810204237159

Meaning that the probability of observing five 6s in 10 rolls is about 1.3%.
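
This is just the binomial pmf with success probability 1/6; scipy agrees (an aside, assuming sp.stats is available):

sp.stats.binom.pmf(5, 10, 1/6)  # ~0.013024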

We can check that against our dice roll simulation code.

plot_histogram(n_rolls=10, n_trials=1000000)
Number of 6s:   0 | Proportion: 0.161252
Number of 6s:   1 | Proportion: 0.323586
Number of 6s:   2 | Proportion: 0.290073
Number of 6s:   3 | Proportion: 0.155074
Number of 6s:   4 | Proportion: 0.054466
Number of 6s:   5 | Proportion: 0.013067
Number of 6s:   6 | Proportion: 0.002203
Number of 6s:   7 | Proportion: 0.000253
Number of 6s:   8 | Proportion: 0.000026
Number of 6s:   9 | Proportion: 0.000000
Number of 6s:  10 | Proportion: 0.000000

[figure: histogram of the number of 6s in 10 rolls over 1,000,000 trials]

The exact result.

Let's implement the formula for an arbitrary number k of 6s when rolling N fair dice ($k\leqslant N$).

We'll call this function number_of_sixes(N, k).

We’ll use this function to:

  1. check the simulation results in the histogram above.
  2. compute the probability of getting at least five 6s in 10 rolls of a die.
  3. compute the probability of getting at most five 6s in 10 rolls of a die.

def number_of_sixes(N, k):
    return choose(N, k)*5**(N-k)/6**N

for i in range(11):
    print("Probability of {:2d} 6s in 10 rolls: {:.6f}".format(i, number_of_sixes(10, i)))
    
print("\nProbability of at least five 6s in 10 rolls: {:.6f}".format(sum(number_of_sixes(10, k) for k in range(5, 11))))

print("\nProbability of at most five 6s in 10 rolls: {:.6f}".format(sum(number_of_sixes(10, k) for k in range(6))))
Probability of  0 6s in 10 rolls: 0.161506
Probability of  1 6s in 10 rolls: 0.323011
Probability of  2 6s in 10 rolls: 0.290710
Probability of  3 6s in 10 rolls: 0.155045
Probability of  4 6s in 10 rolls: 0.054266
Probability of  5 6s in 10 rolls: 0.013024
Probability of  6 6s in 10 rolls: 0.002171
Probability of  7 6s in 10 rolls: 0.000248
Probability of  8 6s in 10 rolls: 0.000019
Probability of  9 6s in 10 rolls: 0.000001
Probability of 10 6s in 10 rolls: 0.000000

Probability of at least five 6s in 10 rolls: 0.015462

Probability of at most five 6s in 10 rolls: 0.997562

Our simulation results above are in good agreement with these exact values.
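
One final cross-check (an aside, using scipy's binomial distribution; sf(4) gives $P(X \geqslant 5)$ for $X \sim \mathrm{Binomial}(10, 1/6)$):

sp.stats.binom.sf(4, 10, 1/6)  # ~0.015462, matching the "at least five 6s" value above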