Posted on August 29, 2018
| 5 minutes
| 994 words
| Adel Rahmani
Same difference.
Congratulations! Your simulation code from the previous post has impressed all and sundry and you've been asked to teach introductory statistics to first year students at the prestigious Vandelay University.
You've got 2 stats classes, one with a group of 65 students, and another with a group of 35 students.
We assume that the students have been randomly allocated to each group, and that they share the same
final exam.
The difference between the groups is the teaching technique employed. With group1 you teach a standard stats course,
while with group2 you sometimes use interpretive dance instead of equations to explain statistical concepts.
Let's jump right in.
from math import *
import numpy as np

# suppress warning messages.
import warnings
warnings.filterwarnings('ignore')

# import the scipy module which contains scientific and stats functions.
import scipy.stats as stats

# usual plotting stuff.
import matplotlib.pyplot as plt
from matplotlib import cm
%matplotlib inline

# set the matplotlib style
plt.style.use("seaborn-darkgrid")

from IPython.core.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))
Here are the marks (out of 100) in the final exam for these two groups.
Is this difference in means statistically significant (i.e., should you drop the interpretive dance routine)?
Statistics should be able to provide us with an answer. Thankfully, some of my best friends are statisticians.
I found one of those magnificent statistical creatures, and I asked them about differences between means of two groups.
They mumbled something about beer, students, and tea test or something… I couldn’t tell what they were saying so I went to scipy instead and
it turns out there’s an app for that.
The t-test is a statistical test that can tell us whether the difference in means between our two groups is statistically significant.
First, let us observe that the two groups have similar variances (this allows us to run a particular flavour of the test).
group1.var(), group2.var()
(179.0248520710059, 175.55102040816323)
Close enough for us. Let’s run the test.
t, p = stats.ttest_ind(group1, group2, equal_var=True)
print(f'Probability that a difference at least as extreme as {diff:0.2f} is due to chance (t test): {p*100:.2f}%')
Probability that a difference at least as extreme as 7.93 is due to chance (t test): 0.60%
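For the curious, here is a rough sketch of the statistic behind the equal-variance flavour of the test: the difference in means divided by a standard error built from the pooled sample variance. It only uses the standard library, the marks are made up (they are not the marks from this post), and `pooled_t_statistic`, `toy1` and `toy2` are names invented for the illustration. Turning the statistic into a p-value requires the t distribution, which is the part scipy takes care of.

```python
import math
import statistics


def pooled_t_statistic(x, y):
    """Two-sample t statistic assuming equal variances (pooled variance)."""
    n1, n2 = len(x), len(y)
    # Sample variances (n - 1 in the denominator).
    v1, v2 = statistics.variance(x), statistics.variance(y)
    # Pooled variance: weighted average of the two sample variances.
    pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    # Standard error of the difference in means.
    se = math.sqrt(pooled * (1 / n1 + 1 / n2))
    t_stat = (statistics.mean(x) - statistics.mean(y)) / se
    # The statistic is compared with a t distribution with n1 + n2 - 2 degrees of freedom.
    return t_stat, n1 + n2 - 2


# Made-up marks for illustration only, NOT the data from this post.
toy1 = [72, 65, 80, 58, 77, 69]
toy2 = [60, 55, 71, 63, 52]

t_toy, dof = pooled_t_statistic(toy1, toy2)
print(f"t = {t_toy:.3f} with {dof} degrees of freedom")
```

Note that `statistics.variance` uses the sample variance (dividing by n - 1), whereas numpy's `.var()` divides by n unless `ddof=1` is passed.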
But what does it all mean?
To the simulation!
We are trying to test whether there is a genuine (statistically significant) difference between the two groups.
One way that we can test this is to estimate how likely we are to observe a difference between the means of the two groups of at least 7.93, if we assume that there’s no difference in marks between the two groups (null hypothesis).
We can accomplish that by pooling all the values together and randomly shuffling (assigning) 65 of these values to group1 and the rest to group2.
Using this shuffling scheme, we will get an average difference between the two groups around zero, and the spread of the values we get will tell us how extreme (i.e., unlikely) a value of 7.93 or larger would be under the null hypothesis.
Let’s randomly shuffle the data across the two groups and compute the difference in means 100,000 times.
(this may take a moment)
N = 100000
np.random.seed(12345)

# Let's pool the marks together
a = np.concatenate((group1, group2))
i = group1.size

L = []
for _ in range(N):
    # shuffle the data using random permutation (most useless code comment ever!)
    shuffle = np.random.permutation(a)

    # split the shuffled data into 2 groups
    grp1, grp2 = shuffle[:i], shuffle[i:]

    # compute the difference in means
    L.append(np.mean(grp1) - np.mean(grp2))

L = np.array(L)
Let’s plot a histogram of the results.
plt.figure(figsize=(12, 6))
plt.hist(L, bins=50, density=False, edgecolor='w')
plt.title('Difference between group means', fontsize=18)
plt.axvline(x=diff, ymin=0, ymax=1, color='r')
plt.annotate("Observed\ndifference\n($7.93$)", xy=(8.5, 5000), color='r', fontsize=14)
plt.xlabel("difference in means", fontsize=20)
plt.xticks(fontsize=14)
plt.ylabel("count", fontsize=20)
plt.yticks(fontsize=14)
plt.show()
On the histogram, we see that the observed difference is quite far from the mode of the distribution.
In other words, it appears that a difference of 7.93 or more (in magnitude) does not occur very often. Let’s quantify this.
Proportion of simulated trials where the (absolute value of the) difference exceeds the observed difference.
pSim = np.mean((np.abs(L) > diff))
print(f'Probability that the difference at least as extreme as {diff:0.2f} is due to chance (Simulation): {pSim*100:.2f}%')
Probability that the difference at least as extreme as 7.93 is due to chance (Simulation): 0.58%
This is not too bad considering the t-test gave us 0.60%.
This result means that if we assume that the two groups are sampled from the same population of students, the probability of observing a difference in means of at least 7.93 (in magnitude) between the groups just by random chance is only around 0.6%.
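As a sanity check on the shuffling idea, here is a small sketch that enumerates every possible split instead of sampling random ones. This is only feasible for tiny groups (with 65 and 35 students the number of splits is astronomically large, hence the random shuffles above). The marks are made up, and `exact_permutation_pvalue` is a name invented for this illustration.

```python
from itertools import combinations
from statistics import mean


def exact_permutation_pvalue(x, y):
    """Two-sided exact permutation p-value for the difference in means."""
    pooled = list(x) + list(y)
    observed = abs(mean(x) - mean(y))
    indices = range(len(pooled))
    count = total = 0
    # Enumerate every way of choosing which pooled values go to the first group.
    for chosen in combinations(indices, len(x)):
        chosen_set = set(chosen)
        g1 = [pooled[k] for k in chosen_set]
        g2 = [pooled[k] for k in indices if k not in chosen_set]
        # Small tolerance so that floating point noise doesn't hide ties.
        if abs(mean(g1) - mean(g2)) >= observed - 1e-12:
            count += 1
        total += 1
    return count / total


# Made-up toy marks, NOT the data from this post.
p_exact = exact_permutation_pvalue([8, 9, 10], [1, 2, 3])
print(p_exact)  # 2 of the 20 possible splits are as extreme as the observed one: 0.1
```

With three values per group there are only 20 splits, and only the observed split and its mirror image give a difference of 7 in magnitude, so the exact p-value is 2/20.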
It is quite typical for the threshold for statistical significance to be set at 5%. Therefore, in this case,
we’d conclude that the difference between the two groups is statistically significant. In other words, the teaching
method has an impact on the marks. You might want to put that leotard away and stop the gyrations, ’cause that ain’t dancing, Sally!
Posted on May 10, 2018
| 15 minutes
| 3136 words
| Adel Rahmani
In this post I want to take a look at a classic data set: the US baby names.
While the origin of many names can be found in popular culture and (what's the alternative to popular culture?
impopular culture?) elsewhere, we shall focus on how the TV show Seinfeld may have influenced the naming of hundreds of innocent babies.
We'll roughly follow the steps outlined by Wes McKinney in his excellent book Python for Data Analysis to load the data.
Let's first start by importing a couple of modules.
(Follow the link, download the dataset into a directory called names inside the datasets directory, and unzip it.)
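The dataset lists the baby names recorded in the US each year, starting in 1880 (a name is only recorded for a given year and sex if it was given to at least five babies).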
The data for each year is in its own text file. For instance, the data for 1880 is in the file yob1880.txt (which should
be located in the datasets/names/ folder.)
The first thing to do is merge the data for all the different years into a single dataframe. However, as a warm-up, let’s look at the data for a single year.
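Here is a minimal sketch of both steps, assuming each yearly file holds comma-separated `name,sex,births` records with no header row. To keep it self-contained, the file contents are replaced by made-up strings (`fake_files`, with invented names and counts), and `read_year` is a helper name introduced for the illustration rather than the code originally used in this post.

```python
import io

import pandas as pd

# Made-up stand-ins for the contents of two yearly files (NOT real records).
fake_files = {
    1880: "Alice,F,12\nBob,M,9\n",
    1881: "Alice,F,15\nCarol,F,7\n",
}


def read_year(text, year):
    """Parse one year's worth of comma-separated name,sex,births records."""
    df = pd.read_csv(io.StringIO(text), names=["name", "sex", "births"])
    # Store the year as a timestamp so that `.dt.year` works later on.
    df["year"] = pd.to_datetime(str(year), format="%Y")
    return df


# Warm-up: a single year.
one_year = read_year(fake_files[1880], 1880)

# All years stacked into a single dataframe.
all_years = pd.concat(
    [read_year(text, year) for year, text in fake_files.items()],
    ignore_index=True,
)
print(all_years)
```

With the real data, the `io.StringIO(text)` argument would be replaced by the path of the corresponding file, for instance datasets/names/yob1880.txt for 1880.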
unique_names.plot(figsize=(12, 6), title="Unique names by sex and year", lw=3, color=('r', 'g'), alpha=0.5);
How many names from 1880 were still used in the most recent year we have, and which ones have fallen into desuetude?
last_year = names.year.dt.year.max()

s1880 = set(names[names.year.dt.year == 1880]['name'].unique())
slast = set(names[names.year.dt.year == last_year]['name'].unique())

print(f"""
There were {len(s1880)} distinct names recorded in 1880
There were {len(slast)} distinct names recorded in {last_year}
{len(slast.intersection(s1880))} names were found in both years
{len(s1880.difference(slast))} names found in 1880 were not recorded in {last_year}
{len(slast.difference(s1880))} names found in {last_year} were not recorded in 1880
""")

print(f"The names recorded in 1880 but no longer in use in {last_year} are:\n\n",
      sorted(s1880.difference(slast)))
Upon a more careful examination of the whole output, some interesting observations can be made.
For instance I would expect names like Lucille to be only given to girls (or B.B. King's guitar), but according to our result, these names were also given to boys.
On the other hand, while to me the name Basil may evoke a cantankerous hotel manager in Torquay, apparently, some parents thought it the perfect moniker for their baby girl. Similarly, the name Sylvester was given to both boys and girls.
Let's have a closer look.
names[names.name == 'Lucille'].head()

        name sex  births       year
248  Lucille   F      40 1880-01-01
223  Lucille   F      48 1881-01-01
170  Lucille   F      85 1882-01-01
220  Lucille   F      66 1883-01-01
188  Lucille   F      94 1884-01-01
In how many years does the name Lucille appear for either boys or girls?
Note that this is not the number of babies of each sex with that name.
We've grouped the data by year and sex, so this count corresponds to the number of years in which at least five babies of a given sex were named Lucille (five being the yearly threshold below which a name isn't recorded in the data set).
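One possible way of computing such a count is sketched below. It is an illustration on made-up records shaped like the `names` dataframe, not necessarily the code originally used here; `toy` and `years_recorded` are names invented for the example.

```python
import pandas as pd

# Made-up records in the same shape as the `names` dataframe (NOT real counts).
toy = pd.DataFrame({
    "name": ["Lucille", "Lucille", "Lucille", "Lucille", "Basil"],
    "sex": ["F", "M", "F", "F", "M"],
    "births": [40, 5, 48, 85, 12],
    "year": pd.to_datetime(["1880", "1880", "1881", "1882", "1880"], format="%Y"),
})


def years_recorded(df, name):
    """Number of distinct years in which `name` appears, for each sex."""
    subset = df[df.name == name]
    return subset.groupby("sex")["year"].nunique()


counts = years_recorded(toy, "Lucille")
print(counts)
```

In this toy frame Lucille shows up in three different years for girls and in one year for boys, regardless of how many babies were given the name in each of those years.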
We can compute the same information for other names.
Note that a plausible explanation of why the dominant gender for a given name may have changed over the years can sometimes be found relatively easily, as in the case of the name Ashley...
For other names, a change in popularity may be due to a less conventional reason...
In an episode of Seinfeld George mentions to his friends that should he ever have a kid, he'd name him (or her) Seven, after the jersey number of Yankee baseball player Mickey Mantle.
First, let’s check whether “Seven” is in our data set.
'Seven' in names.name.unique()
True
Giddy up!
The Seinfeld episode was aired in 1996.
Let’s see if there’s any indication that it may have influenced the popularity of the name “Seven”.
First let’s create a date range over which to plot the data.
seinfeldYear = '1996'
start = '1991'
end = '2018'

# Let's create a range of dates from start to end.
date_range = pd.date_range(start, end, freq='A-JAN')

# The data
data = (names[names.name == 'Seven']
        .pivot_table('births', index='year', columns='sex', fill_value=0, aggfunc=np.sum))

# The base plot
ax = data.plot(figsize=(12, 8), logy=False, marker='o', color=('r', 'g'), alpha=0.5,
               title='Seven', grid=False)

# The vertical strip
ax.axvline(seinfeldYear, ymin=0, ymax=1000, lw=15, color='orange', alpha=0.5)
ax.set_xticks(date_range)
ax.set_xlim([start, end])

# Ensure that the labels on the x axis match the years.
ax.set_xticklabels(date_range.year, rotation=0, ha='center')

# Annotate the figure with some text by specifying location using xy parameter
ax.annotate('Episode "The Seven" is aired',
            xy=(pd.to_datetime(seinfeldYear, format="%Y"), 100),
            xycoords='data', rotation=90, horizontalalignment='center')

plt.show()
Note that this does not prove that the show "created" the name fad (correlation ≠ causation and all that; there could be another reason behind both the increase in popularity of the name, and the use of the name in the show), but it does seem to indicate that it enhanced it...
Poor kids! Serenity now!
E. Mom & Pop culture
While we’re on this topic, can we find other possible influences of pop culture in the data set?
Let’s create a general function that we can use to plot various trends.
def plot_name_event(data=names, name="Seven", year='1996', start='1968', end='2013', event='', freq='A-JAN'):
    date_range = pd.date_range(start, end, freq=freq)

    # Births per year and sex for the requested name
    # (use the dataframe passed in rather than the global one).
    data = (data[data.name == name]
            .pivot_table('births', index='year', columns='sex', fill_value=0, aggfunc='sum'))

    ax = data.plot(figsize=(14, 6), logy=False, marker='o', color=('r', 'g'), alpha=0.5,
                   title=f"Name: {name} | {event} in {year}", grid=False)

    # Vertical strip marking the year of the event.
    ax.axvline(year, ymin=0, ymax=data.max().max(), lw=15, color='orange', alpha=0.5)
    ax.set_xticks(date_range)
    ax.set_xlim([start, end])
    ax.set_ylim([0, data.loc[data.index <= date_range[-1]].max().max() * 1.1])
    ax.set_xticklabels(date_range.year, rotation=90, ha='center')

    plt.show()
First, let’s look at a few movie characters/actors.
plot_name_event(name="Neo", year='1999', start='1990', event='The movie The Matrix is released')
plot_name_event(name="Trinity", year='1999', start='1990', event='The movie The Matrix is released')
How about further back in time?
Errol Flynn got his big break in Captain Blood back in 1935. Let’s see if all that swashbuckling influenced parents when it came to naming their progeny.
plot_name_event(name="Errol", year='1935', start='1931', end='1961', event='The movie Captain Blood is released')
Another possible Hollywood influence around the same time.
plot_name_event(name="Hedy", year='1938', start='1915', end='1971', event="The actress Hedy Lamarr made her Hollywood debut")
Earlier still
plot_name_event(name="Greta", year='1925', start='1910', end='1981', event="The actress Greta Garbo made her Hollywood debut")
Of course, we can’t talk about movies without talking about Star Wars.
Let’s see if we can track the names of some of the characters.
plot_name_event(name="Leia", year='1977', start='1970', end='1995', event='The movie Star Wars is released')
plot_name_event(name="Han", year='1977', start='1970', end='2015', event='The movie Star Wars is released')
plot_name_event(name="Lando", year='1977', start='1970', end='2015', event='The movie Star Wars is released')
Hmmmm… A bit surprising.
What about other characters from the trilogy?
Fortunately (for the kids) there doesn’t seem to be any Jabba or Jar Jar in our list… (TBH, I’m mildly disappointed).
Let’s look at the main character of another popular movie series.
plot_name_event(name="Indiana", year='1981', start='1970', end='2015', event="The movie Raiders Of The Lost Ark was released")
While on this topic…
plot_name_event(name="Harrison", year='1981', start='1975', end='2015', event="The movie Raiders Of The Lost Ark was released")
In a different genre we also have:
plot_name_event(name="Clint", year='1966', start='1950', end='1981', event="The movie The Good, the Bad, and the Ugly was released")
We’ve focused on movies, but of course, popular culture’s influence on baby names isn’t limited to movies or television.
Songs or singers can also contribute to the popularity of a given name.
Here are a couple of examples.
plot_name_event(name="Elvis", year='1955', start='1945', end='1991', event="Radios start playing Elvis Presley's songs")
plot_name_event(name="Michelle", year='1965', start='1960', end='1981', event='The song Michelle is released by The Beatles')
plot_name_event(name="Jermaine", year='1970', start='1960', end='1990', event="The Jackson 5 top the charts")
… And a couple of more recent ones
plot_name_event(name="Beyonce", year='1998', start='1995', end='2015', event="The group Destiny's Child released its debut album")
Same with sport.
plot_name_event(name="Kobe", year='1996', start='1988', end='2015', event="NBA player Kobe Bryant made his debut in the league")
I’ll stop here, but I’m sure many other interesting patterns can be found in this data set…
As I mentioned earlier, the fact that a name’s increase in popularity coincides with a particular event isn’t enough to demonstrate a causal relationship. However, for some of these examples the coincidence is certainly interesting.
In any case, I’d like to think that hundreds, if not thousands, of people owe their moniker to George Costanza.