That ain't dancing Sally!

Same difference.

Congratulations! Your simulation code from the previous post has impressed all and sundry and you've been asked to teach introductory statistics to first year students at the prestigious Vandelay University.

You've got 2 stats classes, one with a group of 65 students, and another with a group of 35 students. We assume that the students have been randomly allocated to each group, and that they share the same final exam.

The difference between the groups is the teaching technique employed. With group1 you teach a standard stats course, while with group2 you sometimes use interpretive dance instead of equations to explain statistical concepts.


Let's jump right in.

from math import *
import numpy as np

# suppress warning messages.
import warnings
warnings.filterwarnings('ignore')

# import the scipy module which comtains scientific and stats functions.
import scipy.stats as stats

# usual plotting stuff.
import matplotlib.pyplot as plt
from matplotlib import cm
%matplotlib inline

# set the matplotlib style 
plt.style.use("seaborn-darkgrid")

from IPython.core.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

Here are the marks (out of 100) in the final exam for these two groups.

group1 = np.array([57, 65, 54, 53, 81, 75, 61, 63, 68, 73, 71, 45, 63, 62, 74, 69, 37,
       55, 78, 55, 54, 65, 95, 48, 53, 61, 63, 65, 60, 74, 52, 50, 33, 39,
       100, 66, 46, 61, 99, 63, 34, 57, 43, 49, 45, 63, 64, 51, 63, 74, 60,
       62, 59, 71, 74, 49, 42, 60, 68, 52, 69, 59, 60, 67, 69])

group2 = np.array([40, 50, 25, 43, 37, 36, 62, 58, 58, 62, 45, 44, 40, 43, 53, 51, 38,
       59, 55, 62, 71, 40, 60, 49, 61, 53, 74, 49, 73, 51, 59, 52, 98, 57,
       52])
print(group1.size, group2.size)
65 35

When you compare the marks between the two groups you notice that, on average, the marks in group2 are lower that those in group1.

diff = group1.mean() - group2.mean()
print(diff)
7.934065934065934

We can visualise the distribution of marks in both groups using a box plot.

fig = plt.figure(figsize=(8, 6))
plt.boxplot((group1, group2))
plt.xlabel("group", fontsize=20)
plt.xticks(fontsize=14)
plt.ylabel("mark", fontsize=20)
plt.yticks(fontsize=14)
plt.show()

png

Is this difference in means statistically significant (i.e., should you drop the interpretive dance routine)?

Statistics should be able to provide us with an answer. Thankfully, some of my best friends are statisticians. I found one of those magnificent statistical creatures, and I asked them about differences between means of two groups. They mumbled something about beer, students, and tea test or something… I couldn’t tell what they were saying so I went to scipy instead and it turns out there’s an app for that.

The t-test is a statistical test that can tell us whether the difference in means between our two groups is statistically significant.

First, let us observe that the two groups have similar variances (this allows us to run a particular flavour of the test).

group1.var(), group2.var()
(179.0248520710059, 175.55102040816323)

Close enough for us. Let’s run the test.

t, p = stats.ttest_ind(group1, group2, equal_var=True)

print(f'Probability that a difference at least as extreme as {diff:0.2f} is due to chance (t test): {p*100:.2f}%')
Probability that a difference at least as extreme as 7.93 is due to chance (t test): 0.60%
But what does it all mean?

To the simulation!

We are trying to test whether there is a genuine (statistically significant) difference between the two groups.

One way that we can test this is to estimate how likely we are to observe a difference between the means of the two groups of at least 7.93, if we assume that there’s no difference in marks between the two groups (null hypothesis).

We can accomplish that by pooling all the values together and randomly shuffling (assigning) 65 of these values to group1 and the rest to group2.

Using this shuffling scheme, we will get an average difference between the two groups around zero, and the spread of the values we get will tell us how extreme (i.e., unlikely) a value of 7.93 or larger would be under the null hypothesis.

Let’s randomly shuffle the data across the two groups and compute the difference in means 100,000 times. (this may take a moment)

N = 100000

np.random.seed(12345)

# Let's pool the marks together
a = np.concatenate((group1, group2))

i = group1.size

L = []
for _ in range(N):
    # shuffle the data using random permutation (most useless code comment ever!)
    shuffle = np.random.permutation(a)
    
    # split the shuffled data into 2 groups
    grp1, grp2 = shuffle[:i], shuffle[i:]
    
    # compute the difference in means
    L.append(np.mean(grp1) - np.mean(grp2))
    
L = np.array(L)

Let’s plot a histogram of the results.

plt.figure(figsize=(12, 6))
plt.hist(L, bins=50, normed=False, edgecolor='w')
plt.title('Difference between group means', fontsize=18)
plt.axvline(x=diff, ymin=0, ymax=1, color='r')
plt.annotate("Observed\ndifference\n($7.93$)", xy=(8.5, 5000), color='r', fontsize=14)
plt.xlabel("difference in means", fontsize=20)
plt.xticks(fontsize=14)
plt.ylabel("count", fontsize=20)
plt.yticks(fontsize=14)
plt.show()

png

On the histogram, we see that the observed difference is quite far from the mode of the distribution.

In other words, it appears that a difference of 7.93 or more (in magnitude) does not occur very often. Let’s quantify this.

Proportion of simulated trials where the (absolute value of the) difference exceeds the observed difference.

pSim = np.mean((np.abs(L) > diff))
print(f'Probability that the difference at least as extreme as {diff:0.2f} is due to chance (Simulation): {pSim*100:.2f}%')
Probability that the difference at least as extreme as 7.93 is due to chance (Simulation): 0.58%
This is not too bad considering the true result is 0.60%.

This result means that if we assume that the two groups are sampled from the same population of students, the probability of observing a difference in means of at least 7.93 between the group just by random chance is only around 0.6%.

It is quite typical for the threshold for statistical significance to be set at 5%. Therefore, in this case, we’d conclude that the difference between the two groups is statistically significant. In other words, the teaching method has an impact on the marks. You might want to put that leotard away, stop the gyrations cause that ain’t dancing Sally!

Seinfeld's Magnificent Seven

In this post I want to take a look at a classic data set: the US baby names.

While the origin of many names can be found in popular culture and (what's the alternative to popular culture? impopular culture?) elsewhere, we shall focus on how the TV show Seinfeld may have influenced the naming of hundreds of innocent babies.

We'll roughly follow the steps outlined by Wes McKinney in his excellent book Python for Data Analysis to load the data.

Let's first start by importing a couple of modules.

from math import *
import numpy as np
import pandas as pd

import os

from pathlib import Path

%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib as mpl
plt.style.use('seaborn-white')

from IPython.core.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

A. Loading the data

The dataset can be downloaded from http://www.ssa.gov/oact/babynames/limits.html.

(Follow the link and download the dataset in the datasets directory, in a directory called names, and unzip it).

The dataset contains the top 1000 most popular names starting in 1880.

The data for each year is in its own text file. For instance, the data for 1880 is in the file yob1880.txt (which should be located in the datasets/names/ folder.)

The first thing to do is merge the data for all the different years into a single dataframe, however, as a warm up, let’s look at the data for a single year.

input_path = Path('data/names/yob1880.txt')
!head {input_path}
Mary,F,7065
Anna,F,2604
Emma,F,2003
Elizabeth,F,1939
Minnie,F,1746
Margaret,F,1578
Ida,F,1472
Alice,F,1414
Bertha,F,1320
Sarah,F,1288
names1880 = pd.read_csv(input_path, names=['name', 'sex', 'births'])
names1880.head()
name sex births
0 Mary F 7065
1 Anna F 2604
2 Emma F 2003
3 Elizabeth F 1939
4 Minnie F 1746

Let’s try to automate the loading of all the files.

First we need a list of all the files in our directory.

A Path object has a glob method that returns a generator with the elements within a given directory.

dataset_dir = Path('data/names')
files = [item.name for item in sorted(dataset_dir.glob("*.txt"))]
# Extract a list of years
years = [item[3:7] for item in files]

We gather the data into a single dataset in two steps:

  1. We create a list of dataframes where each element corresponds to a particular year.
  2. We concatenate all the data (why can’t we use merge here?) into a single dataframe.
L = []
for file, year in zip(files, years):
    path = Path(dataset_dir, file)
    df = pd.read_csv(path, names=['name', 'sex', 'births'])
    df['year'] = year
    L.append(df)
    
names = pd.concat(L)    
names.head()
name sex births year
0 Mary F 7065 1880
1 Anna F 2604 1880
2 Emma F 2003 1880
3 Elizabeth F 1939 1880
4 Minnie F 1746 1880

It's always a good idea to check the data types of our features (columns). Don't assume that the data you downloaded is in the right format.

If we fix any issue early, it'll make the analysis a lot easier.

names.dtypes
name      object
sex       object
births     int64
year      object
dtype: object

The year feature is of type object which is Python’s way of telling us it’s got a generic datatype.

We’d like it to be a datetime type so let’s fix that.

Pandas has several features that make manipulating date and timestamp objects easy.

(We’ll also change the type of sex to category while we’re at it)

names['year'] = pd.to_datetime(names['year'])
names['sex'] = names.sex.astype('category')
names.head()
name sex births year
0 Mary F 7065 1880-01-01
1 Anna F 2604 1880-01-01
2 Emma F 2003 1880-01-01
3 Elizabeth F 1939 1880-01-01
4 Minnie F 1746 1880-01-01

Notice how our year column has been changed into a datetime object.

By default, since we only specified the year, the datetime object is set at time 00:00:00 on the 1st of January.

names.dtypes
name              object
sex             category
births             int64
year      datetime64[ns]
dtype: object

B. Exploring the data

So how big is our dataset?

names.count()
name      1924665
sex       1924665
births    1924665
year      1924665
dtype: int64

How is the data split on the sex of the baby?

names.sex.value_counts()
F    1138293
M     786372
Name: sex, dtype: int64

Let’s get percentages.

names.sex.value_counts(normalize=True).apply(lambda x: f'{x*100:.0f}%')
F    59%
M    41%
Name: sex, dtype: object

Let’s find the total number of births for each year, for both sexes.

total_births_per_year_per_sex = names.pivot_table('births', index='year', columns='sex', aggfunc=sum)
total_births_per_year_per_sex.head()
sex F M
year
1880-01-01 90993 110491
1881-01-01 91953 100743
1882-01-01 107847 113686
1883-01-01 112319 104627
1884-01-01 129020 114442

Note:

The pivot_table method is just a convenient shortcut for a groupby operation.

names.groupby(['year', 'sex']).births.sum().unstack().head()
sex F M
year
1880-01-01 90993 110491
1881-01-01 91953 100743
1882-01-01 107847 113686
1883-01-01 112319 104627
1884-01-01 129020 114442
total_births_per_year_per_sex.plot(figsize=(12, 6), 
                                   title="Total births by sex and year", 
                                   lw=3, 
                                   color=('r', 'g'), 
                                   alpha=0.5);

png

How many unique names are there for each year and gender?

unique_names = names.groupby(['year', 'sex'])['name'].nunique().unstack()
unique_names.plot(figsize=(12, 6), title="Unique names by sex and year", lw=3, color=('r', 'g'), alpha=0.5);

png

How many names from 1880 were still used in the most recent year we have, and which ones have fallen into desuetude?

last_year = names.year.dt.year.max()

s1880 = set(names[names.year.dt.year == 1880]['name'].unique())
slast = set(names[names.year.dt.year == last_year]['name'].unique())

print(f"""
There were {len(s1880)} distinct names recorded in 1880
There were {len(slast)} distinct names recorded in {last_year}\n
{len(slast.intersection(s1880))} names were found in both years
{len(s1880.difference(slast))} names found in 1880 were not recorded in {last_year}
{len(slast.difference(s1880))} names found in {last_year} were not recorded in 1880
""")

print(f"The names recorded in 1880 but no longer in use in {last_year} are:\n\n", 
      sorted([item for item in s1880.difference(slast)]))
There were 1889 distinct names recorded in 1880
There were 29910 distinct names recorded in 2017

1501 names were found in both years
388 names found in 1880 were not recorded in 2017
28409 names found in 2017 were not recorded in 1880

The names recorded in 1880 but no longer in use in 2017 are:

 ['Ab', 'Adelbert', 'Adline', 'Adolf', 'Adolphus', 'Agustus', 'Albertina', 'Albertine', 'Albertus', 'Alcide', 'Alda', 'Alf', 'Algie', 'Almeda', 'Almer', 'Almon', 'Alonza', 'Alpheus', 'Altha', 'Alvah', 'Alvena', 'Alverta', 'Alvia', 'Anner', 'Annis', 'Araminta', 'Arch', 'Ardelia', 'Ardella', 'Arminta', 'Arther', 'Arvid', 'Arvilla', 'Asberry', 'Asbury', 'Aurthur', 'Auther', 'Author', 'Authur', 'Babe', 'Ballard', 'Bama', 'Barnett', 'Barney', 'Bart', 'Bedford', 'Bena', 'Benjaman', 'Benjamine', 'Bertie', 'Berton', 'Bertram', 'Bess', 'Besse', 'Bettie', 'Bird', 'Birt', 'Birtha', 'Birtie', 'Blanch', 'Budd', 'Buford', 'Bulah', 'Burley', 'Burr', 'Butler', 'Byrd', 'Caddie', 'Celie', 'Ceylon', 'Chalmers', 'Chancy', 'Charlotta', 'Chas', 'Chin', 'Christena', 'Cicero', 'Cinda', 'Clarance', 'Clarinda', 'Classie', 'Claud', 'Claudie', 'Claudine', 'Claus', 'Clemie', 'Clemmie', 'Cloyd', 'Clyda', 'Colonel', 'Commodore', 'Concepcion', 'Corda', 'Cordella', 'Cordia', 'Cordie', 'Corine', 'Cornelious', 'Creola', 'Dee', 'Dellar', 'Dewitt', 'Dicie', 'Dick', 'Dillard', 'Dillie', 'Docia', 'Dock', 'Doctor', 'Dolphus', 'Donie', 'Dosha', 'Doshia', 'Doshie', 'Dow', 'Drucilla', 'Drusilla', 'Duff', 'Earle', 'Early', 'Edd', 'Edmonia', 'Ednah', 'Edyth', 'Effa', 'Eldora', 'Electa', 'Elick', 'Eliga', 'Eligah', 'Elige', 'Elizebeth', 'Ellar', 'Elmire', 'Elmo', 'Elmore', 'Elzie', 'Emmer', 'Eola', 'Ephriam', 'Erasmus', 'Erastus', 'Ernst', 'Estell', 'Estie', 'Etha', 'Ethelyn', 'Ethyl', 'Etna', 'Etter', 'Fayette', 'Ferd', 'Finis', 'Firman', 'Flem', 'Fleming', 'Florance', 'Florida', 'Flossie', 'Floy', 'Freda', 'Friend', 'Frona', 'Fronie', 'Fronnie', 'Garfield', 'Gee', 'Gorge', 'Gottlieb', 'Green', 'Guss', 'Gussie', 'Gust', 'Gustavus', 'Hamp', 'Handy', 'Hardie', 'Harl', 'Harrie', 'Harriette', 'Harve', 'Hassie', 'Hedwig', 'Helma', 'Hence', 'Hermann', 'Hermine', 'Hessie', 'Hester', 'Hettie', 'Hillard', 'Hilliard', 'Hilma', 'Holmes', 'Hortense', 'Hosteen', 'Hoy', 'Hulda', 'Huldah', 'Hyman', 'Icy', 'Idell', 'Irven', 'Isham', 'Izora', 'Jennette', 'Josiephine', 'Junious', 'Kittie', 'Kizzie', 'Knute', 'Lafe', 'Lavenia', 'Lavonia', 'Lawyer', 'Leatha', 'Leitha', 'Lem', 'Lemma', 'Lessie', 'Letha', 'Letitia', 'Letta', 'Lew', 'Lewie', 'Liddie', 'Lidie', 'Lige', 'Liller', 'Lillis', 'Lissie', 'Littie', 'Lollie', 'Loma', 'Lon', 'Lonie', 'Lotta', 'Louetta', 'Loula', 'Louvenia', 'Lovisa', 'Ludie', 'Lue', 'Lugenia', 'Lular', 'Lulie', 'Lum', 'Lutie', 'Luvenia', 'Madge', 'Madie', 'Madora', 'Mal', 'Malvina', 'Mammie', 'Manda', 'Manerva', 'Manervia', 'Manford', 'Manie', 'Manley', 'Manly', 'Margaretta', 'Margeret', 'Mart', 'Mat', 'Math', 'Matie', 'Maymie', 'Media', 'Melville', 'Mertie', 'Merton', 'Meta', 'Mettie', 'Mignon', 'Milford', 'Mima', 'Minda', 'Miner', 'Minervia', 'Minor', 'Minta', 'Mintie', 'Missouri', 'Mittie', 'Mont', 'Mortimer', 'Morton', 'Mozella', 'Myrta', 'Myrtie', 'Myrtle', 'Nan', 'Nannie', 'Nat', 'Nealie', 'Neppie', 'Nimrod', 'Nolie', 'Nonie', 'Norval', 'Ocie', 'Octave', 'Oda', 'Olie', 'Omie', 'Onie', 'Oral', 'Orie', 'Orilla', 'Orley', 'Orpha', 'Orrie', 'Orval', 'Osie', 'Ossie', 'Otelia', 'Otha', 'Otho', 'Ottie', 'Pansy', 'Paralee', 'Pat', 'Pattie', 'Pearlie', 'Perley', 'Permelia', 'Pink', 'Pinkey', 'Pinkie', 'Pinkney', 'Pleas', 'Pleasant', 'Ples', 'Pollie', 'Press', 'Redden', 'Rella', 'Reta', 'Rice', 'Rillie', 'Roena', 'Rolla', 'Rosia', 'Rube', 'Sannie', 'Sim', 'Squire', 'Sudie', 'Sybilla', 'Sylvanus', 'Tella', 'Tempie', 'Tennie', 'Teressa', 'Theda', 'Theophile', 'Thursa', 'Tilmon', 'Tishie', 'Tobe', 'Ula', 'Vannie', 'Venie', 'Verdie', 'Vern', 'Verne', 'Vernie', 'Vertie', 'Viney', 'Virgie', 'Volney', 'Wash', 'Watt', 'Weaver', 'Wilburn', 'Wilda', 'Wilhelmine', 'Willam', 'Wm', 'Wong', 'Wood', 'Woodie', 'Worthy', 'Yee', 'Yetta', 'Zilpha']

C. Data Exploration

Let's extract the names that were given to both male and female babies and put it in a list named both.

s1 = set(names[names.sex == 'F']['name'])
s2 = set(names[names.sex == 'M']['name'])

both = list(s1.intersection(s2))
print(f"{len(both)} names in our dataset were given to both boys and girls. \n\n Here's a sample:\n")

print(" | ".join(list(both)[:100]))
10663 names in our dataset were given to both boys and girls. 

 Here's a sample:

Ovis | Honor | Honore | Brandyn | Dorcus | Michel | Million | Donavan | Kemonie | Alie | Shalom | Sagan | Maeson | Kharis | Tenny | Daron | Madeline | Torey | Dandy | Nasir | Lavorn | Tracey | Ali | Kanya | Huey | Hershel | Kiren | Unnamed | Mose | Dakari | Navid | Krystian | Koi | Giavonni | Justine | Kamar | Derrell | Rajdeep | Leyton | Mong | Seattle | Vader | Anias | Camaree | Zimri | Khamauri | Jireh | Choyce | Cleo | Marki | Keontay | Shakari | Amarie | Jrue | Refugia | Luster | Levorn | Dillon | Joie | Dalas | Son | Graham | Samie | Larson | Kylynn | Kearney | Dejah | Teegan | Jahzel | Brunell | Delijah | Asiyah | Taelin | Rafa | Edward | Essie | Kenzie | Kwan | Honour | Britney | Shirl | Dara | Berthal | Keymoni | Madelyn | Dorey | Trinell | Maury | Skyylar | Courtnay | Lyncoln | Modie | Octavious | Jaice | Snow | Domique | Eva | Jeremy | Kobee | Sae

Upon a more careful examination of the whole output, some interesting observations can be made.

For instance I would expect names like Lucille to be only given to girls (or B.B. King's guitar), but according to our result, these names were also given to boys.

On the other hand, while to me the name Basil may evoke a cantankerous hotel manager in Torquay, apparently, some parents thought it the perfect moniker for their baby girl. Similarly, the name Sylvester was given to both boys and girls.

Let's have a closer look.

names[names.name == 'Lucille'].head()
name sex births year
248 Lucille F 40 1880-01-01
223 Lucille F 48 1881-01-01
170 Lucille F 85 1882-01-01
220 Lucille F 66 1883-01-01
188 Lucille F 94 1884-01-01

In how many years does the name Lucille appear for either boys or girls?

names[names.name == 'Lucille']['sex'].value_counts()
F    138
M     49
Name: sex, dtype: int64

Note:

Note that this is not the number of babies of each sex with that name.

We've grouped the data by year and sex so this count corresponds to the number of years in which at least five babies of either gender were named Lucille (5 being the yearly threshold beyond which a name isn't recorded in the data set).

We can compute the same information for other names.

names[names.name == 'Basil']['sex'].value_counts()
M    138
F     13
Name: sex, dtype: int64
names[names.name == 'Sylvester']['sex'].value_counts()
M    138
F     62
Name: sex, dtype: int64
names[names.name == 'Hannah']['sex'].value_counts()
F    138
M     38
Name: sex, dtype: int64

Let’s get the total number of babies named Basil for any given year?

basil = names[names.name == 'Basil'].pivot_table('births', 
                                                 index='year', 
                                                 columns='sex', 
                                                 fill_value=0, 
                                                 aggfunc=np.sum )
basil.head()
sex F M
year
1880-01-01 0 11
1881-01-01 0 9
1882-01-01 0 12
1883-01-01 0 8
1884-01-01 0 11

Let’s filter out the years where no girl was named Basil.

basil[basil.F != 0]
sex F M
year
1919-01-01 7 262
1927-01-01 5 249
2006-01-01 7 64
2008-01-01 5 31
2009-01-01 7 56
2010-01-01 11 47
2011-01-01 7 44
2012-01-01 20 45
2013-01-01 21 56
2014-01-01 17 44
2015-01-01 16 52
2016-01-01 22 60
2017-01-01 24 58

Note that we have 13 years, in agreement with the output of value_counts above.

Let’s visualise this for a couple of names. I’m using a log scale because otherwise only the data for the dominant gender would be visible.

basil.plot(figsize=(12, 6), logy=True, marker='o', lw=0, 
           color=('red', 'green'), title='Basil', 
           alpha=0.5);

png

We can daisy chain the name selection and the plotting.

(names[names.name == 'Christina']
 .pivot_table('births', 
              index='year', 
              columns='sex', 
              fill_value=0, 
              aggfunc='sum')
 .plot(figsize=(12, 6),
       logy=True, marker='o', lw=0,
       color=('r', 'g'), 
       title='Christina',
       alpha=0.5)
);

png

Let’s put that code into a function to make it more convenient and follow the DRY principle.

def timeline_name(name='Basil', use_log=True):
    ax = (names[names.name == name]
            .pivot_table('births', 
                         index='year', 
                         columns='sex', 
                         fill_value=0, 
                         aggfunc=np.sum)
            .plot(figsize=(10, 6),
                  logy=use_log, 
                  marker='o', 
                  lw=1,
                  color=('r', 'g'), 
                  alpha=0.5)
    )
    ax.set_xlabel('year', fontsize=14)
    ax.set_ylabel('count', fontsize=14)
    ax.set_title(name, fontsize=16)
timeline_name('Lucille')

png

timeline_name('Hannah')

png

D. The Seinfeld effect?

Note that a plausible explanation of why the dominant gender for a given name may have changed over the years can sometimes be found relatively easily, as in the case of the name Ashley...

For other names, a change in popularity may be due to a less conventional reason...

In an episode of Seinfeld George mentions to his friends that should he ever have a kid, he'd name him (or her) Seven, after the jersey number of Yankee baseball player Mickey Mantle.

First, let’s check whether “Seven” is in our data set.

'Seven' in names.name.unique()
True

Giddy up!

The Seinfeld episode was aired in 1996.

Let’s see if there’s any indication that it may have influenced the popularity of the name “Seven”.

First let’s create a date range over which to plot the data.

seinfeldYear = '1996'
start = '1991'
end   = '2018'

# Let's create a range of dates from start to end.
date_range = pd.date_range(start, end, freq='A-JAN')
date_range
DatetimeIndex(['1991-01-31', '1992-01-31', '1993-01-31', '1994-01-31',
               '1995-01-31', '1996-01-31', '1997-01-31', '1998-01-31',
               '1999-01-31', '2000-01-31', '2001-01-31', '2002-01-31',
               '2003-01-31', '2004-01-31', '2005-01-31', '2006-01-31',
               '2007-01-31', '2008-01-31', '2009-01-31', '2010-01-31',
               '2011-01-31', '2012-01-31', '2013-01-31', '2014-01-31',
               '2015-01-31', '2016-01-31', '2017-01-31'],
              dtype='datetime64[ns]', freq='A-JAN')

Let’s visualise the trend for “Seven”.

# The data
data = (names[names.name == 'Seven']
            .pivot_table('births', 
                         index='year', 
                         columns='sex', 
                         fill_value=0, 
                         aggfunc=np.sum)
       )

# The base plot
ax = data.plot(figsize=(12, 8), 
          logy=False, 
          marker='o', 
          color=('r', 'g'), 
          alpha=0.5, 
          title='Seven', 
          grid=False)

# The vertical strip
ax.axvline(seinfeldYear, ymin=0, ymax=1000, lw=15, color='orange', alpha=0.5)
ax.set_xticks(date_range)
ax.set_xlim([start, end])

# Ensure that the labels on the x axis match the years.
ax.set_xticklabels(date_range.year, rotation=0, ha='center' )

# Annotate the figure with some text by specifying location using xy parameter
ax.annotate('Episode "The Seven" is aired', 
            xy=(pd.to_datetime(seinfeldYear, format="%Y"), 100),
            xycoords='data',
            rotation=90,
            horizontalalignment='center'
            )

plt.show()

png

Note that this does not prove that the show "created" the name fad (correlation ≠ causation and all that; there could be another reason behind both the increase in popularity of the name, and the use of the name in the show), but it does seem to indicate that it enhanced it...

Poor kids! Serenity now!

E. Mom & Pop culture

While we’re on this topic, can we find other possible influences of pop culture in the data set?

Let’s create a general function that we can use to plot various trends.

def plot_name_event(data = names, name="Seven",
                    year='1996', start='1968', end='2013', 
                    event = '', freq='A-JAN'):    
    
    date_range   = pd.date_range(start, end, freq=freq)
    
    data = (names[names.name == name]
            .pivot_table('births', 
                         index='year', 
                         columns='sex', 
                         fill_value=0, 
                         aggfunc='sum')
           )
    ax = data.plot(figsize=(14, 6), 
                   logy=False, 
                   marker='o', 
                   color=('r', 'g'), 
                   alpha=0.5, 
                   title = f"Name: {name} | {event} in {year}", 
                   grid=False)
    
    ax.axvline(year, ymin=0, ymax=data.max().max(), lw=15, color='orange', alpha=0.5)
    ax.set_xticks(date_range)
    ax.set_xlim([start, end])
    ax.set_ylim([0, data.loc[data.index <= date_range[-1]].max().max()*1.1])

    ax.set_xticklabels(date_range.year, rotation=90, ha='center' )
    plt.show()

First, let’s look at a few movie characters/actors.

plot_name_event(name="Neo", year='1999', start='1990', 
                event='The movie The Matrix is released')

png

plot_name_event(name="Trinity", year='1999', start='1990', 
                event='The movie The Matrix is released')

png

How about further back in time?

Errol Flynn got his big break in Captain Blood back in 1935. Let’s see if all that swashbuckling influenced parents when it came to naming their progeny.

plot_name_event(name="Errol",  year='1935', start='1931', end='1961', 
                event='The movie Captain Blood is released')

png

Another possible Hollywood influence around the same time.

plot_name_event(name="Hedy",  year='1938', start='1915', end='1971', 
                event="The actress Hedy Lamarr made her Hollywood debut")

png

Earlier still

plot_name_event(name="Greta",  year='1925', start='1910', end='1981', 
                event="The actress Greta Garbo made her Hollywood debut")

png

Of course, we can’t talk about movies without talking about Star Wars.

Let’s see if we can track the names of some of the characters.

plot_name_event(name="Leia",  year='1977', start='1970', end='1995', 
                event='The movie Star Wars is released')

png

plot_name_event(name="Han",  year='1977', start='1970', end='2015', 
                event='The movie Star Wars is released')

png

plot_name_event(name="Lando",  year='1977', start='1970', end='2015', 
                event='The movie Star Wars is released')

png

Hmmmm… A bit surprising.

What about other characters from the trilogy?

Fortunately (for the kids) there doesn’t seem to be any Jabba or Jar Jar in our list… (TBH, I’m mildly disappointed).

characters = {"jabba", "yoda", "chewbacca", "jar jar"}

set(names.name.str.lower().unique()) & characters
set()

Let’s look at the main character of another popular movie series.

plot_name_event(name="Indiana",  year='1981', start='1970', end='2015', 
                event="The movie Raiders Of The Lost Ark was released")

png

While on this topic…

plot_name_event(name="Harrison",  year='1981', start='1975', end='2015', 
                event="The movie Raiders Of The Lost Ark was released")

png

In a different genre we also have:

plot_name_event(name="Clint",  year='1966', start='1950', end='1981', 
                event="The movie The Good, the Bad, and the Ugly was released")

png

We’ve focused on movies, but of course, popular culture’s influence on baby names isn’t limited to movies or television.

Songs or singers can also contribute to the popularity of a given name.

Here are a couple of examples.

plot_name_event(name="Elvis",  year='1955', start='1945', end='1991', 
                event="Radios start playing Elvis Presley's songs")

png

plot_name_event(name="Michelle",  year='1965', start='1960', end='1981', 
                event='The song Michelle is released by The Beatles')

png

plot_name_event(name="Jermaine",  year='1970', start='1960', end='1990', 
                event="The Jackson 5 top the charts")

png

… And a couple of more recent ones

plot_name_event(name="Beyonce",  year='1998', start='1995', end='2015', 
                event="The group Destiny's Child released its debut album")

png

Same with sport.

plot_name_event(name="Kobe",  year='1996', start='1988', end='2015',
                event="NBA player Kobe Bryant made his debut in the league")

png

I’ll stop here, but I’m sure many other interesting patterns can be found in this data set…

As I mentioned earlier, the fact that a name’s increase in popularity coincides with a particular event isn’t enough to demonstrate a causal relationship, however for some of these examples the coincidence is certainly interesting.

In any case, I’d like to think that hundreds, if not thousands, of people owe their moniker to George Costanza.

A sobering thought.

That’s ok. It could have been worse…

"Mulva" in names.name.unique()
False

I had to check!