Polyglot Data Wrangling

SQL vs Python vs R

In this post we will look at how basic data wrangling operations can be performed in three languages commonly used in data science:

  • Structured Query Language (SQL)
  • Python
  • R

Note that while Python and R have obvious similarities in their approach to data wrangling, SQL is designed to work with relational data, where most of the time-consuming operations are (ideally) performed on the database side rather than in the client’s memory. Both Python and R can process data in chunks, or even work with huge, distributed data sets through libraries like Dask (Python only) or Apache Spark (Python, R, Scala, Java). However, we’re mostly interested in illustrating some basic data wrangling pipelines in all three languages, so we’ll be using a small database that can easily fit in memory.
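As a tiny, purely illustrative sketch of what chunked processing looks like in pandas (not needed for our small database; the connection string simply mirrors the one we set up below):

# Stream a query in chunks instead of loading the whole table at once.
# Any aggregation that can be updated chunk by chunk works this way.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgres:///dvdrental')
total = 0.0
for chunk in pd.read_sql_query('SELECT amount FROM payment;', engine, chunksize=10000):
    total += chunk.amount.sum()
print(total)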

The Jupyter notebook allows us to run all our examples in one place. We’ll run an IPython kernel as our base platform, use an IPython magic command to run our R code within the same environment, and use sqlalchemy to run queries against a PostgreSQL database.

We won’t be covering the basics of each language in this post. There are many excellent resources online. For data wrangling the Python Data Science Handbook is an outstanding book. For R it’s hard to get past the wonderful R for Data Science by Garrett Grolemund and Hadley Wickham. For SQL there are more resources than we can mention, especially once we factor in the various vendor-specific flavours of SQL. One of my personal favourites is SQL Fundamentals for BI by Jeffrey James. Our SQL examples are based on some of the contents of that course.

We start by loading the required Python libraries (we’ll do all our data wrangling using the awesome pandas module) and creating a connection to the PostgreSQL database using sqlalchemy. For our examples we’ll be using the well-known dvdrental database, which contains 15 tables representing various aspects of a DVD rental business.

# Connect to a SQL (PostgreSQL) database
from sqlalchemy import create_engine
db = 'postgres:///dvdrental'   # note: newer versions of sqlalchemy expect the 'postgresql:///...' prefix
con = create_engine(db, echo=False)

# Import pandas for data wrangling in python
import pandas as pd
import warnings
warnings.simplefilter('ignore')

# use ipython magic command to load R into our environment.
%load_ext rpy2.ipython

from IPython.core.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

All our R code will be preceded by the %%R magic command so that it runs on the R interpreter from within our Jupyter notebook. We start by loading the tidyverse library (which is in fact a collection of libraries), and setting an option to improve how some outputs are rendered in the notebook.

%%R
library(tidyverse)
options(crayon.enabled = FALSE)

Because we’re running on top of an IPython kernel, we’ll write our SQL queries as strings and use pandas to run them on our database via sqlalchemy.


The data

First, let’s see what tables are in our DVD rental database.

q = '''
SELECT * FROM pg_catalog.pg_tables;
'''
df = pd.read_sql_query(q, con)
tables = df[~(df.tablename.str.startswith('pg_')|df.tablename.str.startswith('sql_'))].tablename.values
print(tables)
['category' 'country' 'film_category' 'language' 'inventory' 'actor'
 'staff' 'payment' 'rental' 'store' 'city' 'film' 'address' 'film_actor'
 'customer']

For convenience, let’s create a Python dictionary with all the tables as dataframes.

Obviously, this means that we’re loading all the tables into memory, which is not what one would typically want to do with a SQL database. However, since our database is small and we’re only interested in comparing the wrangling syntax across languages, this is fine for our purpose.

tbl = {table: pd.read_sql_query(f'SELECT * FROM {table};', 
                                con, parse_dates=['payment_date']) 
       for table in tables}
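If you want to check that everything really does fit comfortably in memory, a quick (purely illustrative) measure of the footprint of the loaded tables could look like this:

# rough in-memory footprint of all the loaded tables, in megabytes
mb = sum(t.memory_usage(deep=True).sum() for t in tbl.values()) / 1e6
print(f'{mb:.1f} MB across {len(tbl)} tables')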

Let’s look at a few rows of the payment table.

payment = tbl['payment']
payment.head()
payment_id customer_id staff_id rental_id amount payment_date
0 17503 341 2 1520 7.99 2007-02-15 22:25:46.996577
1 17504 341 1 1778 1.99 2007-02-16 17:23:14.996577
2 17505 341 1 1849 7.99 2007-02-16 22:41:45.996577
3 17506 341 2 2829 2.99 2007-02-19 19:39:56.996577
4 17507 341 2 3130 7.99 2007-02-20 17:31:48.996577

We can “pass” the payment dataframe to R in the following way:

%%R -i payment
head(payment)
  payment_id customer_id staff_id rental_id amount        payment_date
0      17503         341        2      1520   7.99 2007-02-15 22:25:46
1      17504         341        1      1778   1.99 2007-02-16 17:23:14
2      17505         341        1      1849   7.99 2007-02-16 22:41:45
3      17506         341        2      2829   2.99 2007-02-19 19:39:56
4      17507         341        2      3130   7.99 2007-02-20 17:31:48
5      17508         341        1      3382   5.99 2007-02-21 12:33:49

Now that we’ve got all the tables loaded, let’s run a few queries.

Let’s start with something simple.


Which movie rentals were below a dollar for customers with an id above 500?

▶ SQL

The query is nothing fancy. We’re selecting a few features (columns) from the payment table, filtering on the dollar amount and customer id before ordering the rows.

q = '''
SELECT 
    p.payment_id, p.customer_id, p.amount 
FROM 
    payment p
WHERE 
    p.amount < 1 AND p.customer_id > 500
ORDER BY 1 ASC, 2 DESC, 3 ASC;
'''
df_sql = pd.read_sql_query(q, con)     # run the query on the database
print(df_sql.shape)                    # output the shape of the resulting table
df_sql.head()                          # display the first few rows
(413, 3)
payment_id customer_id amount
0 18121 502 0.99
1 18122 502 0.99
2 18125 503 0.99
3 18134 506 0.99
4 18144 510 0.99

▶ Python

Pandas allows us to chain methods on a dataframe; however, writing everything on a single line is cumbersome for long pipelines.

We can write each step of the pipeline on its own line by wrapping the whole pipeline in parentheses.

df = (payment[(payment.amount < 1) & (payment.customer_id > 500)]  # select rows based on conditions
         .loc[:, ['payment_id','customer_id','amount']]            # select 3 columns
         .sort_values(by=['payment_id', 'customer_id', 'amount'],  # sort the rows according to the values  
                      ascending=[True, False, True])               #      in the columns
         .reset_index(drop=True)                                   # not really required but looks nicer
        )
print(df.shape)
df.head()
(413, 3)
payment_id customer_id amount
0 18121 502 0.99
1 18122 502 0.99
2 18125 503 0.99
3 18134 506 0.99
4 18144 510 0.99

▶ R

For this simple query the R code is similar to the Python one. The pipe symbol %>% plays the role of the dot in pandas, allowing us to chain operations together.

The %%R -o df_r part of the code is a way to transfer the R dataframe back to our Python interpreter.

%%R -o df_r
df_r = payment %>% 
        filter(amount < 1, customer_id > 500) %>% 
        select(payment_id, customer_id, amount) %>% 
        arrange(payment_id, desc(customer_id), amount)

# back in Python, where the -o flag has exported df_r as a pandas dataframe
df_r.shape
df_r.head()
payment_id customer_id amount
0 18121 502 0.99
1 18122 502 0.99
2 18125 503 0.99
3 18134 506 0.99
4 18144 510 0.99

Sanity check

Just to make sure that we’ve run our query correctly, we can compare the result of the SQL query with the results of the Python and R pipelines. Because R indexing starts at 1 instead of 0, we need to reset the index of the R dataframe after we export it to Python.
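That housekeeping step is a one-liner (a small sketch, assuming df_r is the dataframe exported from the %%R cell above):

# align the 1-based index coming from R with the 0-based pandas/SQL results
df_r = df_r.reset_index(drop=True)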

all(df==df_sql), all(df_r==df_sql)
(True, True)

Any other trivial query could be handled in a similar way…

Let’s look at a query where we need to extract a feature that doesn’t appear explicitly in the original data.


Which transactions occurred in the month of March only?

The payment table has a payment_date feature from which we will need to extract the month.

▶ SQL

We can extract the month value out of the payment_date column using the EXTRACT function.

q = '''
SELECT 
    p.payment_id, p.customer_id cust_id, p.amount, p.payment_date
FROM 
    payment p
WHERE 
    EXTRACT(month FROM p.payment_date) = 3
    AND p.amount < 1
ORDER BY 
    cust_id DESC, 3 ASC;
'''
df_sql = pd.read_sql_query(q, con)
print(df_sql.shape)
df_sql.head()
(1021, 4)
payment_id cust_id amount payment_date
0 22611 598 0.99 2007-03-02 15:51:25.996577
1 22613 598 0.99 2007-03-21 07:09:41.996577
2 22606 597 0.99 2007-03-19 05:51:32.996577
3 22589 596 0.99 2007-03-01 20:50:37.996577
4 22598 596 0.99 2007-03-22 21:49:07.996577

▶ Python

When we loaded the payment table from the database using sqlalchemy, the payment_date feature was correctly parsed as a datetime object.

This allows us to use the pandas dt accessor to extract a variety of date and time related features, including the month.

df = (payment[(payment.amount < 1) & (payment.payment_date.dt.month==3)]
       .rename({'customer_id': 'cust_id'}, axis=1)
       .reindex(['payment_id','cust_id','amount','payment_date'], axis=1)
       .sort_values(['cust_id', 'amount'], ascending=[False, True])
       .reset_index(drop=True)
     )
      
print(df.shape)
df.head()
(1021, 4)
payment_id cust_id amount payment_date
0 22611 598 0.99 2007-03-02 15:51:25.996577
1 22613 598 0.99 2007-03-21 07:09:41.996577
2 22606 597 0.99 2007-03-19 05:51:32.996577
3 22589 596 0.99 2007-03-01 20:50:37.996577
4 22592 596 0.99 2007-03-17 15:08:26.996577

▶ R

A great way to handle datetime objects in R is the lubridate library. To make it clear when we’re using lubridate, we won’t load the library; instead, we’ll call its functions using the lubridate:: prefix.

%%R -o df_r
df_r = payment %>% 
        filter(amount < 1, lubridate::month(payment$payment_date)==3) %>% 
        rename(cust_id=customer_id) %>% 
        select(-rental_id, -staff_id) %>% 
        arrange(desc(cust_id), amount)

# back in Python
print(df_r.shape)
df_r.head()
(1021, 4)
payment_id cust_id amount payment_date
0 22611 598 0.99 2007-03-02 15:51:25+11:00
1 22613 598 0.99 2007-03-21 07:09:41+11:00
2 22606 597 0.99 2007-03-19 05:51:32+11:00
3 22589 596 0.99 2007-03-01 20:50:37+11:00
4 22592 596 0.99 2007-03-17 15:08:26+11:00

Here we get a glimpse into the infinite joy of working with datetime objects.

During the transfer from Python to R and back to Python, our payment_date format has changed slightly.

Instead of worrying about fixing this we’ll just exclude that column when comparing the R result with the SQL/Python ones.

Henceforth, we won’t worry about this issue. We’ll simply check that the dataframe shapes match and that the rows are consistent.

all(df==df_sql), all(df.iloc[:, :-1]==df_r.iloc[:, :-1])
(True, True)

Which day saw the most business (largest number of transactions)?

This is a common operation. We need to group our payments by date, count how many transactions occurred on each date, and order the results accordingly.

▶ SQL

q = '''
SELECT 
    p.payment_date::date, COUNT(*)
FROM 
    payment p
GROUP BY 1
ORDER BY 2 DESC
'''
df_sql = pd.read_sql_query(q, con)
print(df_sql.shape)
df_sql
(32, 2)
payment_date count
0 2007-04-30 1311
1 2007-03-01 676
2 2007-03-21 673
3 2007-04-27 643
4 2007-04-29 640
5 2007-03-19 631
6 2007-04-28 627
7 2007-03-18 624
8 2007-03-22 621
9 2007-03-20 611
10 2007-03-02 595
11 2007-03-17 584
12 2007-03-23 557
13 2007-04-08 516
14 2007-04-09 514
15 2007-04-06 486
16 2007-04-10 482
17 2007-04-07 472
18 2007-04-11 468
19 2007-04-12 452
20 2007-02-19 310
21 2007-02-15 308
22 2007-02-18 302
23 2007-02-20 291
24 2007-02-17 283
25 2007-02-16 282
26 2007-02-21 213
27 2007-05-14 182
28 2007-04-26 79
29 2007-03-16 72
30 2007-04-05 64
31 2007-02-14 27

▶ Python

The Python pipeline mimics the SQL query rather closely.

df = (payment
       .groupby(payment.payment_date.dt.date)['amount']
       .count().rename('count')
       .sort_values(ascending=False)
       .reset_index()
     )
df
payment_date count
0 2007-04-30 1311
1 2007-03-01 676
2 2007-03-21 673
3 2007-04-27 643
4 2007-04-29 640
5 2007-03-19 631
6 2007-04-28 627
7 2007-03-18 624
8 2007-03-22 621
9 2007-03-20 611
10 2007-03-02 595
11 2007-03-17 584
12 2007-03-23 557
13 2007-04-08 516
14 2007-04-09 514
15 2007-04-06 486
16 2007-04-10 482
17 2007-04-07 472
18 2007-04-11 468
19 2007-04-12 452
20 2007-02-19 310
21 2007-02-15 308
22 2007-02-18 302
23 2007-02-20 291
24 2007-02-17 283
25 2007-02-16 282
26 2007-02-21 213
27 2007-05-14 182
28 2007-04-26 79
29 2007-03-16 72
30 2007-04-05 64
31 2007-02-14 27

▶ R

We follow the same pattern for our R code. The conversion to a dataframe at the end (using as.data.frame) is simply to get a nicer formatting of the output.

%%R 
df_r = payment %>% 
        group_by(payment_date=lubridate::date(payment$payment_date)) %>% 
        summarise(n = n()) %>% 
        arrange(desc(n)) %>%
        as.data.frame()
df_r
   payment_date    n
1    2007-04-30 1311
2    2007-03-01  676
3    2007-03-21  673
4    2007-04-27  643
5    2007-04-29  640
6    2007-03-19  631
7    2007-04-28  627
8    2007-03-18  624
9    2007-03-22  621
10   2007-03-20  611
11   2007-03-02  595
12   2007-03-17  584
13   2007-03-23  557
14   2007-04-08  516
15   2007-04-09  514
16   2007-04-06  486
17   2007-04-10  482
18   2007-04-07  472
19   2007-04-11  468
20   2007-04-12  452
21   2007-02-19  310
22   2007-02-15  308
23   2007-02-18  302
24   2007-02-20  291
25   2007-02-17  283
26   2007-02-16  282
27   2007-02-21  213
28   2007-05-14  182
29   2007-04-26   79
30   2007-03-16   72
31   2007-04-05   64
32   2007-02-14   27

How much did each customer spend and when?

We can answer this question in a number of ways. Let’s return a sequence of transaction dates for each customer. In SQL we can do that with an array.

▶ SQL

q = '''
SELECT 
    p.customer_id, 
    SUM(p.amount) total, 
    ARRAY_AGG(p.payment_date::date) dates
FROM 
    payment p
GROUP BY 1
ORDER BY 2 DESC
'''

df_sql = pd.read_sql_query(q, con)
print(df_sql.shape)
df_sql.head()
(599, 3)
customer_id total dates
0 148 211.55 [2007-02-15, 2007-02-15, 2007-02-19, 2007-02-1...
1 526 208.58 [2007-02-15, 2007-02-16, 2007-02-17, 2007-02-1...
2 178 194.61 [2007-02-15, 2007-02-15, 2007-02-16, 2007-02-1...
3 137 191.62 [2007-02-18, 2007-02-19, 2007-02-20, 2007-02-2...
4 144 189.60 [2007-02-16, 2007-02-17, 2007-02-19, 2007-02-2...

▶ Python

In Python we’ll gather the transaction dates for each customer as a list.

We can build that list using a lambda (anonymous) function during the aggregation step.

df = (payment
       .groupby('customer_id')[['amount', 'payment_date']]
       .agg({
               'amount':'sum',
               'payment_date': lambda x: list(x.dt.date)
            })
       .rename(columns={'amount':'total', 'payment_date': 'date'})
       .sort_values('total', ascending=False)
       .reset_index()
      )
print(df.shape)
df.head()
(599, 3)
customer_id total date
0 148 211.55 [2007-02-15, 2007-02-15, 2007-02-19, 2007-02-1...
1 526 208.58 [2007-02-15, 2007-02-16, 2007-02-17, 2007-02-1...
2 178 194.61 [2007-02-15, 2007-02-15, 2007-02-16, 2007-02-1...
3 137 191.62 [2007-02-18, 2007-02-19, 2007-02-20, 2007-02-2...
4 144 189.60 [2007-02-16, 2007-02-17, 2007-02-19, 2007-02-2...

▶ R

For this query the R code is similar to the Python one (syntactic idiosyncrasies aside).

%%R 
df_r = payment %>% 
      mutate(payment_date = as.Date(payment_date)) %>% 
      group_by(customer_id) %>% 
      summarise(total=sum(amount), dates=list(payment_date)) %>% 
      arrange(desc(total))

print(dim(df_r))
print(head(df_r, 5))
[1] 599   3
# A tibble: 5 x 3
  customer_id total dates      
        <int> <dbl> <list>     
1         148  212. <date [45]>
2         526  209. <date [42]>
3         178  195. <date [39]>
4         137  192. <date [38]>
5         144  190. <date [40]>

At first glance it looks like the results differ from the SQL/Python ones, but this is purely cosmetic. The R data structure (a tibble, actually) rounds the floating point values and abbreviates the lists when displaying the results. We could convert the tibble to an R dataframe, but the output would be unwieldy. Instead, we can unpack the first row to show that the data has been parsed correctly.

%%R
print(df_r$total[1]) 
print(df_r$dates[1])
[1] 211.55
[[1]]
 [1] "2007-02-15" "2007-02-15" "2007-02-19" "2007-02-19" "2007-02-19"
 [6] "2007-03-01" "2007-03-02" "2007-03-17" "2007-03-17" "2007-03-18"
[11] "2007-03-18" "2007-03-18" "2007-03-20" "2007-03-20" "2007-03-20"
[16] "2007-03-21" "2007-03-21" "2007-03-21" "2007-03-21" "2007-03-22"
[21] "2007-03-22" "2007-03-22" "2007-03-22" "2007-04-05" "2007-04-06"
[26] "2007-04-08" "2007-04-08" "2007-04-08" "2007-04-09" "2007-04-10"
[31] "2007-04-11" "2007-04-12" "2007-04-27" "2007-04-27" "2007-04-27"
[36] "2007-04-28" "2007-04-28" "2007-04-28" "2007-04-29" "2007-04-29"
[41] "2007-04-29" "2007-04-29" "2007-04-30" "2007-04-29" "2007-04-30"

Which days of March 2007 recorded no sale?

This may seem like the sort of query we’ve already looked at; however, there is a catch. The database only records transactions that have occurred (it seems obvious when said like that!). This means that dates with no sales aren’t in the database. Therefore we need to generate a series of dates (with a daily frequency) and join our payment data onto it to find the days with no sales.

In PostgreSQL we can use generate_series to do that. Note that we select from it as if it were a table (which we name gs). By joining on it we can “expand” our payment data to include the dates for which no transaction took place. These would of course contain NULL values for the transaction data. By identifying where those NULL rows are we can answer the original question.

▶ SQL

q = '''
SELECT 
    gs::date, p.*
FROM 
    generate_series('2007-03-01', '2007-03-31', INTERVAL '1 Day') gs
LEFT JOIN 
    payment p 
ON 
    p.payment_date::date = gs::date
WHERE 
    payment_date is NULL
'''
df_sql = pd.read_sql_query(q, con).fillna('')['gs']
print(df_sql.shape)
df_sql
(21,)
0     2007-03-03
1     2007-03-04
2     2007-03-05
3     2007-03-06
4     2007-03-07
5     2007-03-08
6     2007-03-09
7     2007-03-10
8     2007-03-11
9     2007-03-12
10    2007-03-13
11    2007-03-14
12    2007-03-15
13    2007-03-24
14    2007-03-25
15    2007-03-26
16    2007-03-27
17    2007-03-28
18    2007-03-29
19    2007-03-30
20    2007-03-31
Name: gs, dtype: object

▶ Python

We could do something similar in Python; however, let’s try a more pythonic approach and use set operations to extract the dates we want. We simply construct the set of all dates in March 2007 and take the difference with the dates that are already in the database. The moral of the story is that although the Python (and R) syntax can often look SQLish, this is not always the most concise, or even the most suitable, way to get to the answer.

gs = pd.date_range(start='20070301', end='20070331', freq='D')
df = pd.DataFrame(index = set(gs.date) - set(payment.payment_date.dt.date.values)).sort_index()
print(df.shape)
df
(21, 0)
2007-03-03
2007-03-04
2007-03-05
2007-03-06
2007-03-07
2007-03-08
2007-03-09
2007-03-10
2007-03-11
2007-03-12
2007-03-13
2007-03-14
2007-03-15
2007-03-24
2007-03-25
2007-03-26
2007-03-27
2007-03-28
2007-03-29
2007-03-30
2007-03-31

▶ R

Set operations also work in R. The output is a bit strange: we get a list of single-element dates. We could flatten the nested list, but it would only obfuscate the code, so we’ll leave it as is.

%%R
gs = seq(lubridate::ymd('2007-03-01'), lubridate::ymd('2007-03-31'), by = '1 day')
lapply(sort(setdiff(gs, lubridate::date(payment$payment_date))), as.Date, origin='1970-01-01')
[[1]]
[1] "2007-03-03"

[[2]]
[1] "2007-03-04"

[[3]]
[1] "2007-03-05"

[[4]]
[1] "2007-03-06"

[[5]]
[1] "2007-03-07"

[[6]]
[1] "2007-03-08"

[[7]]
[1] "2007-03-09"

[[8]]
[1] "2007-03-10"

[[9]]
[1] "2007-03-11"

[[10]]
[1] "2007-03-12"

[[11]]
[1] "2007-03-13"

[[12]]
[1] "2007-03-14"

[[13]]
[1] "2007-03-15"

[[14]]
[1] "2007-03-24"

[[15]]
[1] "2007-03-25"

[[16]]
[1] "2007-03-26"

[[17]]
[1] "2007-03-27"

[[18]]
[1] "2007-03-28"

[[19]]
[1] "2007-03-29"

[[20]]
[1] "2007-03-30"

[[21]]
[1] "2007-03-31"

How many transactions were there for each day of March 2007?

This is a simple question; we just need to keep in mind that some days had no transactions. We should still output those days with a count of zero. The basic idea is similar to the previous example: we create a time series to “pad out” the missing days in the payment_date feature.

▶ SQL

q = '''
SELECT 
    gs::date date, 
    COUNT(p.*) number_of_transactions
FROM 
    generate_series('2007-03-01', '2007-03-31', INTERVAL '1 Day') gs
LEFT JOIN 
    payment p 
ON 
    p.payment_date::date = gs::date 
GROUP BY 
    gs::date 
'''
df_sql = pd.read_sql_query(q, con)
print(df_sql.shape)
df_sql
(31, 2)
date number_of_transactions
0 2007-03-01 676
1 2007-03-02 595
2 2007-03-03 0
3 2007-03-04 0
4 2007-03-05 0
5 2007-03-06 0
6 2007-03-07 0
7 2007-03-08 0
8 2007-03-09 0
9 2007-03-10 0
10 2007-03-11 0
11 2007-03-12 0
12 2007-03-13 0
13 2007-03-14 0
14 2007-03-15 0
15 2007-03-16 72
16 2007-03-17 584
17 2007-03-18 624
18 2007-03-19 631
19 2007-03-20 611
20 2007-03-21 673
21 2007-03-22 621
22 2007-03-23 557
23 2007-03-24 0
24 2007-03-25 0
25 2007-03-26 0
26 2007-03-27 0
27 2007-03-28 0
28 2007-03-29 0
29 2007-03-30 0
30 2007-03-31 0

▶ Python

Similarly, in Python we can create the series of dates for March 2007 using pandas’ date_range function.

gs = pd.date_range(start='20070301', end='20070331', freq='D')
df = (payment
       .assign(payment_date=pd.to_datetime(payment.payment_date).dt.date)
       .groupby('payment_date')
       .agg({'amount':'count'})
       .reindex(gs, fill_value=0)
       .reset_index()
       .rename(columns={'index':'date','amount':'number_of_transactions'})
)

print(df.shape)
df
(31, 2)
date number_of_transactions
0 2007-03-01 676
1 2007-03-02 595
2 2007-03-03 0
3 2007-03-04 0
4 2007-03-05 0
5 2007-03-06 0
6 2007-03-07 0
7 2007-03-08 0
8 2007-03-09 0
9 2007-03-10 0
10 2007-03-11 0
11 2007-03-12 0
12 2007-03-13 0
13 2007-03-14 0
14 2007-03-15 0
15 2007-03-16 72
16 2007-03-17 584
17 2007-03-18 624
18 2007-03-19 631
19 2007-03-20 611
20 2007-03-21 673
21 2007-03-22 621
22 2007-03-23 557
23 2007-03-24 0
24 2007-03-25 0
25 2007-03-26 0
26 2007-03-27 0
27 2007-03-28 0
28 2007-03-29 0
29 2007-03-30 0
30 2007-03-31 0

▶ R

Once again, the pattern for R follows the same principles.

%%R 
gs = data_frame(payment_date=seq(lubridate::ymd('2007-03-01'), 
                                 lubridate::ymd('2007-03-31'), 
                                 by = '1 day'))

df_r = payment %>% 
      mutate(payment_date=lubridate::date(payment_date)) %>% 
      group_by(payment_date) %>% 
      summarise(number_of_transactions=n()) %>% 
      right_join(gs, by='payment_date') %>%
      replace_na(list(number_of_transactions=0))

as.data.frame(df_r)
   payment_date number_of_transactions
1    2007-03-01                    676
2    2007-03-02                    595
3    2007-03-03                      0
4    2007-03-04                      0
5    2007-03-05                      0
6    2007-03-06                      0
7    2007-03-07                      0
8    2007-03-08                      0
9    2007-03-09                      0
10   2007-03-10                      0
11   2007-03-11                      0
12   2007-03-12                      0
13   2007-03-13                      0
14   2007-03-14                      0
15   2007-03-15                      0
16   2007-03-16                     72
17   2007-03-17                    584
18   2007-03-18                    624
19   2007-03-19                    631
20   2007-03-20                    611
21   2007-03-21                    673
22   2007-03-22                    621
23   2007-03-23                    557
24   2007-03-24                      0
25   2007-03-25                      0
26   2007-03-26                      0
27   2007-03-27                      0
28   2007-03-28                      0
29   2007-03-29                      0
30   2007-03-30                      0
31   2007-03-31                      0

Days with no transactions in March 2007 - Alternate version

Let’s revisit the problem we previously solved using set operations in Python and R, and illustrate how we could answer the question with a more SQLish pipeline. We can reuse the code from the previous example; we just need to add a filtering step.

▶ SQL

q = '''
SELECT 
    gs::date date, COUNT(p.*) number_of_transactions
FROM 
    generate_series('2007-03-01', '2007-03-31', INTERVAL '1 Day') gs
LEFT JOIN 
    payment p 
ON 
    p.payment_date::date = gs::date 
GROUP BY 
    gs::date 
HAVING 
    COUNT(p.*) = 0
'''
df_sql = pd.read_sql_query(q, con)
print(df_sql.shape)
df_sql
(21, 2)
date number_of_transactions
0 2007-03-03 0
1 2007-03-04 0
2 2007-03-05 0
3 2007-03-06 0
4 2007-03-07 0
5 2007-03-08 0
6 2007-03-09 0
7 2007-03-10 0
8 2007-03-11 0
9 2007-03-12 0
10 2007-03-13 0
11 2007-03-14 0
12 2007-03-15 0
13 2007-03-24 0
14 2007-03-25 0
15 2007-03-26 0
16 2007-03-27 0
17 2007-03-28 0
18 2007-03-29 0
19 2007-03-30 0
20 2007-03-31 0

▶ Python

gs = pd.date_range(start='20070301', end='20070331', freq='D')
df = (payment
       .assign(payment_date=pd.to_datetime(payment.payment_date).dt.date)
       .groupby('payment_date')
       .agg({'amount':'count'})
       .rename(columns={'amount':'number_of_transactions'})
       .reindex(gs, fill_value=0)
       .query('number_of_transactions == 0')  
)

print(df.shape)
df
(21, 1)
number_of_transactions
2007-03-03 0
2007-03-04 0
2007-03-05 0
2007-03-06 0
2007-03-07 0
2007-03-08 0
2007-03-09 0
2007-03-10 0
2007-03-11 0
2007-03-12 0
2007-03-13 0
2007-03-14 0
2007-03-15 0
2007-03-24 0
2007-03-25 0
2007-03-26 0
2007-03-27 0
2007-03-28 0
2007-03-29 0
2007-03-30 0
2007-03-31 0

▶ R

%%R
gs = data_frame(payment_date=seq(lubridate::ymd('2007-03-01'), 
                                 lubridate::ymd('2007-03-31'), 
                                 by = '1 day'))

df_r = payment %>% 
      mutate(payment_date=lubridate::date(payment_date)) %>% 
      group_by(payment_date) %>% 
      summarise(number_of_transactions=n()) %>% 
      right_join(gs, by="payment_date") %>%
      replace_na(list(number_of_transactions=0)) %>%
      subset(number_of_transactions==0)

print(dim(df_r))
as.data.frame(df_r)
[1] 21  2
   payment_date number_of_transactions
1    2007-03-03                      0
2    2007-03-04                      0
3    2007-03-05                      0
4    2007-03-06                      0
5    2007-03-07                      0
6    2007-03-08                      0
7    2007-03-09                      0
8    2007-03-10                      0
9    2007-03-11                      0
10   2007-03-12                      0
11   2007-03-13                      0
12   2007-03-14                      0
13   2007-03-15                      0
14   2007-03-24                      0
15   2007-03-25                      0
16   2007-03-26                      0
17   2007-03-27                      0
18   2007-03-28                      0
19   2007-03-29                      0
20   2007-03-30                      0
21   2007-03-31                      0

Which movies have never been rented?

So far we’ve only worked with a single table out of the 15 contained in the database. To match movies to rental transactions we will need three new tables. The film table contains information about the movies, the inventory table links each movie to an inventory id, and the rental table uses that inventory id to connect movies to actual rentals.

Let’s have a look at the structures of these tables.

for t in ['inventory', 'payment', 'rental', 'film']:
    q = f'''
SELECT * FROM {t}
LIMIT 1
'''
    df = pd.read_sql_query(q, con)
    print(t)
    display(df)
inventory
inventory_id film_id store_id last_update
0 1 1 1 2006-02-15 10:09:17
payment
payment_id customer_id staff_id rental_id amount payment_date
0 17503 341 2 1520 7.99 2007-02-15 22:25:46.996577
rental
rental_id rental_date inventory_id customer_id return_date staff_id last_update
0 2 2005-05-24 22:54:33 1525 459 2005-05-28 19:40:33 1 2006-02-16 02:30:53
film
film_id title description release_year language_id rental_duration rental_rate length replacement_cost rating last_update special_features fulltext
0 133 Chamber Italian A Fateful Reflection of a Moose And a Husband ... 2006 1 7 4.99 117 14.99 NC-17 2013-05-26 14:50:58.951 [Trailers] 'chamber':1 'fate':4 'husband':11 'italian':2 ...

▶ SQL

Let’s join the tables, count how many times each movie has been rented and select those for which the count is zero.

q = '''
SELECT 
    t.film_id, t.title, t.rentals 
FROM 
    (SELECT 
        f.film_id, f.title, 
        COUNT(distinct r.rental_id) as rentals
    FROM film f
        LEFT JOIN inventory i ON i.film_id = f.film_id
        LEFT JOIN rental r ON r.inventory_id = i.inventory_id
    GROUP BY 1,2
    HAVING 
        COUNT(distinct r.rental_id) = 0
        ) t
'''
df_sql = pd.read_sql_query(q, con)
print(df_sql.shape)
df_sql
(42, 3)
film_id title rentals
0 14 Alice Fantasia 0
1 33 Apollo Teen 0
2 36 Argonauts Town 0
3 38 Ark Ridgemont 0
4 41 Arsenic Independence 0
5 87 Boondock Ballroom 0
6 108 Butch Panther 0
7 128 Catch Amistad 0
8 144 Chinatown Gladiator 0
9 148 Chocolate Duck 0
10 171 Commandments Express 0
11 192 Crossing Divorce 0
12 195 Crowds Telemark 0
13 198 Crystal Breaking 0
14 217 Dazed Punk 0
15 221 Deliverance Mulholland 0
16 318 Firehouse Vietnam 0
17 325 Floats Garden 0
18 332 Frankenstein Stranger 0
19 359 Gladiator Westward 0
20 386 Gump Date 0
21 404 Hate Handicap 0
22 419 Hocus Frida 0
23 495 Kentuckian Giant 0
24 497 Kill Brotherhood 0
25 607 Muppet Mile 0
26 642 Order Betrayed 0
27 669 Pearl Destiny 0
28 671 Perdition Fargo 0
29 701 Psycho Shrunk 0
30 712 Raiders Antitrust 0
31 713 Rainbow Shock 0
32 742 Roof Champion 0
33 801 Sister Freddy 0
34 802 Sky Miracle 0
35 860 Suicides Silence 0
36 874 Tadpole Park 0
37 909 Treasure Command 0
38 943 Villain Desperate 0
39 950 Volume House 0
40 954 Wake Jaws 0
41 955 Walls Artist 0

▶ Python

The go-to approach for join operations in pandas is the merge method. The type of join can be specified with the optional how parameter.

film, inventory, rental = tbl['film'], tbl['inventory'], tbl['rental']

df = (film
        .merge(inventory, on='film_id', how='left')
        .merge(rental, on='inventory_id', how='left')
        .groupby(['film_id','title'])[['rental_id']]
        .count()
        .rename(columns={'rental_id':'rentals'})
        .query('rentals == 0')
        .reset_index()
     )
print(df.shape)
df
(42, 3)
film_id title rentals
0 14 Alice Fantasia 0
1 33 Apollo Teen 0
2 36 Argonauts Town 0
3 38 Ark Ridgemont 0
4 41 Arsenic Independence 0
5 87 Boondock Ballroom 0
6 108 Butch Panther 0
7 128 Catch Amistad 0
8 144 Chinatown Gladiator 0
9 148 Chocolate Duck 0
10 171 Commandments Express 0
11 192 Crossing Divorce 0
12 195 Crowds Telemark 0
13 198 Crystal Breaking 0
14 217 Dazed Punk 0
15 221 Deliverance Mulholland 0
16 318 Firehouse Vietnam 0
17 325 Floats Garden 0
18 332 Frankenstein Stranger 0
19 359 Gladiator Westward 0
20 386 Gump Date 0
21 404 Hate Handicap 0
22 419 Hocus Frida 0
23 495 Kentuckian Giant 0
24 497 Kill Brotherhood 0
25 607 Muppet Mile 0
26 642 Order Betrayed 0
27 669 Pearl Destiny 0
28 671 Perdition Fargo 0
29 701 Psycho Shrunk 0
30 712 Raiders Antitrust 0
31 713 Rainbow Shock 0
32 742 Roof Champion 0
33 801 Sister Freddy 0
34 802 Sky Miracle 0
35 860 Suicides Silence 0
36 874 Tadpole Park 0
37 909 Treasure Command 0
38 943 Villain Desperate 0
39 950 Volume House 0
40 954 Wake Jaws 0
41 955 Walls Artist 0

▶ R

We start by “transferring” the new dataframes to R. The rest of the pipeline is very much like the one used in Python.

%%R -i rental -i inventory -i film -o df_r
df_r = film %>% 
        left_join(inventory, by="film_id") %>% 
        left_join(rental, by="inventory_id") %>% 
        group_by(film_id, title) %>% 
        summarise(rentals=n_distinct(rental_id, na.rm=TRUE)) %>% 
        filter(rentals==0) %>%
        as.data.frame()

df_r
   film_id                  title rentals
1       14         Alice Fantasia       0
2       33            Apollo Teen       0
3       36         Argonauts Town       0
4       38          Ark Ridgemont       0
5       41   Arsenic Independence       0
6       87      Boondock Ballroom       0
7      108          Butch Panther       0
8      128          Catch Amistad       0
9      144    Chinatown Gladiator       0
10     148         Chocolate Duck       0
11     171   Commandments Express       0
12     192       Crossing Divorce       0
13     195        Crowds Telemark       0
14     198       Crystal Breaking       0
15     217             Dazed Punk       0
16     221 Deliverance Mulholland       0
17     318      Firehouse Vietnam       0
18     325          Floats Garden       0
19     332  Frankenstein Stranger       0
20     359     Gladiator Westward       0
21     386              Gump Date       0
22     404          Hate Handicap       0
23     419            Hocus Frida       0
24     495       Kentuckian Giant       0
25     497       Kill Brotherhood       0
26     607            Muppet Mile       0
27     642         Order Betrayed       0
28     669          Pearl Destiny       0
29     671        Perdition Fargo       0
30     701          Psycho Shrunk       0
31     712      Raiders Antitrust       0
32     713          Rainbow Shock       0
33     742          Roof Champion       0
34     801          Sister Freddy       0
35     802            Sky Miracle       0
36     860       Suicides Silence       0
37     874           Tadpole Park       0
38     909       Treasure Command       0
39     943      Villain Desperate       0
40     950           Volume House       0
41     954              Wake Jaws       0
42     955           Walls Artist       0
all(df==df_sql), all(df==df_r)
(True, True)

Find each customer’s first order.

To solve the problem in SQL we need to use a correlated subquery, in which the inner query gets executed for every row of the outer query.

In our case, this happens because min(rental_id) changes with each customer.

▶ SQL

q = '''
SELECT 
    r.customer_id, 
    min(r.rental_id) first_order_id, 
    (    
        SELECT r2.rental_date::date FROM rental r2 
        WHERE r2.rental_id = min(r.rental_id)
    ) first_order_date
FROM 
    rental r
GROUP BY 1
ORDER BY 1
'''

df_sql = pd.read_sql_query(q, con)
print(df_sql.shape)
display(df_sql.head())
(599, 3)
customer_id first_order_id first_order_date
0 1 76 2005-05-25
1 2 320 2005-05-27
2 3 435 2005-05-27
3 4 1297 2005-06-15
4 5 731 2005-05-29

▶ Python

In this case the Python syntax is a bit simpler. We group by the customer_id and pick the smallest rental_id.

df = (rental
        .assign(rental_date=rental.rental_date.dt.date)
        .groupby('customer_id')[['rental_id', 'rental_date']]
        .min()
        .reset_index()
        .rename(columns={'rental_id':'first_order_id', 'rental_date':'first_order_date'})
      )

print(df.shape)
display(df.head())
(599, 3)
customer_id first_order_id first_order_date
0 1 76 2005-05-25
1 2 320 2005-05-27
2 3 435 2005-05-27
3 4 1297 2005-06-15
4 5 731 2005-05-29
all(df==df_sql)
True

▶ R

The R version of the code is similar to the Python one.

%%R 

df_r = rental %>% 
        mutate(rental_date=lubridate::date(rental_date)) %>% 
        group_by(customer_id) %>% 
        summarise_at(vars(rental_id, rental_date), funs(min)) %>%
        as.data.frame()

print(dim(df_r))
print(head(df_r, 5))
[1] 599   3
  customer_id rental_id rental_date
1           1        76  2005-05-25
2           2       320  2005-05-27
3           3       435  2005-05-27
4           4      1297  2005-06-15
5           5       731  2005-05-29

Handling multiple conditions.

Let’s look at a more complex query. We’d like to return all the rental transactions conducted by the staff member with id 1, for which the rental id is larger than 15000, and the payment occurred between 4am and 5am or 11pm and midnight in March 2007.

▶ SQL

Although more complex than our previous examples, this query can be built from the pieces we’ve used before. To keep things somewhat simple, we’ll take “between 4am and 5am” to mean between 4:00 and 4:59. This allows us to only concern ourselves with the hour number when filtering our data. The same goes for the second time period.

q = '''
WITH base_table AS (
    SELECT gs::date AS date,  p.*
    FROM 
        generate_series('2007-03-01', '2007-03-31', INTERVAL '1 day') as gs
    LEFT JOIN payment p ON p.payment_date::date=gs::date AND p.staff_id = 1
    ORDER BY 1 NULLS FIRST
) 
SELECT 
    bt.date, bt.payment_id, bt.customer_id, bt.staff_id, bt.rental_id, bt.amount, 
    EXTRACT(hour FROM bt.payment_date)::int AS hour
FROM 
    base_table bt
WHERE 
    bt.rental_id > 15000 
    AND         
    EXTRACT(hour FROM bt.payment_date) IN (4,23)
ORDER BY bt.payment_date, bt.rental_id
'''
df_sql = pd.read_sql_query(q, con)
print(df_sql.shape)
df_sql
(36, 7)
date payment_id customer_id staff_id rental_id amount hour
0 2007-03-22 23134 46 1 15438 2.99 23
1 2007-03-22 23958 136 1 15439 4.99 23
2 2007-03-22 23095 42 1 15442 2.99 23
3 2007-03-22 22615 598 1 15443 7.99 23
4 2007-03-22 22960 28 1 15445 4.99 23
5 2007-03-22 20574 379 1 15446 4.99 23
6 2007-03-22 21651 490 1 15448 2.99 23
7 2007-03-22 20027 322 1 15450 0.99 23
8 2007-03-22 21866 514 1 15451 2.99 23
9 2007-03-22 20370 359 1 15453 1.99 23
10 2007-03-22 24882 238 1 15455 0.99 23
11 2007-03-22 25113 262 1 15456 0.99 23
12 2007-03-22 19906 306 1 15457 2.99 23
13 2007-03-22 22876 20 1 15460 2.99 23
14 2007-03-22 23646 103 1 15461 5.99 23
15 2007-03-22 20675 389 1 15462 5.99 23
16 2007-03-22 23869 127 1 15463 5.99 23
17 2007-03-22 23283 62 1 15464 6.99 23
18 2007-03-22 21921 520 1 15465 0.99 23
19 2007-03-22 20970 418 1 15466 4.99 23
20 2007-03-22 23647 103 1 15467 3.99 23
21 2007-03-22 20768 399 1 15468 4.99 23
22 2007-03-22 22610 597 1 15469 4.99 23
23 2007-03-23 24692 215 1 15583 2.99 4
24 2007-03-23 21731 500 1 15584 2.99 4
25 2007-03-23 22149 545 1 15585 0.99 4
26 2007-03-23 21723 499 1 15587 7.99 4
27 2007-03-23 21764 503 1 15588 3.99 4
28 2007-03-23 22901 22 1 15589 6.99 4
29 2007-03-23 23284 62 1 15591 0.99 4
30 2007-03-23 24163 153 1 15593 1.99 4
31 2007-03-23 25135 264 1 15595 4.99 4
32 2007-03-23 24478 186 1 15596 0.99 4
33 2007-03-23 23323 66 1 15598 8.99 4
34 2007-03-23 23648 103 1 15599 5.99 4
35 2007-03-23 23729 113 1 15600 1.99 4

▶ Python

Just for some variety we use the between method of pandas series to select the March 2007 data. We pass pandas Timestamp objects to mark the first and last day of the month. The rest of the pipeline should be self-explanatory. The last line is added for convenience. We order the columns of the pandas dataframe to match the ordering of the columns in our SQL result.

df = (payment
  .assign(date=payment.payment_date.dt.date)
  .loc[payment.payment_date.between(
              pd.Timestamp(year=2007, month=3, day=1), 
              pd.Timestamp(year=2007, month=3, day=31))
      ]
  .assign(hour=payment.payment_date.dt.hour)
  .query('(hour in (4, 23)) & (rental_id > 15000) & (staff_id == 1)')
  .sort_values('payment_date')
  .reset_index()
  .loc[:, df_sql.columns]
) 

print(df.shape)
df
(36, 7)
date payment_id customer_id staff_id rental_id amount hour
0 2007-03-22 23134 46 1 15438 2.99 23
1 2007-03-22 23958 136 1 15439 4.99 23
2 2007-03-22 23095 42 1 15442 2.99 23
3 2007-03-22 22615 598 1 15443 7.99 23
4 2007-03-22 22960 28 1 15445 4.99 23
5 2007-03-22 20574 379 1 15446 4.99 23
6 2007-03-22 21651 490 1 15448 2.99 23
7 2007-03-22 20027 322 1 15450 0.99 23
8 2007-03-22 21866 514 1 15451 2.99 23
9 2007-03-22 20370 359 1 15453 1.99 23
10 2007-03-22 24882 238 1 15455 0.99 23
11 2007-03-22 25113 262 1 15456 0.99 23
12 2007-03-22 19906 306 1 15457 2.99 23
13 2007-03-22 22876 20 1 15460 2.99 23
14 2007-03-22 23646 103 1 15461 5.99 23
15 2007-03-22 20675 389 1 15462 5.99 23
16 2007-03-22 23869 127 1 15463 5.99 23
17 2007-03-22 23283 62 1 15464 6.99 23
18 2007-03-22 21921 520 1 15465 0.99 23
19 2007-03-22 20970 418 1 15466 4.99 23
20 2007-03-22 23647 103 1 15467 3.99 23
21 2007-03-22 20768 399 1 15468 4.99 23
22 2007-03-22 22610 597 1 15469 4.99 23
23 2007-03-23 24692 215 1 15583 2.99 4
24 2007-03-23 21731 500 1 15584 2.99 4
25 2007-03-23 22149 545 1 15585 0.99 4
26 2007-03-23 21723 499 1 15587 7.99 4
27 2007-03-23 21764 503 1 15588 3.99 4
28 2007-03-23 22901 22 1 15589 6.99 4
29 2007-03-23 23284 62 1 15591 0.99 4
30 2007-03-23 24163 153 1 15593 1.99 4
31 2007-03-23 25135 264 1 15595 4.99 4
32 2007-03-23 24478 186 1 15596 0.99 4
33 2007-03-23 23323 66 1 15598 8.99 4
34 2007-03-23 23648 103 1 15599 5.99 4
35 2007-03-23 23729 113 1 15600 1.99 4
all(df==df_sql)
True

▶ R

Once again, a similar pipeline can be built in R.

%%R -o df_r

df_r = payment %>% 
  mutate(date=lubridate::date(payment_date)) %>% 
  filter(date >= as.Date('2007-03-01') &  date <= as.Date('2007-03-31')) %>% 
  mutate(hour=lubridate::hour(payment_date)) %>% 
  filter(hour %in% c(4, 23) & rental_id > 15000 & staff_id == 1) %>% 
  arrange(payment_date) %>% 
  select(date, payment_id, customer_id, staff_id, rental_id, amount, hour) %>% 
  as.data.frame()

print(dim(df_r))
df_r
[1] 36  7
         date payment_id customer_id staff_id rental_id amount hour
1  2007-03-22      23134          46        1     15438   2.99   23
2  2007-03-22      23958         136        1     15439   4.99   23
3  2007-03-22      23095          42        1     15442   2.99   23
4  2007-03-22      22615         598        1     15443   7.99   23
5  2007-03-22      22960          28        1     15445   4.99   23
6  2007-03-22      20574         379        1     15446   4.99   23
7  2007-03-22      21651         490        1     15448   2.99   23
8  2007-03-22      20027         322        1     15450   0.99   23
9  2007-03-22      21866         514        1     15451   2.99   23
10 2007-03-22      20370         359        1     15453   1.99   23
11 2007-03-22      24882         238        1     15455   0.99   23
12 2007-03-22      25113         262        1     15456   0.99   23
13 2007-03-22      19906         306        1     15457   2.99   23
14 2007-03-22      22876          20        1     15460   2.99   23
15 2007-03-22      23646         103        1     15461   5.99   23
16 2007-03-22      20675         389        1     15462   5.99   23
17 2007-03-22      23869         127        1     15463   5.99   23
18 2007-03-22      23283          62        1     15464   6.99   23
19 2007-03-22      21921         520        1     15465   0.99   23
20 2007-03-22      20970         418        1     15466   4.99   23
21 2007-03-22      23647         103        1     15467   3.99   23
22 2007-03-22      20768         399        1     15468   4.99   23
23 2007-03-22      22610         597        1     15469   4.99   23
24 2007-03-23      24692         215        1     15583   2.99    4
25 2007-03-23      21731         500        1     15584   2.99    4
26 2007-03-23      22149         545        1     15585   0.99    4
27 2007-03-23      21723         499        1     15587   7.99    4
28 2007-03-23      21764         503        1     15588   3.99    4
29 2007-03-23      22901          22        1     15589   6.99    4
30 2007-03-23      23284          62        1     15591   0.99    4
31 2007-03-23      24163         153        1     15593   1.99    4
32 2007-03-23      25135         264        1     15595   4.99    4
33 2007-03-23      24478         186        1     15596   0.99    4
34 2007-03-23      23323          66        1     15598   8.99    4
35 2007-03-23      23648         103        1     15599   5.99    4
36 2007-03-23      23729         113        1     15600   1.99    4

Customer lifetime value

Which customers made their first order on a weekend, paid more than $5, and have a customer lifetime value (total amount spent) exceeding $100?

▶ SQL

For this query we need to extract the day of the week for each transaction. This can be done with EXTRACT(dow FROM ...).

Note that in PostgreSQL Sunday is coded as 0, and Saturday as 6.

q = '''
SELECT t.* FROM (
    SELECT 
        p.*, 
        EXTRACT(dow FROM p.payment_date)::int  dow,
        (
            SELECT SUM(p3.amount) 
            FROM payment p3
            WHERE p3.customer_id = p.customer_id   
        ) as CLV
    FROM 
        payment p
    WHERE 
        p.payment_id = (
            SELECT MIN(p2.payment_id)
            FROM payment p2
            WHERE p.customer_id = p2.customer_id
        ) 
        AND 
            EXTRACT(dow FROM p.payment_date) IN (0, 6)
        AND 
            p.amount > 5     

    GROUP BY 1
)t WHERE t.CLV > 100
ORDER BY t.CLV DESC
'''
df_sql = pd.read_sql_query(q, con)
print(df_sql.shape)
df_sql
(17, 8)
payment_id customer_id staff_id rental_id amount payment_date dow clv
0 19029 137 1 2469 6.99 2007-02-18 18:52:49.996577 0 191.62
1 18572 21 2 2235 7.99 2007-02-18 02:37:16.996577 0 146.68
2 17526 346 1 1994 5.99 2007-02-17 09:35:32.996577 6 145.70
3 19502 265 2 2027 7.99 2007-02-17 11:35:22.996577 6 132.72
4 17509 342 2 2190 5.99 2007-02-17 23:58:17.996577 6 130.68
5 17866 436 1 2291 9.99 2007-02-18 06:05:12.996577 0 126.73
6 18099 497 2 2180 8.99 2007-02-17 23:16:09.996577 6 121.73
7 18995 128 2 2519 7.99 2007-02-18 22:47:47.996577 0 118.70
8 19496 263 2 2126 8.99 2007-02-17 19:23:02.996577 6 116.73
9 18636 32 2 1887 6.99 2007-02-17 02:21:44.996577 6 112.74
10 19345 225 2 2226 7.99 2007-02-18 02:08:22.996577 0 111.76
11 18395 579 2 2425 5.99 2007-02-18 16:06:11.996577 0 111.73
12 18554 16 2 1934 6.99 2007-02-17 05:33:23.996577 6 109.75
13 18666 40 2 2470 7.99 2007-02-18 18:56:57.996577 0 105.74
14 18446 593 2 2055 5.99 2007-02-17 13:55:29.996577 6 101.76
15 18367 572 2 1889 10.99 2007-02-17 02:33:38.996577 6 100.76
16 19441 251 1 2238 6.99 2007-02-18 02:50:32.996577 0 100.75

▶ Python

Let’s break the pipeline into two parts. First, we compute the subset of customers who placed their first order on a weekend.

Note that unlike PostgreSQL, pandas codes Saturday as 5 and Sunday as 6.
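Since each tool uses a different convention, a quick sanity check on a known date can save some confusion. For 2007-02-17 (a Saturday in our data), PostgreSQL’s EXTRACT(dow ...) returns 6 and lubridate::wday() returns 7, while pandas gives:

# pandas encodes Monday as 0 ... Sunday as 6
d = pd.Timestamp('2007-02-17')
print(d.day_name(), d.weekday())   # Saturday 5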

subset_fo = (payment
               .assign(dow=lambda x: x.payment_date.dt.weekday) # extract the day of the week as an integer
               .groupby('customer_id')[['payment_date', 'payment_id', 'amount', 'rental_id', 'staff_id', 'dow']]
               .first()                                         # pick the first row for each group
               .query('(amount > 5) & (dow in [5, 6])')
             )
subset_fo.head()
payment_date payment_id amount rental_id staff_id dow
customer_id
16 2007-02-17 05:33:23.996577 18554 6.99 1934 2 5
17 2007-02-17 22:46:24.996577 18558 5.99 2175 2 5
21 2007-02-18 02:37:16.996577 18572 7.99 2235 2 6
32 2007-02-17 02:21:44.996577 18636 6.99 1887 2 5
40 2007-02-18 18:56:57.996577 18666 7.99 2470 2 6

We then join the payment table on this subset to get the information we want.

df = (payment
       .loc[payment.customer_id.isin(subset_fo.index)]
       .groupby('customer_id')[['amount']].sum()
       .rename(columns={'amount':'clv'})
       .query('clv >= 100')
       .join(subset_fo, how='left')
       .sort_values('clv', ascending=False)
       .reset_index()
       .loc[:, df_sql.columns]
    )

print(df.shape)
df
(17, 8)
payment_id customer_id staff_id rental_id amount payment_date dow clv
0 19029 137 1 2469 6.99 2007-02-18 18:52:49.996577 6 191.62
1 18572 21 2 2235 7.99 2007-02-18 02:37:16.996577 6 146.68
2 17526 346 1 1994 5.99 2007-02-17 09:35:32.996577 5 145.70
3 19502 265 2 2027 7.99 2007-02-17 11:35:22.996577 5 132.72
4 17509 342 2 2190 5.99 2007-02-17 23:58:17.996577 5 130.68
5 17866 436 1 2291 9.99 2007-02-18 06:05:12.996577 6 126.73
6 18099 497 2 2180 8.99 2007-02-17 23:16:09.996577 5 121.73
7 18995 128 2 2519 7.99 2007-02-18 22:47:47.996577 6 118.70
8 19496 263 2 2126 8.99 2007-02-17 19:23:02.996577 5 116.73
9 18636 32 2 1887 6.99 2007-02-17 02:21:44.996577 5 112.74
10 19345 225 2 2226 7.99 2007-02-18 02:08:22.996577 6 111.76
11 18395 579 2 2425 5.99 2007-02-18 16:06:11.996577 6 111.73
12 18554 16 2 1934 6.99 2007-02-17 05:33:23.996577 5 109.75
13 18666 40 2 2470 7.99 2007-02-18 18:56:57.996577 6 105.74
14 18446 593 2 2055 5.99 2007-02-17 13:55:29.996577 5 101.76
15 18367 572 2 1889 10.99 2007-02-17 02:33:38.996577 5 100.76
16 19441 251 1 2238 6.99 2007-02-18 02:50:32.996577 6 100.75
all(df==df_sql)
True

▶ R

Let’s do the same with R. We’ll drop the payment_date column at the end so that we can fit all the remaining columns horizontally.

Of course, R (lubridate) uses yet another encoding for the days of the week. Saturday is represented as 7 and Sunday as 1.

%%R

subset_fo = payment %>% 
              group_by(customer_id) %>% 
              mutate(dow=lubridate::wday(payment_date)) %>% 
              filter(row_number()==1 & dow%in% c(1, 7) & amount>5) 

df_r = payment %>% 
        right_join(subset_fo, by="customer_id") %>% 
        group_by(customer_id) %>% 
        summarise(clv=sum(amount.x)) %>% 
        filter(clv > 100) %>% 
        left_join(subset_fo, by='customer_id') %>% 
        select(payment_id, customer_id, staff_id, rental_id, amount, payment_date, dow, clv) %>% 
        arrange(desc(clv)) %>%
        select(-payment_date)

print(dim(df_r))
as.data.frame(df_r)
[1] 17  7
   payment_id customer_id staff_id rental_id amount dow    clv
1       19029         137        1      2469   6.99   1 191.62
2       18572          21        2      2235   7.99   1 146.68
3       17526         346        1      1994   5.99   7 145.70
4       19502         265        2      2027   7.99   7 132.72
5       17509         342        2      2190   5.99   7 130.68
6       17866         436        1      2291   9.99   1 126.73
7       18099         497        2      2180   8.99   7 121.73
8       18995         128        2      2519   7.99   1 118.70
9       19496         263        2      2126   8.99   7 116.73
10      18636          32        2      1887   6.99   7 112.74
11      19345         225        2      2226   7.99   1 111.76
12      18395         579        2      2425   5.99   1 111.73
13      18554          16        2      1934   6.99   7 109.75
14      18666          40        2      2470   7.99   1 105.74
15      18446         593        2      2055   5.99   7 101.76
16      18367         572        2      1889  10.99   7 100.76
17      19441         251        1      2238   6.99   1 100.75

How many movies have a replacement cost above or below the average replacement cost?

One way to answer this is to compute how many movies have a replacement cost higher than the average, how many have a replacement cost lower than or equal to the average, and take the union of the two.

▶ SQL

q = '''
SELECT t.grouping, COUNT(*) FROM (
    SELECT f.*, 'above' as grouping
    FROM film f
    WHERE f.replacement_cost > (SELECT AVG(f2.replacement_cost) FROM film f2) 

    UNION 

    SELECT f.*, 'below_eq' as grouping
    FROM film f
    WHERE f.replacement_cost <= (SELECT AVG(f2.replacement_cost) FROM film f2) 
    )t 
GROUP BY 1
'''
df_sql = pd.read_sql_query(q, con)
print(df_sql.shape)
df_sql.head()
(2, 2)
grouping count
0 below_eq 464
1 above 536

▶ Python

The Python syntax is definitely simpler here because we can create a Boolean mask telling us whether the replacement cost is above average, and then use value_counts to tally up the True and False counts.

(film
  .assign(above=film.replacement_cost > film.replacement_cost.mean())
  .loc[:, 'above']
  .value_counts()
  .rename(index={True: 'above', False: 'below_eq'})
)
above       536
below_eq    464
Name: above, dtype: int64

▶ R

The R code is similar to the Python one.

%%R
film %>% 
  mutate(above=replacement_cost>mean(film$replacement_cost)) %>% 
  count(above) %>% 
  mutate(above=if_else(above, 'above', 'below_eq'))
# A tibble: 2 x 2
  above        n
  <chr>    <int>
1 below_eq   464
2 above      536

Which movie generated the most revenue in each viewer rating category?

▶ SQL

The grouping by rating part is straightforward; however, we want the top-performing movie within each group. We can use a window function (the OVER clause) to do this.

q = '''
WITH some_table AS (
    SELECT 
        f.film_id, 
        f.title, 
        f.rating, 
        SUM(p.amount),
        ROW_NUMBER() OVER(PARTITION BY f.rating ORDER BY SUM(p.amount) DESC)

    FROM film f
    JOIN inventory i ON f.film_id = i.film_id
    JOIN rental r ON r.inventory_id = i.inventory_id
    JOIN payment p ON p.rental_id = r.rental_id
    GROUP BY 1, 2, 3
    ORDER BY 3) 
    
SELECT st.* FROM some_table st 
WHERE st.row_number = 1
ORDER BY 4 DESC
'''
df_sql = pd.read_sql_query(q, con)
df_sql
film_id title rating sum row_number
0 879 Telegraph Voyage PG 215.75 1
1 1000 Zorro Ark NC-17 199.72 1
2 460 Innocent Usual PG-13 191.74 1
3 764 Saturday Lambs G 190.74 1
4 938 Velvet Terminator R 152.77 1

▶ Python

The pattern below is a common one. We group by some features to compute an aggregate measure, and then ungroup (reset_index) to group the data again at a different level of granularity.

df = (film
       .merge(inventory, on='film_id', how='left')    
       .merge(rental, on='inventory_id', how='left')
       .merge(payment, on='rental_id', how='left')
       .groupby(['rating','film_id'])[['amount']]
       .sum()
       .sort_values('amount', ascending=False)
       .rename(columns={'amount': 'sum'})
       .reset_index()
       .groupby('rating')
       .first()
       .merge(film, on='film_id', how='left')
       .sort_values('sum', ascending=False)
       .reset_index()
       .loc[:, ['film_id','title','rating','sum']]
      )

df
film_id title rating sum
0 879 Telegraph Voyage PG 215.75
1 1000 Zorro Ark NC-17 199.72
2 460 Innocent Usual PG-13 191.74
3 764 Saturday Lambs G 190.74
4 938 Velvet Terminator R 152.77
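As an aside (not needed for the answer above, just an illustration), pandas also has window-like operations: a rank within a groupby can play the role of ROW_NUMBER() OVER (PARTITION BY ...). A sketch, reusing the film, inventory, rental and payment dataframes:

# total revenue per movie, keeping the rating alongside
revenue = (film
            .merge(inventory, on='film_id', how='left')
            .merge(rental, on='inventory_id', how='left')
            .merge(payment, on='rental_id', how='left')
            .groupby(['rating', 'film_id', 'title'], as_index=False)['amount']
            .sum())

# rank the revenue within each rating, mimicking ROW_NUMBER() ... ORDER BY SUM(amount) DESC
revenue['row_number'] = (revenue
                          .groupby('rating')['amount']
                          .rank(method='first', ascending=False)
                          .astype(int))

revenue.query('row_number == 1').sort_values('amount', ascending=False)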

▶ R

This is essentially the same approach as in Python.

%%R
df_r = film %>% 
  left_join(inventory, by='film_id') %>% 
  left_join(rental, by='inventory_id') %>% 
  left_join(payment, by='rental_id') %>% 
  group_by(rating, film_id) %>% 
  summarise(sum=sum(amount, na.rm=TRUE)) %>% 
  ungroup() %>% 
  arrange(desc(sum)) %>% 
  group_by(rating) %>% 
  filter(row_number()==1) %>% 
  left_join(film, by='film_id') %>% 
  select(film_id, title, rating.x, sum) 

as.data.frame(df_r)
  film_id             title rating.x    sum
1     879  Telegraph Voyage       PG 215.75
2    1000         Zorro Ark    NC-17 199.72
3     460    Innocent Usual    PG-13 191.74
4     764    Saturday Lambs        G 190.74
5     938 Velvet Terminator        R 152.77

Hopefully, the examples above show that a significant part of the syntax used to wrangle structured data (data arranged in tables), in both Python (pandas) and R (tidyverse), has been influenced by SQL.

Many of the pipelines we built consist of the same steps as the corresponding SQL query. This is especially true for basic queries where we do simple things like group by columns or build a simple aggregate measure. Similarly, syntactic differences aside, joining operations work in a comparable way across the three languages.

As we consider more complex operations, however, a direct port of the SQL query to the other languages may not be the best approach. We saw a small example of this when we computed which days in March had no rentals. The good news is that these languages aren’t mutually exclusive; after all, we did run all our SQL via Python (pandas). Rather, they complement each other and can be used together to build pipelines that perform complex data manipulation efficiently. Very often, SQL is used first to extract a subset of the data from a large database, which is then manipulated and transformed further in Python or R.
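As a final, purely illustrative sketch of that division of labour (push the filtering to the database, do the exploratory reshaping client-side):

# let the database do the heavy filtering...
q = '''
SELECT customer_id, amount, payment_date 
FROM payment 
WHERE amount > 5;
'''
big_spenders = pd.read_sql_query(q, con, parse_dates=['payment_date'])

# ...and reshape the (much smaller) result in pandas
monthly = (big_spenders
            .assign(month=big_spenders.payment_date.dt.to_period('M'))
            .groupby('month')['amount']
            .agg(['count', 'sum']))
print(monthly)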

To steal a line from Alexandre Dumas’ The Three Musketeers, this is truly a case of All for one and one for all!

Introduction to regular expressions


Jamie Zawinski:

Some people, when confronted with a problem, think "I know, I'll use regular expressions".
Now they have two problems.

1. Prelude:

Regular expressions are powerful...


http://xkcd.com/208/


...but at first, they can be puzzling.


http://xkcd.com/1171/
from math import *
import numpy as np
import pandas as pd
from pathlib import Path

%matplotlib inline
import matplotlib.pyplot as plt

from IPython.core.display import display, HTML
display(HTML("<style>.container { width:90% !important; }</style>"))

2. Introduction

A regular expression (regex) is a sequence of literal characters and metacharacters which defines a search pattern.

Most programming languages have some implementation of regular expressions; however, the syntax may vary.

One basic example is the asterisk used as a wildcard by most file systems to denote any number of characters.

For instance, *.txt denotes all files with a .txt extension.
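To make this concrete, here is a tiny illustration of shell-style wildcards using Python's fnmatch module and a made-up list of file names:

import fnmatch

files = ['notes.txt', 'report.pdf', 'draft.txt', 'image.png']
print(fnmatch.filter(files, '*.txt'))   # ['notes.txt', 'draft.txt']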

The most elementary search pattern is the one that consists of the very characters you're looking for.

The find method of strings does just that.

s = "To be or not to be"
print(s.find('not'), s.find('question'))
9 -1

For a more general and powerful approach to pattern searching, Python has the re module.

import re

We’ll use the opening paragraphs from A Tale of Two Cities as an example.

dickens = '''
It was the best of times, it was the worst of times, 
it was the age of wisdom, it was the age of foolishness, it was the epoch of belief,
it was the epoch of incredulity, it was the season of Light, it was the season of Darkness,
it was the spring of hope, it was the winter of despair, we had everything before us, we had
nothing before us, we were all going direct to Heaven, we were all going direct the other way -
in short, the period was so far like the present period, that some of its noisiest authorities
insisted on its being received, for good or for evil, in the superlative degree of comparison only.

There were a king with a large jaw and a queen with a plain face, on the
throne of England; there were a king with a large jaw and a queen with
a fair face, on the throne of France. In both countries it was clearer
than crystal to the lords of the State preserves of loaves and fishes,
that things in general were settled for ever.

It was the year of Our Lord one thousand seven hundred and seventy-five.
Spiritual revelations were conceded to England at that favoured period,
as at this. Mrs. Southcott had recently attained her five-and-twentieth
blessed birthday, of whom a prophetic private in the Life Guards had
heralded the sublime appearance by announcing that arrangements were
made for the swallowing up of London and Westminster. Even the Cock-lane
ghost had been laid only a round dozen of years, after rapping out its
messages, as the spirits of this very year last past (supernaturally
deficient in originality) rapped out theirs. Mere messages in the
earthly order of events had lately come to the English Crown and People,
from a congress of British subjects in America: which, strange
to relate, have proved more important to the human race than any
communications yet received through any of the chickens of the Cock-lane
brood.
'''

A. Literals

The search method of the re module returns a _sre.SRE_Match object which has some useful properties.

result = re.search('times', dickens)

print(type(result))
print(*[item for item in dir(result) if not item.startswith('_')], sep='\n')
<class '_sre.SRE_Match'>
end
endpos
expand
group
groupdict
groups
lastgroup
lastindex
pos
re
regs
span
start
string

The result of the search tells us whether, and where, the pattern is found in the string.

print("result.span(): ", result.span())

print("result.group():", result.group())

s = result.span()

print(dickens[s[0]:s[1]])
result.span():  (20, 25)
result.group(): times
times

There are other useful functions for finding a pattern besides the search function.

Remark:


If a search pattern is to be reused, it is faster to compile it first. We'll do that henceforth.
pattern = re.compile(r'times')

# only matches the beginning of the string.
result = pattern.match(dickens)
print('match:  ', result)

# search for first match
result = pattern.search(dickens)
print('search: ', result.group())

# returns a list of all matches.
result = pattern.findall(dickens)
print('findall:', result)
match:   None
search:  times
findall: ['times', 'times']
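As an aside, we can get a rough feel for the compilation claim with timeit. Keep in mind that re caches recently used patterns, so the gain is usually modest; the numbers below are purely illustrative and will vary from machine to machine.

import timeit

setup = "import re; text = 'It was the best of times, it was the worst of times' * 100"

# Re-using the module-level function (the pattern is looked up in re's internal cache on each call)
print(timeit.timeit("re.search('times', text)", setup=setup, number=10000))

# Re-using a pre-compiled pattern object
print(timeit.timeit("p.search(text)", setup=setup + "; p = re.compile('times')", number=10000))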

Note:

What we've used above is an example of a search pattern involving literals. The pattern was the very string we were looking for.

Notice that we didn't have to worry about punctuation. If the pattern is found, it is the only substring returned by search or findall.

B. Character classes

Sometimes we’d like to consider variants of a string. For instance, "It" and "it".

This means that we need to look for either an uppercase “i” or a lowercase one, followed by the letter “t”.

Character classes are specified using square brackets.

pattern = re.compile(r'[Ii]t')

result = pattern.findall(dickens)

print(result)
['It', 'it', 'it', 'it', 'it', 'it', 'it', 'it', 'it', 'it', 'it', 'it', 'it', 'it', 'it', 'it', 'it', 'it', 'it', 'It', 'it', 'it', 'it', 'it', 'it']

This is a simple example of a character set.

[Ii] means match any one of the characters between the square brackets.

Using character sets allows us to greatly simplify our syntax.

Python has some predefined character sets which help us write compact code.

Character Class Matches
[A-Z] any single letter of the Latin alphabet in uppercase
[a-z] any single letter of the Latin alphabet in lowercase
[A-z] any single letter of the Latin alphabet in either case (beware: in ASCII this range also includes a few punctuation characters that sit between 'Z' and 'a')
[0-9] any single digit between 0 and 9
[^0-9] any character except a single digit between 0 and 9

Here's an example:

re.findall(r'[0-9]', "Today is the 23rd of May")
['2', '3']
re.findall(r'[A-z]', "Today is the 23rd of May")
['T',
 'o',
 'd',
 'a',
 'y',
 'i',
 's',
 't',
 'h',
 'e',
 'r',
 'd',
 'o',
 'f',
 'M',
 'a',
 'y']
re.findall(r'[^A-z ]', "Today is the 23rd of May")
['2', '3']
re.findall(r'[^0-9]', "Today is the 23rd of May")
['T',
 'o',
 'd',
 'a',
 'y',
 ' ',
 'i',
 's',
 ' ',
 't',
 'h',
 'e',
 ' ',
 'r',
 'd',
 ' ',
 'o',
 'f',
 ' ',
 'M',
 'a',
 'y']

Back to our long string.

If we were searching for a substring made of any single uppercase letter followed by any single lower case letter we could write:

pattern = re.compile(r'[A-Z][a-z]')
result = pattern.findall(dickens)
print(result)
['It', 'Li', 'Da', 'He', 'Th', 'En', 'Fr', 'In', 'St', 'It', 'Ou', 'Lo', 'Sp', 'En', 'Mr', 'So', 'Li', 'Gu', 'Lo', 'We', 'Ev', 'Co', 'Me', 'En', 'Cr', 'Pe', 'Br', 'Am', 'Co']

Note:

  • Notice that except for a couple of elements in the list, we didn't get entire words.

  • The regular expression only specifies a pattern with one uppercase letter, followed by one lower case letter.

  • That's usually not what you're after. You'd probably want to extract whole words which start with an uppercase letter, followed by any number of lowercase letters (capitalised words).

Maybe all we need is to add more [a-z] sets to our patterns? Let’s try that…

pattern = re.compile(r'[A-Z][a-z][a-z]')
result = pattern.findall(dickens)
print(result)
['Lig', 'Dar', 'Hea', 'The', 'Eng', 'Fra', 'Sta', 'Our', 'Lor', 'Spi', 'Eng', 'Mrs', 'Sou', 'Lif', 'Gua', 'Lon', 'Wes', 'Eve', 'Coc', 'Mer', 'Eng', 'Cro', 'Peo', 'Bri', 'Ame', 'Coc']

That’s not really what we want.

Sure, we’ve now got one more lowercase letter after the first, uppercase one, but we still don’t have whole words, unless our string happens to contain capitalised three-letter words. To top it off, we’ve now lost "It"!

In other words, unless you only want to extract capitalised words with a specific number of letters, this is not the way to go.

Actually, even in that case, this is not a smart way to do it.

If we wanted to find all the capitalised words that are, say, 10 letters long, we don’t really want to have to type a pattern such as:

[A-Z][a-z][a-z][a-z][a-z][a-z][a-z][a-z][a-z][a-z]

We need a way to specify that we want one or more lowercase letters.

For this we need more than literals or character sets, we need metacharacters.

C. Metacharacters

Literals are characters which are simply part of the pattern we are looking for.

Metacharacters, on the other hand, act like modifiers: they change how the literals or character classes around them are handled.

By default each literal is matched only once. By using the + symbol, any character or character class appearing just before the metacharacter will be matched one or more times.

There are several modifiers that we can use.

Modifier Number of occurrences
+ one or more
* zero or more
? zero or one
{m} m times
{m, n} between m and n times
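Here is a quick illustration of each modifier on some toy strings of our own (not from the Dickens passage):

print(re.findall(r'go+gle', 'gogle google gooogle'))    # + : one or more 'o'
print(re.findall(r'ab*c', 'ac abc abbbc'))              # * : zero or more 'b'
print(re.findall(r'colou?r', 'color colour'))           # ? : zero or one 'u'
print(re.findall(r'\d{4}', 'year 2019, code 42'))       # {m} : exactly four digits
print(re.findall(r'\d{2,4}', '7 42 2019'))              # {m,n} : between two and four digits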

For instance, if we wanted to extract a list of the words that are capitalised in a string, in the past we may have written something like this:

print([word for word in dickens.strip().split() if word[0].isupper() and word[1].islower()])
['It', 'Light,', 'Darkness,', 'Heaven,', 'There', 'England;', 'France.', 'In', 'State', 'It', 'Our', 'Lord', 'Spiritual', 'England', 'Mrs.', 'Southcott', 'Life', 'Guards', 'London', 'Westminster.', 'Even', 'Cock-lane', 'Mere', 'English', 'Crown', 'People,', 'British', 'America:', 'Cock-lane']

Notice, however, that some of the words have punctuation symbols attached to them.

No big deal, we know how to deal with this.

import string 
print([word.strip(string.punctuation) for word in dickens.split() if word[0].isupper() and word[1].islower()])
['It', 'Light', 'Darkness', 'Heaven', 'There', 'England', 'France', 'In', 'State', 'It', 'Our', 'Lord', 'Spiritual', 'England', 'Mrs', 'Southcott', 'Life', 'Guards', 'London', 'Westminster', 'Even', 'Cock-lane', 'Mere', 'English', 'Crown', 'People', 'British', 'America', 'Cock-lane']

That worked fine; however, the syntax is a bit unwieldy.

Another version could be.

import string 
print([word.strip(string.punctuation) for word in dickens.strip().split() 
 if word.istitle()])
['It', 'Light', 'Darkness', 'Heaven', 'There', 'England', 'France', 'In', 'State', 'It', 'Our', 'Lord', 'Spiritual', 'England', 'Mrs', 'Southcott', 'Life', 'Guards', 'London', 'Westminster', 'Even', 'Mere', 'English', 'Crown', 'People', 'British', 'America']

Notice, however, that we lost a word…

Using a regular expression, we can specify character sets which match our pattern.

pattern = re.compile(r'[A-Z][a-z]+')
result = pattern.findall(dickens)
print(result)
['It', 'Light', 'Darkness', 'Heaven', 'There', 'England', 'France', 'In', 'State', 'It', 'Our', 'Lord', 'Spiritual', 'England', 'Mrs', 'Southcott', 'Life', 'Guards', 'London', 'Westminster', 'Even', 'Cock', 'Mere', 'English', 'Crown', 'People', 'British', 'America', 'Cock']

Notice that with this pattern, hyphenated words are split and only the first part is returned. Let’s handle this case.

pattern = re.compile(r'[A-Z][a-z]+-?[a-z]*')
result = pattern.findall(dickens)
print(result)
['It', 'Light', 'Darkness', 'Heaven', 'There', 'England', 'France', 'In', 'State', 'It', 'Our', 'Lord', 'Spiritual', 'England', 'Mrs', 'Southcott', 'Life', 'Guards', 'London', 'Westminster', 'Even', 'Cock-lane', 'Mere', 'English', 'Crown', 'People', 'British', 'America', 'Cock-lane']

D. Other built-in character classes and metacharacters

Class Matches
. any character except \n
\d any numeric character
\D any non-numeric character
\w any alphanumeric character (same as [0-9a-zA-Z_] for ASCII text)
\W any non-alphanumeric character
\b a word boundary
\s any whitespace character (including \n and \t)
\S any non-whitespace character
^ start of line
$ end of line
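A couple of these deserve a quick demonstration before we move on (these checks are ours, run on the same Dickens text):

print(re.findall(r'\bIt\b', dickens))       # 'It' as a whole word only (case-sensitive)
print(re.findall(r'\w+ness\b', dickens))    # whole words ending in 'ness'
print(len(re.findall(r'\S+', dickens)))     # a rough word count based on runs of non-whitespace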

First word of each line.

We use re.MULTILINE so that ^ and $ match at the start and end of each line, rather than only at the start and end of the whole string.

result = re.findall(r'^\w+', dickens, re.MULTILINE)
print(result)
['It', 'it', 'it', 'it', 'nothing', 'in', 'insisted', 'There', 'throne', 'a', 'than', 'that', 'It', 'Spiritual', 'as', 'blessed', 'heralded', 'made', 'ghost', 'messages', 'deficient', 'earthly', 'from', 'to', 'communications', 'brood']

Last word of each line.

This is a bit more tricky. First note the following:

result = re.findall(r'\w+$', dickens, re.MULTILINE)
print(result)
['had', 'authorities', 'the', 'with', 'clearer', 'twentieth', 'had', 'were', 'lane', 'its', 'supernaturally', 'the', 'strange', 'any', 'lane']

The problem here is that our pattern only matches a word at the end of a line when no other character (such as a punctuation mark) follows it.

Let’s try to include the possibility of a punctuation symbol.

result = re.findall(r'\w+.?$', dickens, re.MULTILINE)
print(result)
['belief,', 'Darkness,', 'had', 'authorities', 'only.', 'the', 'with', 'clearer', 'fishes,', 'ever.', 'five.', 'period,', 'twentieth', 'had', 'were', 'lane', 'its', 'supernaturally', 'the', 'People,', 'strange', 'any', 'lane', 'brood.']

That’s better but we don’t actually want the punctuation symbols to appear in the result.

E. Capture Groups

We can use a capture group to specify which part of the pattern should be returned as a group, using parentheses.

Let’s first see how this works on a simple example.

No groups

We get a list back.

re.findall(r'\w+\s\d+\w{0,2}', "Let's meet on November 9 at 5pm, or November 12 at 11am or 4pm.")
['November 9', 'at 5pm', 'November 12', 'at 11am', 'or 4pm']

2 capture groups

We get a list of tuples with 2 elements.

re.findall(r'(\w+)\s(\d+)\w{0,2}', "Let's meet on November 9 at 5pm, or November 12 at 11am or 4pm.")
[('November', '9'), ('at', '5'), ('November', '12'), ('at', '11'), ('or', '4')]

1 non-capture group, starting with (?: and 1 capture group.

re.findall(r'(?:\w+)\s(\d+)\w{0,2}', "Let's meet on November 9 at 5pm, or November 12 at 11am or 4pm.")
['9', '5', '12', '11', '4']

Note:

This example is a bit lame. The same result could be achieved more simply, but it illustrates how non-capture groups work.

re.findall(r'\d+', "Let's meet on November 9 at 5pm, or November 12 at 11am or 4pm.")
['9', '5', '12', '11', '4']

Back to our problem of finding the last words of each line.

result = re.findall(r'(\w+).?$', dickens, re.MULTILINE)
print(result)
['belief', 'Darkness', 'had', 'authorities', 'only', 'the', 'with', 'clearer', 'fishes', 'ever', 'five', 'period', 'twentieth', 'had', 'were', 'lane', 'its', 'supernaturally', 'the', 'People', 'strange', 'any', 'lane', 'brood']

Almost there. We just need to take care of hyphenated words and whitespaces.

# You can usually write multiple regular expressions to perform a task,
# however, it's best to try and make the regular expression as discriminating as possible.

# Method 1
result = re.findall('([\w+-?]+\w+).?[\s-]*$', dickens, re.MULTILINE)
print(result)


# Method 2
result = re.findall('([\w+-?]+\w+).?\W*$', dickens, re.MULTILINE)
print(result)


# Method 3
result = re.findall('([\w-]+)\W*$', dickens, re.MULTILINE)
print(result)
['times', 'belief', 'Darkness', 'had', 'way', 'authorities', 'only', 'the', 'with', 'clearer', 'fishes', 'ever', 'seventy-five', 'period', 'five-and-twentieth', 'had', 'were', 'Cock-lane', 'its', 'supernaturally', 'the', 'People', 'strange', 'any', 'Cock-lane', 'brood']
['times', 'belief', 'Darkness', 'had', 'way', 'authorities', 'only', 'the', 'with', 'clearer', 'fishes', 'ever', 'seventy-five', 'period', 'five-and-twentieth', 'had', 'were', 'Cock-lane', 'its', 'supernaturally', 'the', 'People', 'strange', 'any', 'Cock-lane', 'brood']
['times', 'belief', 'Darkness', 'had', 'way', 'authorities', 'only', 'the', 'with', 'clearer', 'fishes', 'ever', 'seventy-five', 'period', 'five-and-twentieth', 'had', 'were', 'Cock-lane', 'its', 'supernaturally', 'the', 'People', 'strange', 'any', 'Cock-lane', 'brood']

3. Data Extraction

When we're working with data we haven't generated ourselves, we usually have little control over its formatting.

What's worse, the formatting can be inconsistent. This is where regular expressions can be extremely useful.

Consider the data below. We'd like to extract the credit card information for each entry.

Notice that the credit card numbers are not formatted in a consistent way.

messy_data = '''
Ms. Dixie T Patenaude 18 rue Descartes STRASBOURG Alsace 67100 FR France Dixie.Patenaude@teleworm.us Shound Cheecaey3s 03.66.62.81.38 Grondin 4/15/1958 MasterCard 5379 7969 2881 8421 958 12/2017 nan 1Z 114 58A 80 2148 893 8 Blue Safety specialist Atlas Realty 2000 Subaru Outback AnalystWatch.fr O- 191.0 86.8 5' 11" 156 dd0548bb-a8b5-438d-b181-c76ad282a9a1 48.577584 7.842637
Mr. Silvano G Romani 34 Faunce Crescent WHITE TOP NSW 2675 AU Australia Silvano-Romani@einrot.com Pock1993 AeV7ziek (02) 6166 5988 Sagese 2/25/1993 MasterCard 5253-7637-4959-3303 404 06-2018 nan 1Z 814 E43 42 9322 015 2 Green Coin vending and amusement machine servicer repairer Miller & Rhoads 1998 Honda S-MX StarJock.com.au B+ 128.3 58.3 6' 2" 189 7e310daa-46f5-407e-8dda-d975715ac4d5 -33.429793 145.234214
Mr. Felix C Fried 37 Jubilee Drive CAXTON nan CB3 5WG GB United Kingdom FelixFried@rhyta.com Derser Aisequ0haz 078 1470 0903 Eisenhauer 1/19/1933 Visa 4716046346218902 738 02 2018 SP 39 75 51 D 1Z V88 635 94 7608 112 4 Blue School psychologist Wetson's 2001 Audi Allroad MoverRelocation.co.uk O+ 188.5 85.7 5' 7" 169 95515377-74a9-4c1e-9117-44b9753dad8c 51.922175 -0.353221
'''
credit_card_number = re.compile(r'(\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4})')
credit_card_number.findall(messy_data)
['5379 7969 2881 8421', '5253-7637-4959-3303', '4716046346218902']

Let’s write a regular expression called height to extract the height of each person in the data.

height = re.compile(r'(\d)\'\s+(\d{,2})"')

height.findall(messy_data)
[('5', '11'), ('6', '2'), ('5', '7')]

For more complex patterns we can use named capture groups.

A. Named Capture Groups

Let's write a regular expression that will extract not just the credit card number, but also the card type, CCV number, and expiry date.

Note:

When you are extracting several groups that correspond to a particular type of information, it's often useful to associate a name with each group. That's what named groups allow you to do.

A named group has the form (?P<name>regex).

credit_card_details = re.compile('(?P<CardType>Visa|MasterCard).*'
                                 '(?P<CardNumber>\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4})[-\s]?'
                                 '(?P<CCV>\d{3})\s+'
                                 '(?P<Expiry>\d{2}[- /]{1}\d{4})')
for line in messy_data.strip().splitlines():
    print(credit_card_details.search(line).groupdict())
{'CardType': 'MasterCard', 'CardNumber': '5379 7969 2881 8421', 'CCV': '958', 'Expiry': '12/2017'}
{'CardType': 'MasterCard', 'CardNumber': '5253-7637-4959-3303', 'CCV': '404', 'Expiry': '06-2018'}
{'CardType': 'Visa', 'CardNumber': '4716046346218902', 'CCV': '738', 'Expiry': '02 2018'}

We can get a nicer output by transforming the dictionary into a dataframe.

pd.DataFrame([credit_card_details.search(line).groupdict() for line in  messy_data.strip().splitlines()])
CCV CardNumber CardType Expiry
0 958 5379 7969 2881 8421 MasterCard 12/2017
1 404 5253-7637-4959-3303 MasterCard 06-2018
2 738 4716046346218902 Visa 02 2018

Let’s now write a regular expression email that will extract the email address from our data, and return the login and internet service provider for each entry, as two named capture groups.

email = re.compile(r'(?P<Login>\S+)@(?P<ISP>\S+)')

pd.DataFrame([email.search(line).groupdict() for line in  messy_data.strip().splitlines()])
ISP Login
0 teleworm.us Dixie.Patenaude
1 einrot.com Silvano-Romani
2 rhyta.com FelixFried

B. Pandas & Regular expressions

For more complex data sets we can of course combine the strengths of regular expressions and Pandas.

As an example, let’s revisit the birthday formatting problem of a previous post.

path = Path("data/people.csv")
data = pd.read_csv(path, encoding='utf-8')
data.head()
GivenName Surname Gender StreetAddress City Country Birthday BloodType
0 Stepanida Sukhorukova female 62 Ockham Road EASTER WHYNTIE United Kingdom 8/25/1968 A+
1 Hiệu Lương male 4 Iolaire Road NEW CROSS United Kingdom 1/31/1962 A+
2 Petra Neudorf female 56 Victoria Road LISTON United Kingdom 1/10/1964 B+
3 Eho Amano female 83 Stroud Rd OGMORE United Kingdom 4/12/1933 O-
4 Noah Niland male 61 Wrexham Rd FACEBY United Kingdom 11/20/1946 A+

We'd like the dates to be formatted as year-month-day.

In a previous notebook we worked out a solution using the datetime module.

Let's do this using regular expressions.

Remark:

In a previous post we did more than simply change the format of the dates. We created special datetime objects which can be used to perform computations on dates. Here we're merely focussing on the formatting aspect to illustrate how regular expressions can be combined with our usual Pandas workflow.

# Compile the regex first for increased speed.
birthday = re.compile(r'(?P<month>\d{1,2})/(?P<day>\d{1,2})/(?P<year>\d{2,4})')

def transform_birthday(row):
    date = birthday.search(row['Birthday']).groupdict()
    return "-".join([date['year'], date['month'], date['day']])
    
data['Birthday'] = data.apply(transform_birthday, axis=1)
data.head()
GivenName Surname Gender StreetAddress City Country Birthday BloodType
0 Stepanida Sukhorukova female 62 Ockham Road EASTER WHYNTIE United Kingdom 1968-8-25 A+
1 Hiệu Lương male 4 Iolaire Road NEW CROSS United Kingdom 1962-1-31 A+
2 Petra Neudorf female 56 Victoria Road LISTON United Kingdom 1964-1-10 B+
3 Eho Amano female 83 Stroud Rd OGMORE United Kingdom 1933-4-12 O-
4 Noah Niland male 61 Wrexham Rd FACEBY United Kingdom 1946-11-20 A+

By using Pandas’ advanced regex syntax we can achieve the same thing in one line of code.

Let’s reload the original data.

data = pd.read_csv(path, encoding='utf-8')
data.head()
GivenName Surname Gender StreetAddress City Country Birthday BloodType
0 Stepanida Sukhorukova female 62 Ockham Road EASTER WHYNTIE United Kingdom 8/25/1968 A+
1 Hiệu Lương male 4 Iolaire Road NEW CROSS United Kingdom 1/31/1962 A+
2 Petra Neudorf female 56 Victoria Road LISTON United Kingdom 1/10/1964 B+
3 Eho Amano female 83 Stroud Rd OGMORE United Kingdom 4/12/1933 O-
4 Noah Niland male 61 Wrexham Rd FACEBY United Kingdom 11/20/1946 A+

We can extract the day, month and year of birth for each person in one line of code by passing the compiled regex to the str.extract method of our Pandas Series corresponding to the Birthday column.

data.Birthday.str.extract(birthday, expand=True).head()
month day year
0 8 25 1968
1 1 31 1962
2 1 10 1964
3 4 12 1933
4 11 20 1946

Note that the names of the columns have been automatically extracted from the named groups of the regex birthday.

Using this, we can use the apply method to process the date in the format that we want.

For more clarity, we wrap our method chaining expression in parentheses, which allows us to write each method call on a new line.

data.Birthday = (data
                     .Birthday
                     .str
                     .extract(birthday, expand=True)
                     .apply(lambda date:"-".join([date['year'], 
                                                  date['month'], 
                                                  date['day']]), 
                            axis=1)
                )
data.head()
GivenName Surname Gender StreetAddress City Country Birthday BloodType
0 Stepanida Sukhorukova female 62 Ockham Road EASTER WHYNTIE United Kingdom 1968-8-25 A+
1 Hiệu Lương male 4 Iolaire Road NEW CROSS United Kingdom 1962-1-31 A+
2 Petra Neudorf female 56 Victoria Road LISTON United Kingdom 1964-1-10 B+
3 Eho Amano female 83 Stroud Rd OGMORE United Kingdom 1933-4-12 O-
4 Noah Niland male 61 Wrexham Rd FACEBY United Kingdom 1946-11-20 A+

That being said, the best way to achieve our goal in pandas is to let it parse the dates automatically!

data = pd.read_csv(path, encoding='utf-8', parse_dates=['Birthday'])
data.head()
GivenName Surname Gender StreetAddress City Country Birthday BloodType
0 Stepanida Sukhorukova female 62 Ockham Road EASTER WHYNTIE United Kingdom 1968-08-25 A+
1 Hiệu Lương male 4 Iolaire Road NEW CROSS United Kingdom 1962-01-31 A+
2 Petra Neudorf female 56 Victoria Road LISTON United Kingdom 1964-01-10 B+
3 Eho Amano female 83 Stroud Rd OGMORE United Kingdom 1933-04-12 O-
4 Noah Niland male 61 Wrexham Rd FACEBY United Kingdom 1946-11-20 A+
data.dtypes
GivenName                object
Surname                  object
Gender                   object
StreetAddress            object
City                     object
Country                  object
Birthday         datetime64[ns]
BloodType                object
dtype: object

The End…

Pandas (Blood)groupby

from math import *
import numpy as np
import pandas as pd
from pathlib import Path

%matplotlib inline
import matplotlib.pyplot as plt

from IPython.core.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

1. Pandas GroupBy

To illustrate how the pandas groupby operation works let’s use some fake data. With numpy it’s trivial to generate random numerical data, however, it’s usually a lot more tedious to generate random people data that looks realistic.

A very useful tool for this is the Fake Name Generator. That’s what I used to generate this dataset.

Let’s load the dataset into a dataframe.

input_path = Path('data/people.csv')

data = pd.read_csv(input_path, encoding='utf-8')
data.head()
GivenName Surname Gender StreetAddress City Country Birthday BloodType
0 Stepanida Sukhorukova female 62 Ockham Road EASTER WHYNTIE United Kingdom 8/25/1968 A+
1 Hiệu Lương male 4 Iolaire Road NEW CROSS United Kingdom 1/31/1962 A+
2 Petra Neudorf female 56 Victoria Road LISTON United Kingdom 1/10/1964 B+
3 Eho Amano female 83 Stroud Rd OGMORE United Kingdom 4/12/1933 O-
4 Noah Niland male 61 Wrexham Rd FACEBY United Kingdom 11/20/1946 A+

The data contains information about fictitious people.

data.count()
GivenName        5000
Surname          5000
Gender           5000
StreetAddress    5000
City             5000
Country          5000
Birthday         5000
BloodType        5000
dtype: int64

Remark:

We've got different types of variables here. For instance, Gender is a categorical variable, as are BloodType and Country.

Birthday can be thought of as an interval variable, although we'll convert it to an Age variable, which is more convenient for our purpose, and will be treated as a continuous numeric variable.

Notice, however, that by default pandas will identify non-numeric data as a generic object, which is not the most efficient way of handling categorical data. Let's fix that.

data.dtypes
GivenName        object
Surname          object
Gender           object
StreetAddress    object
City             object
Country          object
Birthday         object
BloodType        object
dtype: object
data['Gender']  = data['Gender'].astype('category')
data['Country'] = data['Country'].astype('category')
data['BloodType']  = data['BloodType'].astype('category')
data.dtypes
GivenName          object
Surname            object
Gender           category
StreetAddress      object
City               object
Country          category
Birthday           object
BloodType        category
dtype: object
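As a rough, back-of-the-envelope check of the memory claim above (an aside of ours; exact numbers will vary):

# Compare the footprint of Country stored as plain Python strings vs as a categorical
print(data['Country'].astype('object').memory_usage(deep=True))
print(data['Country'].memory_usage(deep=True))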

A. Dealing with dates.

Method 1 - datetime module.

One of the features (columns) in our dataset is Birthday. While this information may be useful if you wanted to find out, for instance, how many people share a birthday, most of the time we mainly care about their age.

One way to convert the Birthday into an Age feature would be to extract the year and compute the current year minus the birthday year. The split method of string objects could be used for that, however, there’s a more elegant and general way of handling dates in Python using the datetime.strptime function.

Remark:

This isn't necessarily the fastest, or best way to handle date and time objects in pandas.

from datetime import datetime

t1 = datetime.strptime("21/01/2019", "%d/%m/%Y")
print(t1.year)
print(datetime.today()-t1)
2019
62 days, 20:14:58.495945

The datetime module allows you to manipulate date objects and perform basic operations on dates. It also allows you to format dates in a consistent way.
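For instance (a small aside, using the t1 object defined above), strftime re-emits a parsed date in whatever format we like, and subtracting two dates gives a timedelta:

print(t1.strftime("%Y-%m-%d"))       # 2019-01-21
print(datetime(2019, 3, 1) - t1)     # 39 days, 0:00:00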

We can apply this to our Birthday feature.

data.Birthday[0], data.Birthday[1]
('8/25/1968', '1/31/1962')
datetime.strptime(data.Birthday[0],"%m/%d/%Y")
datetime.datetime(1968, 8, 25, 0, 0)
datetime.strptime(data.Birthday[1],"%m/%d/%Y")
datetime.datetime(1962, 1, 31, 0, 0)

Let’s use the Pandas apply method to format the dates in a consistent way

data.Birthday = data.apply(lambda row: datetime.strptime(row['Birthday'], "%m/%d/%Y"), axis='columns')
data.Birthday.head()
0   1968-08-25
1   1962-01-31
2   1964-01-10
3   1933-04-12
4   1946-11-20
Name: Birthday, dtype: datetime64[ns]

Method 2 - using the to_datetime method of pandas.

original_data = pd.read_csv(input_path, encoding='utf-8')
original_data.Birthday = pd.to_datetime(original_data.Birthday)
original_data.Birthday.head()
0   1968-08-25
1   1962-01-31
2   1964-01-10
3   1933-04-12
4   1946-11-20
Name: Birthday, dtype: datetime64[ns]

Method 3 - convert at reading time.

original_data = pd.read_csv(input_path, encoding='utf-8', parse_dates=['Birthday'])
original_data.Birthday.head()
0   1968-08-25
1   1962-01-31
2   1964-01-10
3   1933-04-12
4   1946-11-20
Name: Birthday, dtype: datetime64[ns]

We can now define an Age feature by subtracting the year of birth from the current year.

Remark:

Because we are not hard-coding (typing by hand) the current date, if we come back to this code in a year's time and run it again, the Age will be updated automatically.

We could compute the age using apply to iterate through the rows of the dataframe…

data.apply(lambda row: datetime.now().year - row['Birthday'].year, axis='columns').head()
0    51
1    57
2    55
3    86
4    73
dtype: int64

However, it is usually faster to operate on the whole dataframe at once.

Datetime methods on a pandas series (such as a column of a dataframe) can be accessed via the dt accessor.

(datetime.now().year - data.Birthday.dt.year).head()
0    51
1    57
2    55
3    86
4    73
Name: Birthday, dtype: int64
data['Age'] = datetime.now().year - data.Birthday.dt.year
data.head()
GivenName Surname Gender StreetAddress City Country Birthday BloodType Age
0 Stepanida Sukhorukova female 62 Ockham Road EASTER WHYNTIE United Kingdom 1968-08-25 A+ 51
1 Hiệu Lương male 4 Iolaire Road NEW CROSS United Kingdom 1962-01-31 A+ 57
2 Petra Neudorf female 56 Victoria Road LISTON United Kingdom 1964-01-10 B+ 55
3 Eho Amano female 83 Stroud Rd OGMORE United Kingdom 1933-04-12 O- 86
4 Noah Niland male 61 Wrexham Rd FACEBY United Kingdom 1946-11-20 A+ 73

B. GroupBy feature.

Given the data, common questions to answer would be how certain features are distributed in each country, or for each gender.

These could range from something as simple as What is the average age in each country? to much more complex questions such as How many people of each blood type are there in each country, for each gender, for a given age group?

Fortunately, Pandas’ GroupBy method allows us to organise the data in ways only limited by our sagacity.

Let's look at the country distribution first.

data.Country.value_counts()
Australia         1800
Italy              800
United States      700
United Kingdom     500
New Zealand        500
France             500
Iceland            200
Name: Country, dtype: int64
data.Country.value_counts(ascending=True).plot(kind='barh', color='g', alpha=0.5);

png

The groupby method creates a GroupBy object.

The groupby object is a recipe for how to perform operations on the data.

To use a groupby object, we need to perform some operation on it.

grouped_by_country = data.groupby('Country')
grouped_by_country
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x106ca5710>
grouped_by_country.size()
Country
Australia         1800
France             500
Iceland            200
Italy              800
New Zealand        500
United Kingdom     500
United States      700
dtype: int64
grouped_by_country.ngroups
7
for group in grouped_by_country.groups:
    print(group)
Australia
France
Iceland
Italy
New Zealand
United Kingdom
United States
grouped_by_country['Age'].mean()
Country
Australia         57.817778
France            57.944000
Iceland           57.660000
Italy             56.607500
New Zealand       56.630000
United Kingdom    57.740000
United States     57.415714
Name: Age, dtype: float64

Remark:

Recall that the apply method allows you to apply an arbitrary function to all the rows in your dataframe. Therefore, as long as you can express your operations as a function (lambda or otherwise), you can include it in the apply, even if your function returns multiple values, provided you wrap them in a tuple.

grouped_by_country['Age'].apply(lambda x: (np.min(x), 
                                           f'{x.mean():0.2f}', 
                                           np.max(x), 
                                           f'{x.std():0.2f}')
                                )
Country
Australia         (24, 57.82, 91, 19.49)
France            (24, 57.94, 91, 19.54)
Iceland           (24, 57.66, 91, 19.46)
Italy             (24, 56.61, 91, 18.98)
New Zealand       (24, 56.63, 91, 19.05)
United Kingdom    (24, 57.74, 91, 19.53)
United States     (24, 57.42, 91, 19.29)
Name: Age, dtype: object

A different and nicer output can be obtained using the agg method on a groupby object.

grouped_by_country['Age'].agg(['min','mean','max','std'])
min mean max std
Country
Australia 24 57.817778 91 19.488031
France 24 57.944000 91 19.541869
Iceland 24 57.660000 91 19.461194
Italy 24 56.607500 91 18.981275
New Zealand 24 56.630000 91 19.049905
United Kingdom 24 57.740000 91 19.527905
United States 24 57.415714 91 19.285003

For categorical variables, such as BloodType, basic information can be extracted using the describe method.

grouped_by_country['BloodType'].describe()
count unique top freq
Country
Australia 1800 8 O+ 648
France 500 8 O+ 184
Iceland 200 7 O+ 86
Italy 800 8 O+ 299
New Zealand 500 8 O+ 185
United Kingdom 500 8 O+ 188
United States 700 8 O+ 273
grouped_by_country['GivenName'].describe()
count unique top freq
Country
Australia 1800 1368 Michael 9
France 500 444 Anna 4
Iceland 200 193 Dieter 2
Italy 800 677 Jennifer 6
New Zealand 500 447 James 4
United Kingdom 500 450 Ellis 3
United States 700 620 Lily 5

describe only tells us about the most frequent blood type. To get a count of all the blood types, let's use a Counter object.

Which item appears most frequently?

Counting the number of objects of a given type is such a common operation that Python has a very useful Counter object in the collections module. What a Counter object does to a sequence of items is group the elements of the sequence into bins according to their value, and count how many elements each bin has. A Counter object also has several useful methods, such as most_common, to determine the most common objects in the sequence.
from collections import Counter

L = ['a', 'b', 'c', 'a', 'a', 'c', 'b', 'b', 'b', 'b']
c = Counter(L)
c
Counter({'a': 3, 'b': 5, 'c': 2})
print(c.most_common())
print(c.most_common(2))
print(c.most_common(1))
[('b', 5), ('a', 3), ('c', 2)]
[('b', 5), ('a', 3)]
[('b', 5)]
grouped_by_country['BloodType'].apply(Counter)
Country            
Australia       A+     468.0
                A-      59.0
                AB+    100.0
                AB-      8.0
                B+     419.0
                B-      24.0
                O+     648.0
                O-      74.0
France          A+     145.0
                A-      17.0
                AB+     19.0
                AB-      1.0
                B+     106.0
                B-       4.0
                O+     184.0
                O-      24.0
Iceland         A+      45.0
                A-       6.0
                AB+      5.0
                B+      44.0
                B-       5.0
                O+      86.0
                O-       9.0
Italy           A+     207.0
                A-      31.0
                AB+     37.0
                AB-      5.0
                B+     185.0
                B-       4.0
                O+     299.0
                O-      32.0
New Zealand     A+     154.0
                A-       9.0
                AB+     21.0
                AB-      5.0
                B+     100.0
                B-       5.0
                O+     185.0
                O-      21.0
United Kingdom  A+     132.0
                A-      23.0
                AB+     25.0
                AB-      4.0
                B+     102.0
                B-       7.0
                O+     188.0
                O-      19.0
United States   A+     157.0
                A-      25.0
                AB+     35.0
                AB-      4.0
                B+     157.0
                B-      15.0
                O+     273.0
                O-      34.0
Name: BloodType, dtype: float64

Remark:

Note how the result is one long column of information. This result is actually a Pandas Series object (recall that a Series object is essentially an array with an index).

It may look a bit like a dataframe, but remember that in a dataframe each column corresponds to a unique feature. Here, what looks like the second column is actually a second level for the index. We'll talk about multi-index shortly.

The result as it stands may not be the easiest to read. It would be better if we could transform it back into a dataframe so that we could compare the results for two countries more easily. In Pandas, you can do that with one command: unstack.

grouped_by_country['BloodType'].apply(Counter).unstack()
A+ A- AB+ AB- B+ B- O+ O-
Country
Australia 468.0 59.0 100.0 8.0 419.0 24.0 648.0 74.0
France 145.0 17.0 19.0 1.0 106.0 4.0 184.0 24.0
Iceland 45.0 6.0 5.0 NaN 44.0 5.0 86.0 9.0
Italy 207.0 31.0 37.0 5.0 185.0 4.0 299.0 32.0
New Zealand 154.0 9.0 21.0 5.0 100.0 5.0 185.0 21.0
United Kingdom 132.0 23.0 25.0 4.0 102.0 7.0 188.0 19.0
United States 157.0 25.0 35.0 4.0 157.0 15.0 273.0 34.0

If we want to switch the index and the columns around, we need to specify the level parameter in unstack() which, by default, is -1, that is the last (innermost) level of the index.

grouped_by_country['BloodType'].apply(Counter).unstack(level=0)
Country Australia France Iceland Italy New Zealand United Kingdom United States
A+ 468.0 145.0 45.0 207.0 154.0 132.0 157.0
A- 59.0 17.0 6.0 31.0 9.0 23.0 25.0
AB+ 100.0 19.0 5.0 37.0 21.0 25.0 35.0
AB- 8.0 1.0 NaN 5.0 5.0 4.0 4.0
B+ 419.0 106.0 44.0 185.0 100.0 102.0 157.0
B- 24.0 4.0 5.0 4.0 5.0 7.0 15.0
O+ 648.0 184.0 86.0 299.0 185.0 188.0 273.0
O- 74.0 24.0 9.0 32.0 21.0 19.0 34.0

The level can also be specified by name

grouped_by_country['BloodType'].apply(Counter).unstack(level='Country')
Country Australia France Iceland Italy New Zealand United Kingdom United States
A+ 468.0 145.0 45.0 207.0 154.0 132.0 157.0
A- 59.0 17.0 6.0 31.0 9.0 23.0 25.0
AB+ 100.0 19.0 5.0 37.0 21.0 25.0 35.0
AB- 8.0 1.0 NaN 5.0 5.0 4.0 4.0
B+ 419.0 106.0 44.0 185.0 100.0 102.0 157.0
B- 24.0 4.0 5.0 4.0 5.0 7.0 15.0
O+ 648.0 184.0 86.0 299.0 185.0 188.0 273.0
O- 74.0 24.0 9.0 32.0 21.0 19.0 34.0

Note that we have a NaN in our dataframe. This indicates that no one from Iceland has blood type AB- in our dataset.

Since there was no such data in our dataset, it appears as a missing value.

However, here, a value of 0 would be more appropriate. We can tell the dataframe to replace its missing values by 0 using the fillna method.

Since we’re dealing with count data, it’s also a good idea to convert the type to int.

grouped_by_country['BloodType'].apply(Counter).unstack().fillna(0).astype(int)
A+ A- AB+ AB- B+ B- O+ O-
Country
Australia 468 59 100 8 419 24 648 74
France 145 17 19 1 106 4 184 24
Iceland 45 6 5 0 44 5 86 9
Italy 207 31 37 5 185 4 299 32
New Zealand 154 9 21 5 100 5 185 21
United Kingdom 132 23 25 4 102 7 188 19
United States 157 25 35 4 157 15 273 34

Remark:

We are using fake data so don't give too much credence to these results.

This being said, blood type frequencies do vary across countries and you can read more about it here.

C. GroupBy on multiple features.

Of course, GroupBy operations aren’t limited to single features.

If we want to see the data grouped by country and blood type, we just need to specify both features when constructing the GroupBy object.

Note that with a groupby on multiple features we can obtain the previous result in a different way.

data.groupby(['Country','BloodType']).count().iloc[:, 0].unstack().fillna(0).astype(int)
BloodType A+ A- AB+ AB- B+ B- O+ O-
Country
Australia 468 59 100 8 419 24 648 74
France 145 17 19 1 106 4 184 24
Iceland 45 6 5 0 44 5 86 9
Italy 207 31 37 5 185 4 299 32
New Zealand 154 9 21 5 100 5 185 21
United Kingdom 132 23 25 4 102 7 188 19
United States 157 25 35 4 157 15 273 34

Let’s group our data based on country and gender.

grouped_by_country_and_gender = data.groupby(['Country', 'Gender'])

grouped_by_country_and_gender['BloodType'].apply(Counter).unstack()
A+ A- AB+ AB- B+ B- O+ O-
Country Gender
Australia female 229.0 29.0 54.0 4.0 205.0 13.0 328.0 41.0
male 239.0 30.0 46.0 4.0 214.0 11.0 320.0 33.0
France female 76.0 11.0 10.0 NaN 47.0 2.0 79.0 16.0
male 69.0 6.0 9.0 1.0 59.0 2.0 105.0 8.0
Iceland female 22.0 5.0 2.0 NaN 17.0 4.0 34.0 8.0
male 23.0 1.0 3.0 NaN 27.0 1.0 52.0 1.0
Italy female 102.0 14.0 22.0 1.0 89.0 1.0 160.0 17.0
male 105.0 17.0 15.0 4.0 96.0 3.0 139.0 15.0
New Zealand female 75.0 5.0 8.0 1.0 53.0 2.0 91.0 9.0
male 79.0 4.0 13.0 4.0 47.0 3.0 94.0 12.0
United Kingdom female 65.0 6.0 14.0 1.0 55.0 4.0 99.0 13.0
male 67.0 17.0 11.0 3.0 47.0 3.0 89.0 6.0
United States female 89.0 13.0 18.0 3.0 77.0 7.0 138.0 12.0
male 68.0 12.0 17.0 1.0 80.0 8.0 135.0 22.0

Once again let’s replace missing values by 0.

grouped_by_country_and_gender['BloodType'].apply(Counter).unstack().fillna(0).astype(int)
A+ A- AB+ AB- B+ B- O+ O-
Country Gender
Australia female 229 29 54 4 205 13 328 41
male 239 30 46 4 214 11 320 33
France female 76 11 10 0 47 2 79 16
male 69 6 9 1 59 2 105 8
Iceland female 22 5 2 0 17 4 34 8
male 23 1 3 0 27 1 52 1
Italy female 102 14 22 1 89 1 160 17
male 105 17 15 4 96 3 139 15
New Zealand female 75 5 8 1 53 2 91 9
male 79 4 13 4 47 3 94 12
United Kingdom female 65 6 14 1 55 4 99 13
male 67 17 11 3 47 3 89 6
United States female 89 13 18 3 77 7 138 12
male 68 12 17 1 80 8 135 22
grouped_by_country_and_gender.mean()
Age
Country Gender
Australia female 58.101883
male 57.531773
France female 59.207469
male 56.768340
Iceland female 57.315217
male 57.953704
Italy female 57.300493
male 55.893401
New Zealand female 55.823770
male 57.398438
United Kingdom female 57.284047
male 58.222222
United States female 57.299720
male 57.536443

Notice how we didn’t specify the Age feature. Pandas automatically outputs results for which the mean function makes sense. Here only Age fits the bill.

Like with any dataframe, you can also use apply to map a function to the grouped data; however, for anything more complex, like applying multiple functions at once, the agg method is more convenient.

grouped_by_country_and_gender['Age'].agg(['min','max','mean','std'])
min max mean std
Country Gender
Australia female 24 91 58.101883 19.587278
male 24 91 57.531773 19.394326
France female 25 91 59.207469 19.041143
male 24 91 56.768340 19.961407
Iceland female 24 91 57.315217 19.527343
male 24 91 57.953704 19.490896
Italy female 24 91 57.300493 19.116494
male 24 90 55.893401 18.838508
New Zealand female 24 91 55.823770 19.546233
male 24 91 57.398438 18.570202
United Kingdom female 24 91 57.284047 20.168651
male 24 91 58.222222 18.856132
United States female 24 91 57.299720 19.167134
male 24 91 57.536443 19.434197

For descriptive statistics of numerical data, the quantile function is also useful.

Remark:

The 0.5 quantile is the median of our data.
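A quick sanity check of that claim (our own aside, on the full dataset):

print(data['Age'].median(), data['Age'].quantile(0.5))   # the two values should agree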

grouped_by_country_and_gender['Age'].quantile((0.10, 0.25, 0.5, 0.75, 0.90)).unstack()
0.1 0.25 0.5 0.75 0.9
Country Gender
Australia female 31.0 41.00 58.0 75.00 85.0
male 30.0 41.00 58.0 74.00 85.0
France female 31.0 44.00 61.0 76.00 84.0
male 30.0 37.00 57.0 75.00 83.0
Iceland female 32.1 40.75 59.0 74.00 83.9
male 32.0 42.00 56.5 74.00 85.0
Italy female 31.0 41.00 57.0 73.00 85.0
male 31.0 39.00 55.0 71.00 84.0
New Zealand female 29.3 39.00 54.0 72.25 83.0
male 32.0 41.00 58.0 72.00 83.5
United Kingdom female 29.6 39.00 57.0 75.00 85.0
male 30.2 43.00 59.0 73.00 83.0
United States female 31.0 41.00 57.0 74.00 84.0
male 30.0 40.50 58.0 74.00 83.0

How can we use the grouped_by_country_and_gender object to output the number of people who were born in a given month (numbered 1 to 12), for each country, and for each gender?

Hint: This can be done in one line using apply and value_counts, but you need to make sure you keep track of the type of object you are dealing with at each stage of the pipeline.

# Method 1 - one liner 
(grouped_by_country_and_gender['Birthday']
            .apply(lambda x: x.dt.month.value_counts())
            .unstack()
            .fillna(0))
1 2 3 4 5 6 7 8 9 10 11 12
Country Gender
Australia female 83 75 70 66 71 78 90 77 70 65 73 85
male 69 64 74 71 74 57 77 91 80 81 81 78
France female 26 15 18 13 30 22 22 19 11 21 21 23
male 22 15 22 28 19 22 26 23 26 22 16 18
Iceland female 7 11 3 6 5 10 10 12 9 4 7 8
male 7 11 15 2 8 9 11 6 12 10 9 8
Italy female 31 34 39 33 36 28 33 36 36 37 34 29
male 31 28 35 41 27 28 37 33 29 33 33 39
New Zealand female 18 20 19 21 22 26 26 15 14 23 23 17
male 21 22 16 22 30 19 16 25 18 25 23 19
United Kingdom female 17 21 19 22 14 20 31 22 19 24 34 14
male 21 16 22 26 20 21 23 18 17 21 19 19
United States female 38 32 35 24 37 25 25 25 32 31 21 32
male 44 30 23 25 23 29 31 29 26 28 26 29
# Method 2 - Starting from the original data
# create a month series first and use it to perform groupby

months = data.Birthday.dt.month
(data
   .groupby(['Country', 'Gender', months])['Age']
   .count()
   .unstack()
   .fillna(0))
Birthday 1 2 3 4 5 6 7 8 9 10 11 12
Country Gender
Australia female 83 75 70 66 71 78 90 77 70 65 73 85
male 69 64 74 71 74 57 77 91 80 81 81 78
France female 26 15 18 13 30 22 22 19 11 21 21 23
male 22 15 22 28 19 22 26 23 26 22 16 18
Iceland female 7 11 3 6 5 10 10 12 9 4 7 8
male 7 11 15 2 8 9 11 6 12 10 9 8
Italy female 31 34 39 33 36 28 33 36 36 37 34 29
male 31 28 35 41 27 28 37 33 29 33 33 39
New Zealand female 18 20 19 21 22 26 26 15 14 23 23 17
male 21 22 16 22 30 19 16 25 18 25 23 19
United Kingdom female 17 21 19 22 14 20 31 22 19 24 34 14
male 21 16 22 26 20 21 23 18 17 21 19 19
United States female 38 32 35 24 37 25 25 25 32 31 21 32
male 44 30 23 25 23 29 31 29 26 28 26 29
# Note that although we've used 'Age' here, any feature would have done. 
# Do you see why?

# Method 3 - Starting from the original data and assign a new month column
(data
   .assign(month=data.Birthday.dt.month)
   .groupby(['Country', 'Gender', 'month'])['Age']
   .count()
   .unstack()
   .fillna(0))
month 1 2 3 4 5 6 7 8 9 10 11 12
Country Gender
Australia female 83 75 70 66 71 78 90 77 70 65 73 85
male 69 64 74 71 74 57 77 91 80 81 81 78
France female 26 15 18 13 30 22 22 19 11 21 21 23
male 22 15 22 28 19 22 26 23 26 22 16 18
Iceland female 7 11 3 6 5 10 10 12 9 4 7 8
male 7 11 15 2 8 9 11 6 12 10 9 8
Italy female 31 34 39 33 36 28 33 36 36 37 34 29
male 31 28 35 41 27 28 37 33 29 33 33 39
New Zealand female 18 20 19 21 22 26 26 15 14 23 23 17
male 21 22 16 22 30 19 16 25 18 25 23 19
United Kingdom female 17 21 19 22 14 20 31 22 19 24 34 14
male 21 16 22 26 20 21 23 18 17 21 19 19
United States female 38 32 35 24 37 25 25 25 32 31 21 32
male 44 30 23 25 23 29 31 29 26 28 26 29

D. MultiIndex.

We can of course group the data by more than 2 features.

However, results might get a bit harder to take in.

s = data.groupby(['Country','Gender','BloodType'])['Age'].mean().round(2)
s
Country         Gender  BloodType
Australia       female  A+           59.36
                        A-           59.90
                        AB+          53.13
                        AB-          61.50
                        B+           56.62
                        B-           66.00
                        O+           58.46
                        O-           58.02
                male    A+           57.67
                        A-           51.70
                        AB+          59.98
                        AB-          62.00
                        B+           58.79
                        B-           44.73
                        O+           56.94
                        O-           59.76
France          female  A+           58.54
                        A-           53.09
                        AB+          63.40
                        B+           58.47
                        B-           61.50
                        O+           60.18
                        O-           61.06
                male    A+           60.81
                        A-           56.00
                        AB+          62.78
                        AB-          41.00
                        B+           55.03
                        B-           67.00
                        O+           53.90
                                     ...  
United Kingdom  female  AB+          49.79
                        AB-          63.00
                        B+           55.33
                        B-           65.50
                        O+           57.16
                        O-           63.23
                male    A+           57.39
                        A-           60.41
                        AB+          53.91
                        AB-          65.33
                        B+           59.68
                        B-           32.67
                        O+           58.37
                        O-           64.83
United States   female  A+           57.46
                        A-           71.54
                        AB+          58.44
                        AB-          53.67
                        B+           55.75
                        B-           65.57
                        O+           57.57
                        O-           41.83
                male    A+           55.07
                        A-           47.08
                        AB+          55.12
                        AB-          66.00
                        B+           54.34
                        B-           71.25
                        O+           61.33
                        O-           55.73
Name: Age, Length: 109, dtype: float64

Remark:

s is a Pandas Series object with a MultiIndex object as its index.

s.index
MultiIndex(levels=[['Australia', 'France', 'Iceland', 'Italy', 'New Zealand', 'United Kingdom', 'United States'], ['female', 'male'], ['A+', 'A-', 'AB+', 'AB-', 'B+', 'B-', 'O+', 'O-']],
           codes=[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6], [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1], [0, 1, 2, 3, 4, 5, 6, 7, 0, 1, 2, 3, 4, 5, 6, 7, 0, 1, 2, 4, 5, 6, 7, 0, 1, 2, 3, 4, 5, 6, 7, 0, 1, 2, 4, 5, 6, 7, 0, 1, 2, 4, 5, 6, 7, 0, 1, 2, 3, 4, 5, 6, 7, 0, 1, 2, 3, 4, 5, 6, 7, 0, 1, 2, 3, 4, 5, 6, 7, 0, 1, 2, 3, 4, 5, 6, 7, 0, 1, 2, 3, 4, 5, 6, 7, 0, 1, 2, 3, 4, 5, 6, 7, 0, 1, 2, 3, 4, 5, 6, 7, 0, 1, 2, 3, 4, 5, 6, 7]],
           names=['Country', 'Gender', 'BloodType'])

We can convert the MultiIndexed series to a MultiIndexed dataframe using unstack.

df = s.unstack().fillna(0)
df
BloodType A+ A- AB+ AB- B+ B- O+ O-
Country Gender
Australia female 59.36 59.90 53.13 61.50 56.62 66.00 58.46 58.02
male 57.67 51.70 59.98 62.00 58.79 44.73 56.94 59.76
France female 58.54 53.09 63.40 0.00 58.47 61.50 60.18 61.06
male 60.81 56.00 62.78 41.00 55.03 67.00 53.90 65.62
Iceland female 56.95 63.40 63.00 0.00 60.06 67.25 53.15 60.00
male 62.09 35.00 57.00 0.00 59.37 74.00 55.13 81.00
Italy female 55.68 55.36 68.50 46.00 56.07 57.00 57.96 55.12
male 54.45 54.47 62.60 44.75 54.82 66.00 57.32 55.47
New Zealand female 57.49 59.20 50.50 67.00 58.19 45.00 53.22 58.33
male 60.39 74.75 65.08 64.00 53.96 52.33 54.68 57.42
United Kingdom female 58.42 63.17 49.79 63.00 55.33 65.50 57.16 63.23
male 57.39 60.41 53.91 65.33 59.68 32.67 58.37 64.83
United States female 57.46 71.54 58.44 53.67 55.75 65.57 57.57 41.83
male 55.07 47.08 55.12 66.00 54.34 71.25 61.33 55.73

To access specific data using the MultiIndex for a Series, you can use the xs method and specify the level at which you are extracting the data, if you are not using the top level.

To access specific data using the MultiIndex for a DataFrame, you can use the loc method, but specifying the level is a bit more involved.

Top level selection

s['Australia']
Gender  BloodType
female  A+           59.36
        A-           59.90
        AB+          53.13
        AB-          61.50
        B+           56.62
        B-           66.00
        O+           58.46
        O-           58.02
male    A+           57.67
        A-           51.70
        AB+          59.98
        AB-          62.00
        B+           58.79
        B-           44.73
        O+           56.94
        O-           59.76
Name: Age, dtype: float64
df.loc['Australia', :]
BloodType A+ A- AB+ AB- B+ B- O+ O-
Gender
female 59.36 59.9 53.13 61.5 56.62 66.00 58.46 58.02
male 57.67 51.7 59.98 62.0 58.79 44.73 56.94 59.76

This is equivalent to:

s['Australia'].unstack()
BloodType A+ A- AB+ AB- B+ B- O+ O-
Gender
female 59.36 59.9 53.13 61.5 56.62 66.00 58.46 58.02
male 57.67 51.7 59.98 62.0 58.79 44.73 56.94 59.76

The levels can be exchanged either by unstacking the right level, or by transposing the resulting dataframe.

s['Australia'].unstack(level='Gender')
Gender female male
BloodType
A+ 59.36 57.67
A- 59.90 51.70
AB+ 53.13 59.98
AB- 61.50 62.00
B+ 56.62 58.79
B- 66.00 44.73
O+ 58.46 56.94
O- 58.02 59.76
s['Australia'].unstack().T
Gender female male
BloodType
A+ 59.36 57.67
A- 59.90 51.70
AB+ 53.13 59.98
AB- 61.50 62.00
B+ 56.62 58.79
B- 66.00 44.73
O+ 58.46 56.94
O- 58.02 59.76

To select data at a deeper level of the multiindex, you need to specify the level.

Here’s an example with Gender.

Notice the different syntax using xs.

s.xs('female', level='Gender').unstack()
BloodType A+ A- AB+ AB- B+ B- O+ O-
Country
Australia 59.36 59.90 53.13 61.50 56.62 66.00 58.46 58.02
France 58.54 53.09 63.40 NaN 58.47 61.50 60.18 61.06
Iceland 56.95 63.40 63.00 NaN 60.06 67.25 53.15 60.00
Italy 55.68 55.36 68.50 46.00 56.07 57.00 57.96 55.12
New Zealand 57.49 59.20 50.50 67.00 58.19 45.00 53.22 58.33
United Kingdom 58.42 63.17 49.79 63.00 55.33 65.50 57.16 63.23
United States 57.46 71.54 58.44 53.67 55.75 65.57 57.57 41.83

Selecting the data from a deeper level is trickier with a dataframe.

We need to specify an index slice.

df.loc[(slice(None), 'female'), :].droplevel('Gender')
BloodType A+ A- AB+ AB- B+ B- O+ O-
Country
Australia 59.36 59.90 53.13 61.50 56.62 66.00 58.46 58.02
France 58.54 53.09 63.40 0.00 58.47 61.50 60.18 61.06
Iceland 56.95 63.40 63.00 0.00 60.06 67.25 53.15 60.00
Italy 55.68 55.36 68.50 46.00 56.07 57.00 57.96 55.12
New Zealand 57.49 59.20 50.50 67.00 58.19 45.00 53.22 58.33
United Kingdom 58.42 63.17 49.79 63.00 55.33 65.50 57.16 63.23
United States 57.46 71.54 58.44 53.67 55.75 65.57 57.57 41.83

or alternatively, using a pandas IndexSlice object.

idx = pd.IndexSlice
df.loc[idx[:, 'female'], :].droplevel('Gender')
BloodType A+ A- AB+ AB- B+ B- O+ O-
Country
Australia 59.36 59.90 53.13 61.50 56.62 66.00 58.46 58.02
France 58.54 53.09 63.40 0.00 58.47 61.50 60.18 61.06
Iceland 56.95 63.40 63.00 0.00 60.06 67.25 53.15 60.00
Italy 55.68 55.36 68.50 46.00 56.07 57.00 57.96 55.12
New Zealand 57.49 59.20 50.50 67.00 58.19 45.00 53.22 58.33
United Kingdom 58.42 63.17 49.79 63.00 55.33 65.50 57.16 63.23
United States 57.46 71.54 58.44 53.67 55.75 65.57 57.57 41.83

The same applies if you want to select at the BloodType level.

s.xs('O+', level='BloodType').unstack().fillna(0)
Gender female male
Country
Australia 58.46 56.94
France 60.18 53.90
Iceland 53.15 55.13
Italy 57.96 57.32
New Zealand 53.22 54.68
United Kingdom 57.16 58.37
United States 57.57 61.33
df.loc[:, ['O+']].unstack().droplevel('BloodType', axis='columns')
Gender female male
Country
Australia 58.46 56.94
France 60.18 53.90
Iceland 53.15 55.13
Italy 57.96 57.32
New Zealand 53.22 54.68
United Kingdom 57.16 58.37
United States 57.57 61.33

Multiple levels can be specified at once

s.xs(('Australia','female'), level=['Country','Gender'])
BloodType
A+     59.36
A-     59.90
AB+    53.13
AB-    61.50
B+     56.62
B-     66.00
O+     58.46
O-     58.02
Name: Age, dtype: float64
df.loc[('Australia', 'female'), :]
BloodType
A+     59.36
A-     59.90
AB+    53.13
AB-    61.50
B+     56.62
B-     66.00
O+     58.46
O-     58.02
Name: (Australia, female), dtype: float64
s.xs(('Australia','O+'), level=['Country','BloodType'])
Gender
female    58.46
male      56.94
Name: Age, dtype: float64
df.loc['Australia', 'O+']
Gender
female    58.46
male      56.94
Name: O+, dtype: float64

Let’s visualise some of these results

# colormap from matplotlib
from matplotlib import cm

# Also see this stackoverflow post for the legend placement: 
# http://stackoverflow.com/questions/23556153/how-to-put-legend-outside-the-plot-with-pandas
ax = (s.xs('female', level='Gender')
       .unstack()
       .fillna(0)
       .plot(kind='bar', 
             figsize=(15,5), 
             width=0.8, 
             alpha=0.8, 
             ec='k',
             rot=0,
             colormap=cm.Paired_r)
     )
ax.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))
plt.show()

png

Notice the difference between the previous plot and the one below.

ax = (s.xs('female', level='Gender')
       .unstack('Country')
       .fillna(0)
       .plot(kind='bar', 
             figsize=(15,5), 
             width=0.8, 
             alpha=0.8, 
             ec='k',
             rot=0,
             colormap=cm.Paired_r)
     )
ax.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))
plt.show()

png

E. Fine Tuning.

So far we’ve applied an operation on a single feature of a GroupBy object.

For instance, we’ve computed the mean for the Age feature, and then, in a separate computation we’ve computed a count of the blood types.

You can actually do everything at once by specifying a different operation for each feature using a dictionary, and use the agg method to aggregate the data across the dataframe.

from collections import Counter

group = data.groupby(['Country', 'Gender'])

# custom aggregation: count the occurrences of each blood type within the group
def data_count(x):
    return Counter(x).items()

method = {'Age': [np.mean, np.std], 'BloodType': data_count}

group.agg(method).rename({'data_count':'group_counts'}, axis=1)
Age BloodType
mean std group_counts
Country Gender
Australia female 58.101883 19.587278 ((B+, 205), (O-, 41), (AB+, 54), (O+, 328), (A...
male 57.531773 19.394326 ((B+, 214), (AB+, 46), (A+, 239), (O+, 320), (...
France female 59.207469 19.041143 ((A-, 11), (B+, 47), (A+, 76), (O+, 79), (O-, ...
male 56.768340 19.961407 ((B+, 59), (A+, 69), (AB+, 9), (O+, 105), (O-,...
Iceland female 57.315217 19.527343 ((O+, 34), (A+, 22), (B-, 4), (B+, 17), (A-, 5...
male 57.953704 19.490896 ((O+, 52), (A+, 23), (AB+, 3), (B+, 27), (O-, ...
Italy female 57.300493 19.116494 ((A+, 102), (O+, 160), (B+, 89), (O-, 17), (AB...
male 55.893401 18.838508 ((A+, 105), (O+, 139), (B+, 96), (AB-, 4), (B-...
New Zealand female 55.823770 19.546233 ((O-, 9), (O+, 91), (A+, 75), (AB+, 8), (B+, 5...
male 57.398438 18.570202 ((A+, 79), (O+, 94), (AB+, 13), (B+, 47), (O-,...
United Kingdom female 57.284047 20.168651 ((A+, 65), (B+, 55), (O-, 13), (O+, 99), (AB+,...
male 58.222222 18.856132 ((A+, 67), (B+, 47), (O-, 6), (A-, 17), (O+, 8...
United States female 57.299720 19.167134 ((A+, 89), (B+, 77), (O+, 138), (AB+, 18), (A-...
male 57.536443 19.434197 ((A+, 68), (O+, 135), (B+, 80), (A-, 12), (AB+...

Accessing a particular portion of the result is just a matter of navigating the multilevel structure of the index and the columns.

group.agg(method).loc[[('Australia', 'male')]]
Age BloodType
mean std data_count
Country Gender
Australia male 57.531773 19.394326 ((B+, 214), (AB+, 46), (A+, 239), (O+, 320), (...

This is equivalent to

group.agg(method).xs(('Australia', 'male'), level=['Country', 'Gender'])
Age BloodType
mean std data_count
Country Gender
Australia male 57.531773 19.394326 ((B+, 214), (AB+, 46), (A+, 239), (O+, 320), (...

Notice the difference with

group.agg(method).xs(('Australia', 'male'))
Age        mean                                                    57.5318
           std                                                     19.3943
BloodType  data_count    ((B+, 214), (AB+, 46), (A+, 239), (O+, 320), (...
Name: (Australia, male), dtype: object

We can select data using levels of the columns.

group.agg(method).loc[:,[('Age', 'mean')]]
Age
mean
Country Gender
Australia female 58.101883
male 57.531773
France female 59.207469
male 56.768340
Iceland female 57.315217
male 57.953704
Italy female 57.300493
male 55.893401
New Zealand female 55.823770
male 57.398438
United Kingdom female 57.284047
male 58.222222
United States female 57.299720
male 57.536443

This is equivalent to

group.agg(method).xs([('Age', 'mean')], axis=1)
Age
mean
Country Gender
Australia female 58.101883
male 57.531773
France female 59.207469
male 56.768340
Iceland female 57.315217
male 57.953704
Italy female 57.300493
male 55.893401
New Zealand female 55.823770
male 57.398438
United Kingdom female 57.284047
male 58.222222
United States female 57.299720
male 57.536443

You can go as “deep” as you need.

group.agg(method).loc['Australia', 'female']['Age']['mean']
58.10188261351052

This is equivalent to

group.agg(method).xs(('Australia', 'female'))['Age']['mean']
58.10188261351052

To extract just the values (without the index), use the values attribute.

group.agg(method).loc['Australia', 'female']['BloodType'].values
array([dict_items([('B+', 205), ('O-', 41), ('AB+', 54), ('O+', 328), ('A+', 229), ('A-', 29), ('B-', 13), ('AB-', 4)])],
      dtype=object)

Cross-tabulation is an effective way to have a quick look at how the data is distributed across categories.

pd.crosstab(data.Gender, data.BloodType, margins=True)
BloodType A+ A- AB+ AB- B+ B- O+ O- All
Gender
female 658 83 128 10 543 33 929 116 2500
male 650 87 114 17 570 31 934 97 2500
All 1308 170 242 27 1113 64 1863 213 5000
pd.crosstab([data.Country, data.Gender], data.BloodType, margins=True)
BloodType A+ A- AB+ AB- B+ B- O+ O- All
Country Gender
Australia female 229 29 54 4 205 13 328 41 903
male 239 30 46 4 214 11 320 33 897
France female 76 11 10 0 47 2 79 16 241
male 69 6 9 1 59 2 105 8 259
Iceland female 22 5 2 0 17 4 34 8 92
male 23 1 3 0 27 1 52 1 108
Italy female 102 14 22 1 89 1 160 17 406
male 105 17 15 4 96 3 139 15 394
New Zealand female 75 5 8 1 53 2 91 9 244
male 79 4 13 4 47 3 94 12 256
United Kingdom female 65 6 14 1 55 4 99 13 257
male 67 17 11 3 47 3 89 6 243
United States female 89 13 18 3 77 7 138 12 357
male 68 12 17 1 80 8 135 22 343
All 1308 170 242 27 1113 64 1863 213 5000

F. GroupBy on continuous features.

We’ve seen how we can group the data according to the features that are in the original dataset.

For instance, grouping the data by country is trivial. Same for the blood type because these are categorical data.

But what about Age? If we do a groupby without thinking too much about it we get something rather useless.

g = data.groupby('Age')
g['BloodType'].describe().unstack()
       Age
count  24     37
       25     72
       26     73
       27     86
       28     69
       29     81
       30     71
       31     86
       32     73
       33     78
       34     54
       35     79
       36     70
       37     88
       38     70
       39     81
       40     62
       41     75
       42     66
       43     70
       44     67
       45     80
       46     85
       47     65
       48     77
       49     88
       50     77
       51     75
       52     83
       53     86
              ..
freq   62     28
       63     41
       64     26
       65     26
       66     38
       67     28
       68     35
       69     36
       70     25
       71     25
       72     39
       73     21
       74     32
       75     39
       76     21
       77     20
       78     33
       79     32
       80     30
       81     21
       82     34
       83     27
       84     25
       85     25
       86     28
       87     32
       88     23
       89     30
       90     24
       91     18
Length: 272, dtype: object

We get one entry for each year of Age which isn’t what we want.

Ideally, we’d want to split the data into age groups and use that to group the data. Pandas has a cut function that allows us to do that.

data.Age.describe()
count    5000.000000
mean       57.447600
std        19.339422
min        24.000000
25%        41.000000
50%        57.000000
75%        74.000000
max        91.000000
Name: Age, dtype: float64
bins = np.arange(20, 100, 10)
bins
array([20, 30, 40, 50, 60, 70, 80, 90])
labels = pd.cut(data.Age, bins)
labels[:5]
0    (50, 60]
1    (50, 60]
2    (50, 60]
3    (80, 90]
4    (70, 80]
Name: Age, dtype: category
Categories (7, interval[int64]): [(20, 30] < (30, 40] < (40, 50] < (50, 60] < (60, 70] < (70, 80] < (80, 90]]
data.Age.head()
0    51
1    57
2    55
3    86
4    73
Name: Age, dtype: int64

We can use the newly created label to partition the Age variable into intervals.

grouped = data.groupby(['Country', labels])

grouped.size().unstack().fillna(0)
Age             (20, 30]  (30, 40]  (40, 50]  (50, 60]  (60, 70]  (70, 80]  \
Country                                                                      
Australia            177       254       269       275       257       280   
France                48        76        68        71        75        76   
Iceland               16        32        36        24        30        26   
Italy                 74       131       120       134       128        96   
New Zealand           49        76        81        70        93        57   
United Kingdom        54        69        77        64        72        91   
United States         71       103        99       111       102       115   

Age             (80, 90]  
Country                   
Australia            272  
France                82  
Iceland               34  
Italy                111  
New Zealand           71  
United Kingdom        69  
United States         94  

If we want the Age groups to be our index we can use unstack(0) or, more explicitly (and therefore preferably), unstack('Country').

grouped.size().unstack('Country').fillna(0)
Country Australia France Iceland Italy New Zealand United Kingdom United States
Age
(20, 30] 177 48 16 74 49 54 71
(30, 40] 254 76 32 131 76 69 103
(40, 50] 269 68 36 120 81 77 99
(50, 60] 275 71 24 134 70 64 111
(60, 70] 257 75 30 128 93 72 102
(70, 80] 280 76 26 96 57 91 115
(80, 90] 272 82 34 111 71 69 94

Just like before, we can pass more features to the groupby method.

grouped = data.groupby(['Country', 'BloodType', labels])
grouped.size().unstack('BloodType').fillna(0)
BloodType A+ A- AB+ AB- B+ B- O+ O-
Country Age
Australia (20, 30] 41.0 11.0 13.0 0.0 49.0 3.0 53.0 7.0
(30, 40] 63.0 8.0 14.0 2.0 57.0 4.0 97.0 9.0
(40, 50] 73.0 9.0 16.0 2.0 56.0 2.0 100.0 11.0
(50, 60] 74.0 5.0 16.0 0.0 67.0 2.0 100.0 11.0
(60, 70] 60.0 7.0 10.0 0.0 55.0 8.0 106.0 11.0
(70, 80] 85.0 7.0 15.0 1.0 57.0 4.0 101.0 10.0
(80, 90] 65.0 11.0 16.0 3.0 77.0 1.0 84.0 15.0
France (20, 30] 17.0 1.0 1.0 0.0 9.0 0.0 19.0 1.0
(30, 40] 17.0 5.0 1.0 0.0 22.0 0.0 28.0 3.0
(40, 50] 17.0 3.0 5.0 1.0 12.0 0.0 27.0 3.0
(50, 60] 18.0 1.0 1.0 0.0 14.0 2.0 32.0 3.0
(60, 70] 23.0 1.0 3.0 0.0 18.0 1.0 25.0 4.0
(70, 80] 22.0 5.0 4.0 0.0 14.0 0.0 26.0 5.0
(80, 90] 30.0 1.0 4.0 0.0 17.0 1.0 24.0 5.0
Iceland (20, 30] 4.0 0.0 0.0 0.0 2.0 0.0 10.0 0.0
(30, 40] 7.0 2.0 1.0 0.0 5.0 1.0 15.0 1.0
(40, 50] 5.0 1.0 1.0 0.0 10.0 0.0 17.0 2.0
(50, 60] 8.0 0.0 1.0 0.0 3.0 0.0 11.0 1.0
(60, 70] 5.0 1.0 0.0 0.0 10.0 0.0 12.0 2.0
(70, 80] 7.0 0.0 1.0 0.0 6.0 3.0 9.0 0.0
(80, 90] 9.0 2.0 1.0 0.0 8.0 1.0 10.0 3.0
Italy (20, 30] 19.0 5.0 0.0 1.0 23.0 0.0 22.0 4.0
(30, 40] 39.0 8.0 4.0 1.0 29.0 0.0 45.0 5.0
(40, 50] 36.0 4.0 4.0 2.0 27.0 1.0 40.0 6.0
(50, 60] 27.0 2.0 5.0 0.0 32.0 1.0 62.0 5.0
(60, 70] 33.0 2.0 8.0 1.0 27.0 1.0 52.0 4.0
(70, 80] 33.0 4.0 6.0 0.0 22.0 0.0 29.0 2.0
(80, 90] 19.0 5.0 8.0 0.0 25.0 1.0 47.0 6.0
New Zealand (20, 30] 13.0 0.0 0.0 0.0 11.0 1.0 22.0 2.0
(30, 40] 19.0 1.0 4.0 1.0 17.0 1.0 29.0 4.0
(40, 50] 17.0 0.0 2.0 0.0 18.0 1.0 39.0 4.0
(50, 60] 24.0 2.0 4.0 1.0 8.0 1.0 29.0 1.0
(60, 70] 33.0 3.0 6.0 1.0 21.0 0.0 26.0 3.0
(70, 80] 24.0 1.0 3.0 0.0 10.0 1.0 17.0 1.0
(80, 90] 23.0 2.0 2.0 2.0 15.0 0.0 21.0 6.0
United Kingdom (20, 30] 12.0 2.0 2.0 0.0 14.0 1.0 21.0 2.0
(30, 40] 21.0 2.0 7.0 0.0 13.0 3.0 23.0 0.0
(40, 50] 22.0 2.0 5.0 0.0 14.0 0.0 31.0 3.0
(50, 60] 14.0 3.0 3.0 1.0 14.0 1.0 24.0 4.0
(60, 70] 15.0 7.0 4.0 2.0 11.0 0.0 31.0 2.0
(70, 80] 28.0 3.0 2.0 1.0 24.0 1.0 28.0 4.0
(80, 90] 19.0 4.0 2.0 0.0 10.0 1.0 29.0 4.0
United States (20, 30] 21.0 3.0 4.0 0.0 16.0 1.0 21.0 5.0
(30, 40] 21.0 1.0 6.0 0.0 26.0 1.0 41.0 7.0
(40, 50] 23.0 3.0 3.0 1.0 30.0 1.0 31.0 7.0
(50, 60] 27.0 6.0 7.0 1.0 25.0 1.0 40.0 4.0
(60, 70] 17.0 6.0 5.0 2.0 26.0 2.0 39.0 5.0
(70, 80] 22.0 1.0 4.0 0.0 18.0 5.0 61.0 4.0
(80, 90] 24.0 5.0 6.0 0.0 15.0 3.0 39.0 2.0

The End…

Unicode strings in Python

1. Unicode: an introduction.

Computers only understand one thing: binary code. For instance, when you type the letter “A” the computer must represent this as a sequence of 0s and 1s.

Therefore we need a rule (code) for converting between characters and sequences of 0s and 1s. A basic way to do this is provided by the ASCII code.

ASCII proper is a 7-bit code that encodes 128 characters (codes 0 to 127); stored in 1 byte (8 bits), its extended versions can represent up to 256 characters, corresponding to a binary number between 0 and $255 = 2^7+2^6+2^5+2^4+2^3+2^2+2^1+2^0$.

These can be split into 3 groups (see the link above). The usual printable characters used in English are those whose ASCII codes are between 32 and 126.

For instance the letter “A” corresponds to the number 65 which is treated by the computer as the binary sequence $1000001$.

In Python, the ASCII code of a given letter can be obtained using the ord function. It can be converted to binary using bin, and to hexadecimal (base 16) code using hex.

Conversely, the character corresponding to a given ASCII code can be obtained using the chr function.

from IPython.core.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))


print("Character: A")
print("ASCII code:", ord('A'))
print("Binary code:", bin(ord('A')))
print("Hexadecimal code:", hex(ord('A')))
print("chr(65):", chr(65))
Character: A
ASCII code: 65
Binary code: 0b1000001
Hexadecimal code: 0x41
chr(65): A

The result above tells us that the binary code for A (starting with 0b) corresponds to the decimal number $2^6+2^0=64+1=65$.

Similarly, the hexadecimal code (starting with 0x) tells us that A corresponds to $4\times 16^1+1\times 16^0=64+1=65$ in old money (base 10).
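
As a quick sanity check, we can let Python do these base conversions for us by passing an explicit base to the int function.

# convert the binary and hexadecimal representations back to decimal
print(int('1000001', 2), int('41', 16))
65 65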

Let’s have a look at all the characters whose ASCII code is between 32 and 126.

print([chr(i) for i in range(32, 127)])
[' ', '!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '<', '=', '>', '?', '@', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '[', '\\', ']', '^', '_', '`', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '{', '|', '}', '~']

Because the ASCII code uses (at most) 1 byte, it cannot represent more than 256 characters. This is sufficient for writing English text, but it's not enough to encode other symbols such as non-English alphabets, accented characters, and more esoteric characters... Hence, another type of character encoding is needed.

Modern character encoding is built on Unicode which contains (at this time) more than 130,000 characters. Note that not all web browsers/fonts are able to display all these characters. In Unicode each character is represented by a code point. When stored/processed by a computer the character is encoded on 1 or more bytes.

⚠ Beware!

We mustn't confuse the code point and the byte encoding associated with any given character.

Think of the code point as an identification number for a character (kind of like your social security number). This code point is distinct from the binary representation of said character in terms of 1s and 0s.

How do we connect the two?

That's what encodings are for. An encoding provides a convention for passing from code point to bytes.

There are several encodings of Unicode characters. The main one for our purpose is called UTF-8. It uses a variable length encoding (1, 2, 3 or 4 bytes).

Unicode has the advantage of being a superset of ASCII. In other words, all the ASCII codes are also valid Unicode codes.
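
To see both points in action, here is a quick sketch: a plain ASCII character keeps its single byte under UTF-8, while other characters need 2, 3 or 4 bytes.

# number of bytes used by UTF-8 for a few characters
for ch in ('A', 'ß', '€', '🐍'):
    print(ch, len(ch.encode('utf-8')), 'byte(s)')
A 1 byte(s)
ß 2 byte(s)
€ 3 byte(s)
🐍 4 byte(s)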

Here's an example. The euro sign € has code U+20AC. The U+ tells us it's a unicode code point and the 20AC is a hexadecimal number (base 16) which corresponds to 8364 in base 10.

In Python 3 strings are by default unicode, so the type str holds for any valid unicode character, not just those that can be encoded using ASCII.

We can represent code points in Python using a special string starting with \u.

euro = '\u20AC'
print("Character:", euro)
print("Decimal code point:", ord(euro))
print("Hexadecimal code point:", hex(ord(euro)))
print("Binary code point:", bin(ord(euro)))
print("Type:", type(euro)) 
print("Character length:", len(euro))
Character: €
Decimal code point: 8364
Hexadecimal code point: 0x20ac
Binary code point: 0b10000010101100
Type: <class 'str'>
Character length: 1

Notice that we have a string of type str and its length is 1 character.

Notice also that, in this case, the Unicode code point used by Python is actually based on the hexadecimal code associated with the character.

The hexadecimal value 20ac corresponds to the decimal value $2\times 16^3+0\times 16^2+10\times 16^1+12\times 16^0=8192+160+12=8364$.
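
Once again, int can check this arithmetic for us.

int('20ac', 16)
8364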

If we encode the string (character really) using UTF-8, we get its byte representation, which has type bytes and is encoded over 3 bytes.

euroUTF8 = euro.encode('utf-8')
print("UTF-8 encoded character:", euroUTF8)
print("UTF-8 encoded type:", type(euroUTF8)) 
print("UTF-8 encoded character length:", len(euroUTF8))
UTF-8 encoded character: b'\xe2\x82\xac'
UTF-8 encoded type: <class 'bytes'>
UTF-8 encoded character length: 3

Contrast this with:

A_UTF8 = 'A'.encode('utf-8')
print("UTF-8 encoded character:", A_UTF8)
print("UTF-8 encoded type:", type(A_UTF8)) 
print("UTF-8 encoded character length:", len(A_UTF8))
UTF-8 encoded character: b'A'
UTF-8 encoded type: <class 'bytes'>
UTF-8 encoded character length: 1

As long as we don’t confuse unicode and bytes, and get our encodings right, we and Python strings will live happily for ever after.

One way to think about this is that we encode a string to pass it to the computer, and we decode it to show it to a human.

print("Unicode character:", euro)
print("UTF-8 encoded character:", euroUTF8)
print("Decoded bytes:", euroUTF8.decode('utf-8'))
Unicode character: €
UTF-8 encoded character: b'\xe2\x82\xac'
Decoded bytes: €

Note that the default encoding on many systems (at least in the English speaking world) used to be ASCII, but UTF-8 has become the norm.

(There are some additional things happening with the print function, but we won’t get into it here.)
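
If you're curious about your own setup, here's a quick way to check it (the exact results depend on your platform and locale).

import sys
import locale

print(sys.getdefaultencoding())       # Python 3's default str encoding
print(locale.getpreferredencoding())  # default encoding used by open()
print(sys.stdout.encoding)            # encoding used when print writes to the console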

Note that ord and chr work with Unicode.

ord('€')
8364
chr(8364)
'€'

You can also refer to unicode characters by their name in Python, using the special ‘\N{NAME}’ syntax.

'\N{EURO SIGN}'
'€'

Most of the agonising pain associated with Unicode in Python has to do with not specifying (or knowing) whether we’re dealing with unicode or bytes.
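
Here's a small illustration of that pain: mixing str and bytes fails outright, while decoding bytes with the wrong encoding quietly produces gibberish (mojibake). This is just a sketch of the typical failure modes.

# str + bytes raises a TypeError
try:
    'é' + 'é'.encode('utf-8')
except TypeError as err:
    print(err)

# decoding UTF-8 bytes as latin-1 "works" but gives the wrong characters
print('é'.encode('utf-8').decode('latin-1'))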

Once that’s clear, your Python code can handle non-ASCII characters. Let’s start with something easy.

German

s = 'Straße'
sUTF8 = s.encode('utf-8')
s, sUTF8
('Straße', b'Stra\xc3\x9fe')
print("Length of s :", len(s))
print("Length of encoded s:", len(sUTF8))
Length of s : 6
Length of encoded s: 7
print("Unicode :", [c for c in s])
Unicode : ['S', 't', 'r', 'a', 'ß', 'e']
print("UTF-8   :",[c for c in sUTF8])
UTF-8   : [83, 116, 114, 97, 195, 159, 101]

Remark:

Notice how the character ß is represented by the 2 bytes \xc3\x9f. However, 223 is the code point for the German Eszett, so is there a problem?

chr(223)
'ß'

In base 16 (hexadecimal), df corresponds to the number 223.

int('df', 16)
223

We can specify a hexadecimal code in a string using the \x prefix.

'\xdf'
'ß'

So if ‘\xdf’ represents our unicode character, what does “\xc3\x9f” represent?

It represents the encoding of our unicode character over 2 bytes using UTF-8.

We can decode the corresponding byte string and get the unicode character to print to the notebook.

b"\xc3\x9f".decode('utf-8')
'ß'

In a different encoding (e.g. latin-1), the same letter would be encoded as \xdf.

s.encode('latin-1')
b'Stra\xdfe'

French

s = u"Les cons, ça ose tout. C'est même à ça qu'on les reconnaît."
sUTF8 = s.encode('utf-8')
print(s)
print(sUTF8)
Les cons, ça ose tout. C'est même à ça qu'on les reconnaît.
b"Les cons, \xc3\xa7a ose tout. C'est m\xc3\xaame \xc3\xa0 \xc3\xa7a qu'on les reconna\xc3\xaet."
sLATIN1 = s.encode('latin-1')
print(s) 
print(sLATIN1)
Les cons, ça ose tout. C'est même à ça qu'on les reconnaît.
b"Les cons, \xe7a ose tout. C'est m\xeame \xe0 \xe7a qu'on les reconna\xeet."
print("Length of s :", len(s), "\t",  
      "Length of UTF-8 encoded s:", len(sUTF8), "\t",  
      "Length of LATIN-1 encoded s:", len(sLATIN1),
      "\n")

print("Unicode :", [c for c in s], "\n")

print("UTF-8   :",[c for c in sUTF8], "\n")

print("Latin-1 :",[c for c in sLATIN1], "\n")
Length of s : 59     Length of UTF-8 encoded s: 64   Length of LATIN-1 encoded s: 59 

Unicode : ['L', 'e', 's', ' ', 'c', 'o', 'n', 's', ',', ' ', 'ç', 'a', ' ', 'o', 's', 'e', ' ', 't', 'o', 'u', 't', '.', ' ', 'C', "'", 'e', 's', 't', ' ', 'm', 'ê', 'm', 'e', ' ', 'à', ' ', 'ç', 'a', ' ', 'q', 'u', "'", 'o', 'n', ' ', 'l', 'e', 's', ' ', 'r', 'e', 'c', 'o', 'n', 'n', 'a', 'î', 't', '.'] 

UTF-8   : [76, 101, 115, 32, 99, 111, 110, 115, 44, 32, 195, 167, 97, 32, 111, 115, 101, 32, 116, 111, 117, 116, 46, 32, 67, 39, 101, 115, 116, 32, 109, 195, 170, 109, 101, 32, 195, 160, 32, 195, 167, 97, 32, 113, 117, 39, 111, 110, 32, 108, 101, 115, 32, 114, 101, 99, 111, 110, 110, 97, 195, 174, 116, 46] 

Latin-1 : [76, 101, 115, 32, 99, 111, 110, 115, 44, 32, 231, 97, 32, 111, 115, 101, 32, 116, 111, 117, 116, 46, 32, 67, 39, 101, 115, 116, 32, 109, 234, 109, 101, 32, 224, 32, 231, 97, 32, 113, 117, 39, 111, 110, 32, 108, 101, 115, 32, 114, 101, 99, 111, 110, 110, 97, 238, 116, 46] 

This is fine, but these languages use the usual Latin alphabet, with the exception of a few special characters such as accented letters.

What about languages that use a completely different alphabet? Let’s find out.

Greek

(source: http://sacred-texts.com/cla/homer/greek/ili01.htm)

# unicode string
iliad =  "μῆνιν ἄειδε θεὰ Πηληϊάδεω Ἀχιλῆος"
print(iliad)
print(type(iliad))
print(len(iliad))
μῆνιν ἄειδε θεὰ Πηληϊάδεω Ἀχιλῆος
<class 'str'>
34
# byte string
iliad = "μῆνιν ἄειδε θεὰ Πηληϊάδεω Ἀχιλῆος".encode('utf-8')
print(iliad)
print(type(iliad))
print(len(iliad))
b'\xce\xbc\xe1\xbf\x86\xce\xbd\xce\xb9\xce\xbd \xe1\xbc\x84\xce\xb5\xce\xb9\xce\xb4\xce\xb5 \xce\xb8\xce\xb5\xe1\xbd\xb0 \xce\xa0\xce\xb7\xce\xbb\xce\xb7\xce\xb9\xcc\x88\xe1\xbd\xb1\xce\xb4\xce\xb5\xcf\x89 \xe1\xbc\x88\xcf\x87\xce\xb9\xce\xbb\xe1\xbf\x86\xce\xbf\xcf\x82'
<class 'bytes'>
70


⚠ Stop and think!

So how many characters are there in the string? 70 or 34? How would you list the individual characters within the string?

The answer is 34. These are the actual characters that compose the string.

The encoded version will have a different number of bytes depending on the encoding.

# unicode string
iliad =  "μῆνιν ἄειδε θεὰ Πηληϊάδεω Ἀχιλῆος"

L = [c for c in iliad]
for i, c in enumerate(L):
    print(i+1, "\t", c)
1    μ
2    ῆ
3    ν
4    ι
5    ν
6     
7    ἄ
8    ε
9    ι
10   δ
11   ε
12    
13   θ
14   ε
15   ὰ
16    
17   Π
18   η
19   λ
20   η
21   ι
22   ̈
23   ά
24   δ
25   ε
26   ω
27    
28   Ἀ
29   χ
30   ι
31   λ
32   ῆ
33   ο
34   ς
# byte string
iliad_encoded =  "μῆνιν ἄειδε θεὰ Πηληϊάδεω Ἀχιλῆος".encode('utf-8')

L = [c for c in iliad_encoded]
for i, c in enumerate(L):
    print(i+1, "\t", c)
1    206
2    188
3    225
4    191
5    134
6    206
7    189
8    206
9    185
10   206
11   189
12   32
13   225
14   188
15   132
16   206
17   181
18   206
19   185
20   206
21   180
22   206
23   181
24   32
25   206
26   184
27   206
28   181
29   225
30   189
31   176
32   32
33   206
34   160
35   206
36   183
37   206
38   187
39   206
40   183
41   206
42   185
43   204
44   136
45   225
46   189
47   177
48   206
49   180
50   206
51   181
52   207
53   137
54   32
55   225
56   188
57   136
58   207
59   135
60   206
61   185
62   206
63   187
64   225
65   191
66   134
67   206
68   191
69   207
70   130

What’s going on?

We have a sequence of bytes and we are outputting each individual byte value.

Python has no way of knowing what characters these bytes represent, hence we see a bunch of numbers.
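
To get the characters back, we have to tell Python which encoding was used to produce those bytes.

decoded = iliad_encoded.decode('utf-8')
print(type(decoded), len(decoded))
<class 'str'> 34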

The same applies to any other language.

Hebrew

(source: http://sacred-texts.com/bib/tan/gen001.htm)

# unicode string
gen1 = "בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ"
print(gen1) 
print(type(gen1))
print(len(gen1))
בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ
<class 'str'>
64
# byte string
gen1_UTF8 = "בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ".encode('utf-8')
print(gen1_UTF8) 
print(type(gen1_UTF8))
print(len(gen1_UTF8))
b'\xd7\x91\xd6\xbc\xd6\xb0\xd7\xa8\xd6\xb5\xd7\x90\xd7\xa9\xd7\x81\xd6\xb4\xd6\x96\xd7\x99\xd7\xaa \xd7\x91\xd6\xbc\xd6\xb8\xd7\xa8\xd6\xb8\xd6\xa3\xd7\x90 \xd7\x90\xd6\xb1\xd7\x9c\xd6\xb9\xd7\x94\xd6\xb4\xd6\x91\xd7\x99\xd7\x9d \xd7\x90\xd6\xb5\xd6\xa5\xd7\xaa \xd7\x94\xd6\xb7\xd7\xa9\xd7\x81\xd6\xbc\xd6\xb8\xd7\x9e\xd6\xb7\xd6\x96\xd7\x99\xd6\xb4\xd7\x9d \xd7\x95\xd6\xb0\xd7\x90\xd6\xb5\xd6\xa5\xd7\xaa \xd7\x94\xd6\xb8\xd7\x90\xd6\xb8\xd6\xbd\xd7\xa8\xd6\xb6\xd7\xa5'
<class 'bytes'>
122

Chinese

(source: http://ctext.org/art-of-war)

# unicode string
sun = "孫子曰:兵者,國之大事,死生之地,存亡之道,不可不察也"
print(sun) 
print(type(sun))
print(len(sun))
孫子曰:兵者,國之大事,死生之地,存亡之道,不可不察也
<class 'str'>
27
# byte string
sun_UTF8 = u"孫子曰:兵者,國之大事,死生之地,存亡之道,不可不察也".encode('utf-8')
print(sun_UTF8) 
print(type(sun_UTF8))
print(len(sun_UTF8))
b'\xe5\xad\xab\xe5\xad\x90\xe6\x9b\xb0\xef\xbc\x9a\xe5\x85\xb5\xe8\x80\x85\xef\xbc\x8c\xe5\x9c\x8b\xe4\xb9\x8b\xe5\xa4\xa7\xe4\xba\x8b\xef\xbc\x8c\xe6\xad\xbb\xe7\x94\x9f\xe4\xb9\x8b\xe5\x9c\xb0\xef\xbc\x8c\xe5\xad\x98\xe4\xba\xa1\xe4\xb9\x8b\xe9\x81\x93\xef\xbc\x8c\xe4\xb8\x8d\xe5\x8f\xaf\xe4\xb8\x8d\xe5\xaf\x9f\xe4\xb9\x9f'
<class 'bytes'>
81

Other characters

Unicode is not restricted to what we may traditionally think of as human languages. It also includes many other characters.

Let’s create a helper function to print a large number of unicode characters in a nice way.

def print_unicode(start, stop, width=30):
    for code in range(start, stop):
        end = '\n' if (code-start)%width==0 else ' '
        print(chr(code), end=end)
print_unicode(128000, 128300)
🐀
🐁 🐂 🐃 🐄 🐅 🐆 🐇 🐈 🐉 🐊 🐋 🐌 🐍 🐎 🐏 🐐 🐑 🐒 🐓 🐔 🐕 🐖 🐗 🐘 🐙 🐚 🐛 🐜 🐝 🐞
🐟 🐠 🐡 🐢 🐣 🐤 🐥 🐦 🐧 🐨 🐩 🐪 🐫 🐬 🐭 🐮 🐯 🐰 🐱 🐲 🐳 🐴 🐵 🐶 🐷 🐸 🐹 🐺 🐻 🐼
🐽 🐾 🐿 👀 👁 👂 👃 👄 👅 👆 👇 👈 👉 👊 👋 👌 👍 👎 👏 👐 👑 👒 👓 👔 👕 👖 👗 👘 👙 👚
👛 👜 👝 👞 👟 👠 👡 👢 👣 👤 👥 👦 👧 👨 👩 👪 👫 👬 👭 👮 👯 👰 👱 👲 👳 👴 👵 👶 👷 👸
👹 👺 👻 👼 👽 👾 👿 💀 💁 💂 💃 💄 💅 💆 💇 💈 💉 💊 💋 💌 💍 💎 💏 💐 💑 💒 💓 💔 💕 💖
💗 💘 💙 💚 💛 💜 💝 💞 💟 💠 💡 💢 💣 💤 💥 💦 💧 💨 💩 💪 💫 💬 💭 💮 💯 💰 💱 💲 💳 💴
💵 💶 💷 💸 💹 💺 💻 💼 💽 💾 💿 📀 📁 📂 📃 📄 📅 📆 📇 📈 📉 📊 📋 📌 📍 📎 📏 📐 📑 📒
📓 📔 📕 📖 📗 📘 📙 📚 📛 📜 📝 📞 📟 📠 📡 📢 📣 📤 📥 📦 📧 📨 📩 📪 📫 📬 📭 📮 📯 📰
📱 📲 📳 📴 📵 📶 📷 📸 📹 📺 📻 📼 📽 📾 📿 🔀 🔁 🔂 🔃 🔄 🔅 🔆 🔇 🔈 🔉 🔊 🔋 🔌 🔍 🔎
🔏 🔐 🔑 🔒 🔓 🔔 🔕 🔖 🔗 🔘 🔙 🔚 🔛 🔜 🔝 🔞 🔟 🔠 🔡 🔢 🔣 🔤 🔥 🔦 🔧 🔨 🔩 🔪 🔫 

Remark:

These are characters, not images. You can write them to a file on your computer's hard drive, and read them back in.
# UTF-8 is used by default on my machine
with open('unicode.txt', 'w') as f:
    for code in range(128000, 128300):
        f.write(chr(code))
!cat unicode.txt
🐀🐁🐂🐃🐄🐅🐆🐇🐈🐉🐊🐋🐌🐍🐎🐏🐐🐑🐒🐓🐔🐕🐖🐗🐘🐙🐚🐛🐜🐝🐞🐟🐠🐡🐢🐣🐤🐥🐦🐧🐨🐩🐪🐫🐬🐭🐮🐯🐰🐱🐲🐳🐴🐵🐶🐷🐸🐹🐺🐻🐼🐽🐾🐿👀👁👂👃👄👅👆👇👈👉👊👋👌👍👎👏👐👑👒👓👔👕👖👗👘👙👚👛👜👝👞👟👠👡👢👣👤👥👦👧👨👩👪👫👬👭👮👯👰👱👲👳👴👵👶👷👸👹👺👻👼👽👾👿💀💁💂💃💄💅💆💇💈💉💊💋💌💍💎💏💐💑💒💓💔💕💖💗💘💙💚💛💜💝💞💟💠💡💢💣💤💥💦💧💨💩💪💫💬💭💮💯💰💱💲💳💴💵💶💷💸💹💺💻💼💽💾💿📀📁📂📃📄📅📆📇📈📉📊📋📌📍📎📏📐📑📒📓📔📕📖📗📘📙📚📛📜📝📞📟📠📡📢📣📤📥📦📧📨📩📪📫📬📭📮📯📰📱📲📳📴📵📶📷📸📹📺📻📼📽📾📿🔀🔁🔂🔃🔄🔅🔆🔇🔈🔉🔊🔋🔌🔍🔎🔏🔐🔑🔒🔓🔔🔕🔖🔗🔘🔙🔚🔛🔜🔝🔞🔟🔠🔡🔢🔣🔤🔥🔦🔧🔨🔩🔪🔫

Remark:

We can also get Python to print out the names of the characters.
import unicodedata
for c in range(128000, 128100):
    print(chr(c), unicodedata.name(chr(c)))
🐀 RAT
🐁 MOUSE
🐂 OX
🐃 WATER BUFFALO
🐄 COW
🐅 TIGER
🐆 LEOPARD
🐇 RABBIT
🐈 CAT
🐉 DRAGON
🐊 CROCODILE
🐋 WHALE
🐌 SNAIL
🐍 SNAKE
🐎 HORSE
🐏 RAM
🐐 GOAT
🐑 SHEEP
🐒 MONKEY
🐓 ROOSTER
🐔 CHICKEN
🐕 DOG
🐖 PIG
🐗 BOAR
🐘 ELEPHANT
🐙 OCTOPUS
🐚 SPIRAL SHELL
🐛 BUG
🐜 ANT
🐝 HONEYBEE
🐞 LADY BEETLE
🐟 FISH
🐠 TROPICAL FISH
🐡 BLOWFISH
🐢 TURTLE
🐣 HATCHING CHICK
🐤 BABY CHICK
🐥 FRONT-FACING BABY CHICK
🐦 BIRD
🐧 PENGUIN
🐨 KOALA
🐩 POODLE
🐪 DROMEDARY CAMEL
🐫 BACTRIAN CAMEL
🐬 DOLPHIN
🐭 MOUSE FACE
🐮 COW FACE
🐯 TIGER FACE
🐰 RABBIT FACE
🐱 CAT FACE
🐲 DRAGON FACE
🐳 SPOUTING WHALE
🐴 HORSE FACE
🐵 MONKEY FACE
🐶 DOG FACE
🐷 PIG FACE
🐸 FROG FACE
🐹 HAMSTER FACE
🐺 WOLF FACE
🐻 BEAR FACE
🐼 PANDA FACE
🐽 PIG NOSE
🐾 PAW PRINTS
🐿 CHIPMUNK
👀 EYES
👁 EYE
👂 EAR
👃 NOSE
👄 MOUTH
👅 TONGUE
👆 WHITE UP POINTING BACKHAND INDEX
👇 WHITE DOWN POINTING BACKHAND INDEX
👈 WHITE LEFT POINTING BACKHAND INDEX
👉 WHITE RIGHT POINTING BACKHAND INDEX
👊 FISTED HAND SIGN
👋 WAVING HAND SIGN
👌 OK HAND SIGN
👍 THUMBS UP SIGN
👎 THUMBS DOWN SIGN
👏 CLAPPING HANDS SIGN
👐 OPEN HANDS SIGN
👑 CROWN
👒 WOMANS HAT
👓 EYEGLASSES
👔 NECKTIE
👕 T-SHIRT
👖 JEANS
👗 DRESS
👘 KIMONO
👙 BIKINI
👚 WOMANS CLOTHES
👛 PURSE
👜 HANDBAG
👝 POUCH
👞 MANS SHOE
👟 ATHLETIC SHOE
👠 HIGH-HEELED SHOE
👡 WOMANS SANDAL
👢 WOMANS BOOTS
👣 FOOTPRINTS

Similarly, we can find out the names of the characters we used previously.

for c in iliad:
    print(c, unicodedata.name(c))
μ GREEK SMALL LETTER MU
ῆ GREEK SMALL LETTER ETA WITH PERISPOMENI
ν GREEK SMALL LETTER NU
ι GREEK SMALL LETTER IOTA
ν GREEK SMALL LETTER NU
  SPACE
ἄ GREEK SMALL LETTER ALPHA WITH PSILI AND OXIA
ε GREEK SMALL LETTER EPSILON
ι GREEK SMALL LETTER IOTA
δ GREEK SMALL LETTER DELTA
ε GREEK SMALL LETTER EPSILON
  SPACE
θ GREEK SMALL LETTER THETA
ε GREEK SMALL LETTER EPSILON
ὰ GREEK SMALL LETTER ALPHA WITH VARIA
  SPACE
Π GREEK CAPITAL LETTER PI
η GREEK SMALL LETTER ETA
λ GREEK SMALL LETTER LAMDA
η GREEK SMALL LETTER ETA
ι GREEK SMALL LETTER IOTA
̈ COMBINING DIAERESIS
ά GREEK SMALL LETTER ALPHA WITH OXIA
δ GREEK SMALL LETTER DELTA
ε GREEK SMALL LETTER EPSILON
ω GREEK SMALL LETTER OMEGA
  SPACE
Ἀ GREEK CAPITAL LETTER ALPHA WITH PSILI
χ GREEK SMALL LETTER CHI
ι GREEK SMALL LETTER IOTA
λ GREEK SMALL LETTER LAMDA
ῆ GREEK SMALL LETTER ETA WITH PERISPOMENI
ο GREEK SMALL LETTER OMICRON
ς GREEK SMALL LETTER FINAL SIGMA
for c in gen1:
    print(c, unicodedata.name(c))
ב HEBREW LETTER BET
ּ HEBREW POINT DAGESH OR MAPIQ
ְ HEBREW POINT SHEVA
ר HEBREW LETTER RESH
ֵ HEBREW POINT TSERE
א HEBREW LETTER ALEF
ש HEBREW LETTER SHIN
ׁ HEBREW POINT SHIN DOT
ִ HEBREW POINT HIRIQ
֖ HEBREW ACCENT TIPEHA
י HEBREW LETTER YOD
ת HEBREW LETTER TAV
  SPACE
ב HEBREW LETTER BET
ּ HEBREW POINT DAGESH OR MAPIQ
ָ HEBREW POINT QAMATS
ר HEBREW LETTER RESH
ָ HEBREW POINT QAMATS
֣ HEBREW ACCENT MUNAH
א HEBREW LETTER ALEF
  SPACE
א HEBREW LETTER ALEF
ֱ HEBREW POINT HATAF SEGOL
ל HEBREW LETTER LAMED
ֹ HEBREW POINT HOLAM
ה HEBREW LETTER HE
ִ HEBREW POINT HIRIQ
֑ HEBREW ACCENT ETNAHTA
י HEBREW LETTER YOD
ם HEBREW LETTER FINAL MEM
  SPACE
א HEBREW LETTER ALEF
ֵ HEBREW POINT TSERE
֥ HEBREW ACCENT MERKHA
ת HEBREW LETTER TAV
  SPACE
ה HEBREW LETTER HE
ַ HEBREW POINT PATAH
ש HEBREW LETTER SHIN
ׁ HEBREW POINT SHIN DOT
ּ HEBREW POINT DAGESH OR MAPIQ
ָ HEBREW POINT QAMATS
מ HEBREW LETTER MEM
ַ HEBREW POINT PATAH
֖ HEBREW ACCENT TIPEHA
י HEBREW LETTER YOD
ִ HEBREW POINT HIRIQ
ם HEBREW LETTER FINAL MEM
  SPACE
ו HEBREW LETTER VAV
ְ HEBREW POINT SHEVA
א HEBREW LETTER ALEF
ֵ HEBREW POINT TSERE
֥ HEBREW ACCENT MERKHA
ת HEBREW LETTER TAV
  SPACE
ה HEBREW LETTER HE
ָ HEBREW POINT QAMATS
א HEBREW LETTER ALEF
ָ HEBREW POINT QAMATS
ֽ HEBREW POINT METEG
ר HEBREW LETTER RESH
ֶ HEBREW POINT SEGOL
ץ HEBREW LETTER FINAL TSADI
for c in sun:
    print(c, unicodedata.name(c))
孫 CJK UNIFIED IDEOGRAPH-5B6B
子 CJK UNIFIED IDEOGRAPH-5B50
曰 CJK UNIFIED IDEOGRAPH-66F0
: FULLWIDTH COLON
兵 CJK UNIFIED IDEOGRAPH-5175
者 CJK UNIFIED IDEOGRAPH-8005
, FULLWIDTH COMMA
國 CJK UNIFIED IDEOGRAPH-570B
之 CJK UNIFIED IDEOGRAPH-4E4B
大 CJK UNIFIED IDEOGRAPH-5927
事 CJK UNIFIED IDEOGRAPH-4E8B
, FULLWIDTH COMMA
死 CJK UNIFIED IDEOGRAPH-6B7B
生 CJK UNIFIED IDEOGRAPH-751F
之 CJK UNIFIED IDEOGRAPH-4E4B
地 CJK UNIFIED IDEOGRAPH-5730
, FULLWIDTH COMMA
存 CJK UNIFIED IDEOGRAPH-5B58
亡 CJK UNIFIED IDEOGRAPH-4EA1
之 CJK UNIFIED IDEOGRAPH-4E4B
道 CJK UNIFIED IDEOGRAPH-9053
, FULLWIDTH COMMA
不 CJK UNIFIED IDEOGRAPH-4E0D
可 CJK UNIFIED IDEOGRAPH-53EF
不 CJK UNIFIED IDEOGRAPH-4E0D
察 CJK UNIFIED IDEOGRAPH-5BDF
也 CJK UNIFIED IDEOGRAPH-4E5F

The End.

That ain't dancing Sally!

Same difference.

Congratulations! Your simulation code from the previous post has impressed all and sundry and you've been asked to teach introductory statistics to first year students at the prestigious Vandelay University.

You've got 2 stats classes, one with a group of 65 students, and another with a group of 35 students. We assume that the students have been randomly allocated to each group, and that they share the same final exam.

The difference between the groups is the teaching technique employed. With group1 you teach a standard stats course, while with group2 you sometimes use interpretive dance instead of equations to explain statistical concepts.


Let's jump right in.

from math import *
import numpy as np

# suppress warning messages.
import warnings
warnings.filterwarnings('ignore')

# import the scipy module which contains scientific and stats functions.
import scipy.stats as stats

# usual plotting stuff.
import matplotlib.pyplot as plt
from matplotlib import cm
%matplotlib inline

# set the matplotlib style 
plt.style.use("seaborn-darkgrid")

from IPython.core.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

Here are the marks (out of 100) in the final exam for these two groups.

group1 = np.array([57, 65, 54, 53, 81, 75, 61, 63, 68, 73, 71, 45, 63, 62, 74, 69, 37,
       55, 78, 55, 54, 65, 95, 48, 53, 61, 63, 65, 60, 74, 52, 50, 33, 39,
       100, 66, 46, 61, 99, 63, 34, 57, 43, 49, 45, 63, 64, 51, 63, 74, 60,
       62, 59, 71, 74, 49, 42, 60, 68, 52, 69, 59, 60, 67, 69])

group2 = np.array([40, 50, 25, 43, 37, 36, 62, 58, 58, 62, 45, 44, 40, 43, 53, 51, 38,
       59, 55, 62, 71, 40, 60, 49, 61, 53, 74, 49, 73, 51, 59, 52, 98, 57,
       52])
print(group1.size, group2.size)
65 35

When you compare the marks between the two groups you notice that, on average, the marks in group2 are lower than those in group1.

diff = group1.mean() - group2.mean()
print(diff)
7.934065934065934

We can visualise the distribution of marks in both groups using a box plot.

fig = plt.figure(figsize=(8, 6))
plt.boxplot((group1, group2))
plt.xlabel("group", fontsize=20)
plt.xticks(fontsize=14)
plt.ylabel("mark", fontsize=20)
plt.yticks(fontsize=14)
plt.show()

png

Is this difference in means statistically significant (i.e., should you drop the interpretive dance routine)?

Statistics should be able to provide us with an answer. Thankfully, some of my best friends are statisticians. I found one of those magnificent statistical creatures, and I asked them about differences between means of two groups. They mumbled something about beer, students, and tea test or something… I couldn’t tell what they were saying so I went to scipy instead and it turns out there’s an app for that.

The t-test is a statistical test that can tell us whether the difference in means between our two groups is statistically significant.

First, let us observe that the two groups have similar variances (this allows us to run a particular flavour of the test).

group1.var(), group2.var()
(179.0248520710059, 175.55102040816323)

Close enough for us. Let’s run the test.

t, p = stats.ttest_ind(group1, group2, equal_var=True)

print(f'Probability that a difference at least as extreme as {diff:0.2f} is due to chance (t test): {p*100:.2f}%')
Probability that a difference at least as extreme as 7.93 is due to chance (t test): 0.60%
But what does it all mean?

To the simulation!

We are trying to test whether there is a genuine (statistically significant) difference between the two groups.

One way that we can test this is to estimate how likely we are to observe a difference between the means of the two groups of at least 7.93, if we assume that there’s no difference in marks between the two groups (null hypothesis).

We can accomplish that by pooling all the values together and randomly shuffling (assigning) 65 of these values to group1 and the rest to group2.

Using this shuffling scheme, we will get an average difference between the two groups around zero, and the spread of the values we get will tell us how extreme (i.e., unlikely) a value of 7.93 or larger would be under the null hypothesis.

Let’s randomly shuffle the data across the two groups and compute the difference in means 100,000 times. (this may take a moment)

N = 100000

np.random.seed(12345)

# Let's pool the marks together
a = np.concatenate((group1, group2))

i = group1.size

L = []
for _ in range(N):
    # shuffle the data using random permutation (most useless code comment ever!)
    shuffle = np.random.permutation(a)
    
    # split the shuffled data into 2 groups
    grp1, grp2 = shuffle[:i], shuffle[i:]
    
    # compute the difference in means
    L.append(np.mean(grp1) - np.mean(grp2))
    
L = np.array(L)

Let’s plot a histogram of the results.

plt.figure(figsize=(12, 6))
plt.hist(L, bins=50, density=False, edgecolor='w')
plt.title('Difference between group means', fontsize=18)
plt.axvline(x=diff, ymin=0, ymax=1, color='r')
plt.annotate("Observed\ndifference\n($7.93$)", xy=(8.5, 5000), color='r', fontsize=14)
plt.xlabel("difference in means", fontsize=20)
plt.xticks(fontsize=14)
plt.ylabel("count", fontsize=20)
plt.yticks(fontsize=14)
plt.show()

png

On the histogram, we see that the observed difference is quite far from the mode of the distribution.

In other words, it appears that a difference of 7.93 or more (in magnitude) does not occur very often. Let’s quantify this.

Let’s compute the proportion of simulated trials where the (absolute value of the) difference exceeds the observed difference.

pSim = np.mean((np.abs(L) > diff))
print(f'Probability that the difference at least as extreme as {diff:0.2f} is due to chance (Simulation): {pSim*100:.2f}%')
Probability that the difference at least as extreme as 7.93 is due to chance (Simulation): 0.58%
This is not too bad, considering the t-test gave us 0.60%.

This result means that if we assume that the two groups are sampled from the same population of students, the probability of observing a difference in means of at least 7.93 between the groups just by random chance is only around 0.6%.

It is quite typical for the threshold for statistical significance to be set at 5%. Therefore, in this case, we’d conclude that the difference between the two groups is statistically significant. In other words, the teaching method has an impact on the marks. You might want to put that leotard away, stop the gyrations cause that ain’t dancing Sally!

Happy birthday to you!


The Birthday Problem is a classic problem in probability theory. We will use it to illustrate how to use simulation to get an estimate of the probability of some event occurring.

Imagine that you are at a party with $N$ people (including yourself). Let's assume that we have 365 days in a year and that the birthdays of the people attending the party are uniformly distributed, meaning that each of the 365 days is equally probable as a birthday (if you were born on February 29, you'll need to go to another party).

How large does $N$ need to be for the probability of having at least 2 people share a birthday to be at least 50%? What about 90%?

(Note that this is different from asking for the probability of anyone sharing a birthday with a given person)

We can easily work out the edge cases. We know that if we have 1 person at the party, the probability of 2 or more people sharing a birthday is 0, and if we have 366 or more people, that probability will be 1 owing to the pigeonhole principle.

It also makes intuitive sense that the probability of shared birthdays increases monotonically as the number of guests increases, therefore, we know that we will reach the 50% threshold somewhere between 2 and 365 guests. Not a great guesstimate by any means, but nonetheless, it gives us a starting point.

This is the kind of problem that can be solved using probability theory, however, let's see if we can get to the answer using simulation.

Let's break the problem down into pieces.

from math import *
import numpy as np

# suppress warning messages.
import warnings
warnings.filterwarnings('ignore')

# import the scipy module which contains scientific and stats functions.
import scipy as sp

import matplotlib.pyplot as plt
from matplotlib import cm
%matplotlib inline

# set the matplotlib style 
plt.style.use("seaborn-darkgrid")

from IPython.core.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

First, we define a numpy array called days that contains the day numbers for the year (1 to 365).

days = np.arange(1, 366)
days
array([  1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,  13,
        14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,  26,
        27,  28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,  39,
        40,  41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,  52,
        53,  54,  55,  56,  57,  58,  59,  60,  61,  62,  63,  64,  65,
        66,  67,  68,  69,  70,  71,  72,  73,  74,  75,  76,  77,  78,
        79,  80,  81,  82,  83,  84,  85,  86,  87,  88,  89,  90,  91,
        92,  93,  94,  95,  96,  97,  98,  99, 100, 101, 102, 103, 104,
       105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117,
       118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130,
       131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143,
       144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156,
       157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169,
       170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182,
       183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195,
       196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208,
       209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221,
       222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234,
       235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247,
       248, 249, 250, 251, 252, 253, 254, 255, 256, 257, 258, 259, 260,
       261, 262, 263, 264, 265, 266, 267, 268, 269, 270, 271, 272, 273,
       274, 275, 276, 277, 278, 279, 280, 281, 282, 283, 284, 285, 286,
       287, 288, 289, 290, 291, 292, 293, 294, 295, 296, 297, 298, 299,
       300, 301, 302, 303, 304, 305, 306, 307, 308, 309, 310, 311, 312,
       313, 314, 315, 316, 317, 318, 319, 320, 321, 322, 323, 324, 325,
       326, 327, 328, 329, 330, 331, 332, 333, 334, 335, 336, 337, 338,
       339, 340, 341, 342, 343, 344, 345, 346, 347, 348, 349, 350, 351,
       352, 353, 354, 355, 356, 357, 358, 359, 360, 361, 362, 363, 364,
       365])

Assuming there are $N=50$ people at the party, let’s create an array guests_bday that will contain $N$ day numbers randomly drawn (with replacement) from the array days.

We’ll use the random seed 42 to ensure that our results are reproducible.

N = 50
np.random.seed(42)

guests_bday = np.random.choice(days, N)
guests_bday
array([103, 349, 271, 107,  72, 189,  21, 103, 122, 215, 331,  88, 100,
       360, 152, 131, 150, 309, 258, 344, 294, 192, 277, 161, 314,  22,
       253, 236, 345,  49,  59, 170, 188, 271, 190, 175,  51, 364,  55,
       244, 320, 131, 307, 135,  21, 329, 167, 274,  89, 316])

The number of unique birthdays in guests_bday is given by

num_unique = np.unique(guests_bday).size
num_unique
46

In this example, we have 46 unique birthdays for 50 guests.

Now that we know how to compute the number of unique birthdays for a particular sample of $N$ guests, let’s write a function shared_birthdays that takes integers $N$ and n_trials as parameters, and returns the proportion of trials where two or more people share a birthday.

def shared_birthdays(N, n_trials=1000, seed=42):
    np.random.seed(seed)
    return np.mean([np.unique(np.random.choice(days, size=N)).size < N for _ in range(n_trials)])

Rather than manually changing the number of people at the party, let’s consider all possibilities between 0 and 100 at once (Hang on! A party with 0 people? “It happens”, the poor wretch replied, quickly looking away to wipe a tear with a désinvolture that belied his inner turmoil…).

number_of_people = range(101)
sim = np.array([shared_birthdays(N) for N in number_of_people])

We can now find the threshold above which we have a probability of shared birthday > 50%.

idx = np.where(sim > 0.5)[0][0]
number_of_people[idx]
23

According to our computation, if we have 23 or more people at the party, the probability of at least 2 people sharing a birthday is larger than 50%.

Similarly, to reach the 90% threshold…

idx = np.where(sim > 0.9)[0][0]
number_of_people[idx]
41

… we need at least 41 guests.

Let’s plot the result of the simulation

fig = plt.figure(figsize=(12, 6))
plt.plot(number_of_people, sim, 'g')
plt.axhline(y=0.5, c='r', alpha=0.5, linestyle='--')
plt.axhline(y=0.9, c='r', alpha=0.5, linestyle='--')
plt.xticks(number_of_people[::5], fontsize=14)
plt.yticks(np.arange(0, 1.05, 0.1), fontsize=14)
plt.ylim(0, 1.01)
plt.annotate("50%", xy = (93, 0.45), fontsize=14)
plt.annotate("90%", xy = (93, 0.85), fontsize=14)
plt.xlabel('number of guests', fontsize=16)
plt.ylabel('probability', fontsize=16)
plt.title('Probability of at least 2 people sharing a birthday', fontsize=18)
plt.show()

png

So how can we check these results against the exact probability? When trying to determine the probability of a particular event occurring, it's often useful to consider the probability of the event __not__ occurring. Since these two events are complementary (exactly one of them must happen), their probabilities must sum to 1.

In our case, given N < 366 guests, the probability of no two guests sharing a birthday can be computed in the following way.

Assume that no two guests share a birthday. Let's consider a guest which we'll label as guest1. This guest will take one "birthday slot" out of 365. We then choose another guest, guest2, whose birthday must fall on one of the 364 unoccupied slots left. We then choose another guest, guest3, whose birthday must fall on one of the 363 unoccupied slots left. And on and on till we choose guestN whose birthday must fall on one of the (365-N+1) unoccupied slots left (recall that N < 366).

We can summarise this by saying that there are $$365\times364\times\cdots\times(365-N+1)=\frac{365!}{(365-N)!}$$ possible combinations of guests and birthdays that don't lead to a shared birthday.

Since we're interested in proportions (probabilities), not the number of configurations, we need to divide this number by the total number of birthday configurations for $N$ guests, which is just $365^N$ (each guest can have any birthday independently of the others). However, recall that this is the probability of no shared birthdays. To get the result we want we need to subtract this probability from 1. Hence, the exact (under the assumptions of the problem) probability of at least 2 people among $N$ guests sharing a birthday is:$$1 - \frac{365!}{365^N(365-N)!}$$

Let's compute this probability.

def exact_probability_shared_bday(N):
    return 1 - factorial(365)/factorial(365-N)/365**N

exact = np.array([exact_probability_shared_bday(N) for N in number_of_people])

idx = np.where(exact > 0.5)[0][0]
print(f"We need {number_of_people[idx]} guests to reach the 50% threshold")
idx = np.where(exact > 0.9)[0][0]
print(f"We need {number_of_people[idx]} guests to reach the 90% threshold")
We need 23 guests to reach the 50% threshold
We need 41 guests to reach the 90% threshold

We get the same results as with the simulation!

Let’s plot the two results on the same graph.

fig = plt.figure(figsize=(12, 6))
plt.plot(number_of_people, sim, 'g', lw=2, alpha=0.8, label='simulation')
plt.plot(number_of_people, exact, 'k--', lw=1, label='exact')
plt.axhline(y=0.5, c='r', alpha=0.5, linestyle='--')
plt.axhline(y=0.9, c='r', alpha=0.5, linestyle='--')
plt.xticks(number_of_people[::5], fontsize=14)
plt.yticks(np.arange(0, 1.05, 0.1), fontsize=14)
plt.ylim(0, 1.01)
plt.annotate("50%", xy = (93, 0.45), fontsize=14)
plt.annotate("90%", xy = (93, 0.85), fontsize=14)
plt.xlabel('number of guests', fontsize=16)
plt.ylabel('probability', fontsize=16)
plt.title('Probability of at least 2 people sharing a birthday', fontsize=18)
plt.legend(loc=4, fontsize=16)
plt.show()

png

Can we generalise our simulation beyond pairs?

The first thing to notice is that our initial strategy isn’t going to work anymore.

np.unique(np.random.choice(days, size=N)).size < N basically checks whether we have more guests than unique birthdays. This is an easy way to check whether at least two guests share a birthday, but it will fail to produce the correct answer if we’re looking to find out whether at least M guests share a birthday, with M > 2.

Instead of looking at the difference between the number of guests and the number of unique birthdays, let’s count how many times each birthday occurs. We can do that with numpy’s bincount function.

While we’re at it, we might as well look at variations of this problem: the probability that the maximum number of people sharing a birthday is M, or the probability of having at most M people share a birthday.

First, let’s illustrate how bincount works.

np.random.seed(42)
a = np.random.randint(0, 10, 20)
a
array([6, 3, 7, 4, 6, 9, 2, 6, 7, 4, 3, 7, 7, 2, 5, 4, 1, 7, 5, 1])
np.bincount(a)
array([0, 2, 2, 2, 3, 2, 3, 5, 0, 1])

This gives us the number of occurrences of each value from 0 to 9 in a.

We can make things more explicit in the following way:

for value, count in zip(np.arange(a.max()+1), np.bincount(a)):
    print(f"The value {value} appears {count} times in a.")
The value 0 appears 0 times in a.
The value 1 appears 2 times in a.
The value 2 appears 2 times in a.
The value 3 appears 2 times in a.
The value 4 appears 3 times in a.
The value 5 appears 2 times in a.
The value 6 appears 3 times in a.
The value 7 appears 5 times in a.
The value 8 appears 0 times in a.
The value 9 appears 1 times in a.

Let’s use bincount to count the largest number of people sharing a birthday in each of our trials.

def at_least_M_shared_birthdays(N, M=2, n_trials=1000, seed=42):
    np.random.seed(seed)
    return np.mean([np.bincount(np.random.choice(days, size=N)).max() >= M for _ in range(n_trials)])

def at_most_M_shared_birthdays(N, M=2, n_trials=1000, seed=42):
    np.random.seed(seed)
    # the only difference with the previous function is >= becomes <=
    return np.mean([np.bincount(np.random.choice(days, size=N)).max() <= M for _ in range(n_trials)])

def a_maximum_of_M_shared_birthdays(N, M=2, n_trials=1000, seed=42):
    np.random.seed(seed)
    # the only difference with the first function is >= becomes ==
    return np.mean([np.bincount(np.random.choice(days, size=N)).max() == M for _ in range(n_trials)])

# put the three functions in a dictionary to make it easier to choose between them later.
shared_bday = {}
shared_bday['at least']      = at_least_M_shared_birthdays
shared_bday['at most']       = at_most_M_shared_birthdays
shared_bday['a maximum of']  = a_maximum_of_M_shared_birthdays

We create a utility function to run the simulation and plot the results.

Recall that M is the number of people sharing a birthday.

def plot_shared_birthday_prob(M=2, max_num_guests=100, how='at least'):

    number_of_guests = np.arange(1, max_num_guests+1)
    
    method = shared_bday[how] 

    sim = np.array([method(N, M=M) for N in number_of_guests]) 
    
    # let's get the 0.5 and 0.9 thresholds if they are reached.
    try:
        threshold50 = number_of_guests[np.where(sim > 0.5)[0][0]] 
    except IndexError:
        threshold50 = None
        
    try:
        threshold90 = number_of_guests[np.where(sim > 0.9)[0][0]] 
    except IndexError:
        threshold90 = None   
    
    step = ceil(max_num_guests / 10)
    
    fig = plt.figure(figsize=(12, 6))
    plt.plot(number_of_guests, sim, 'g')
    plt.axhline(y=0.5, c='r', alpha=0.5, linestyle='--')
    plt.axhline(y=0.9, c='r', alpha=0.5, linestyle='--')
    plt.xticks(np.array(number_of_guests)[::step]-1, fontsize=14)
    plt.xlim(1, max_num_guests+1)
    plt.yticks(np.arange(0, 1.05, 0.1), fontsize=14)
    plt.ylim(0, 1.01)
    plt.xlabel('number of guests', fontsize=16)
    plt.ylabel('probability', fontsize=16)
    plt.title(f'Probability of {how} {M} people sharing a birthday', fontsize=18)

    if threshold50 is not None and how == "at least":
        print(f"50% threshold for {M} shared birthdays reached for {threshold50} guests.")
    if threshold90 is not None and how == "at least":
        print(f"90% threshold for {M} shared birthdays reached for {threshold90} guests.")
    
    return fig

As a sanity check, let’s redo the computation for pairs (M=2).

fig = plot_shared_birthday_prob(M=2, max_num_guests=100)
50% threshold for 2 shared birthdays reached for 23 guests.
90% threshold for 2 shared birthdays reached for 41 guests.

png

Looking good!

With this new code we can now ask a different question.

What is the probability that we have at most 2 people sharing a birthday?

fig = plot_shared_birthday_prob(M=2, max_num_guests=100, how='at most')

png

This is a monotonically decreasing function of the number of guests (aside from the random noise), which makes sense.

This isn’t the most interesting of results so we’ll skip it henceforth.

Let’s look at the probability of the maximum number of shared birthdays being 2.

fig = plot_shared_birthday_prob(M=2, max_num_guests=100, how='a maximum of')

png

That’s interesting. The probability starts off like the probability of observing at least 2 people sharing a birthday, but it never reaches the 90% threshold. Instead, after around 45 or so guests the probability starts decreasing. This of course makes sense: as the number of guests increases, we reach a point where having more than 2 people share a birthday is no longer unlikely. When the number of guests is larger than 87, the probability of having only singletons (unique birthdays) or pairs dips below 50%. Let’s keep this in mind…

What about trios?

fig = plot_shared_birthday_prob(M=3, max_num_guests=250)
50% threshold for 3 shared birthdays reached for 87 guests.
90% threshold for 3 shared birthdays reached for 132 guests.

png

Notice that the probability of having 3 or more people sharing a birthday reaches the 50% mark at 87 guests.

This is the same value for which the probability of having no more than 2 people share a birthday drops below 50%.

This is not a coincidence: these two events are complementary. You can either have at most 2 people sharing a birthday ($M \leqslant 2$) or more than 2 people sharing a birthday ($M>2$, which is equivalent to $M \geqslant 3$ since we don’t usually chop up the guests into pieces).

What about having a maximum of 3 people share a birthday?

fig = plot_shared_birthday_prob(M=3, max_num_guests=250, how='a maximum of')

png

Similarly to the pairs, the trios see their probability wane after about 130 guests, and it dips below 50% at about 188 guests. We can now guess what happens: once the number of guests reaches 188, having 4 or more people share a birthday becomes more likely than not. Which takes us to…

Quartets

fig = plot_shared_birthday_prob(M=4, max_num_guests=500)
50% threshold for 4 shared birthdays reached for 188 guests.
90% threshold for 4 shared birthdays reached for 259 guests.

png

fig = plot_shared_birthday_prob(M=4, max_num_guests=500, how='a maximum of')

png

Once again we get that now familiar shape (notice that the shape becomes more symmetrical as M increases).

The decrease in probability heralds the arrival of…

Quintets

fig = plot_shared_birthday_prob(M=5, max_num_guests=500)
50% threshold for 5 shared birthdays reached for 314 guests.
90% threshold for 5 shared birthdays reached for 413 guests.

png

fig = plot_shared_birthday_prob(M=5, max_num_guests=1000, how='a maximum of')

png

Sextets

(This is starting to take some time to run…)

fig = plot_shared_birthday_prob(M=6, max_num_guests=1000)
50% threshold for 6 shared birthdays reached for 457 guests.
90% threshold for 6 shared birthdays reached for 586 guests.

png

fig = plot_shared_birthday_prob(M=6, max_num_guests=1000, how='a maximum of')

png

We are barely reaching the 50% threshold now.

And one last one for the road.

Septets

fig = plot_shared_birthday_prob(M=7, max_num_guests=1000)
50% threshold for 7 shared birthdays reached for 616 guests.
90% threshold for 7 shared birthdays reached for 768 guests.

png

fig = plot_shared_birthday_prob(M=7, max_num_guests=1000, how='a maximum of')

png

It looks like, were we to smooth out this plot, the probability would never reach 50%.

There you have it!

Hopefully, this little exploration of the birthday problem illustrates how useful simulation can be. In the case of the generalised version of the problem beyond pairs, we were able to simulate the problem easily by making a small modification to our original code.

Perhaps one question remains. How can we be sure that our simulation gives a reasonable (however we want to define it) approximation of the answer?

Well, for pairs we compared our code to the exact result. Can we do the same for M > 2?

It turns out that there exists an exact answer to this problem for the general case, however it is much more involved than our little derivation for M = 2.

For more information I’ll refer you to this paper by Fisher, Funk, and Sams.

We can nonetheless compare our simulation to the authors’ exact result for the 50% threshold for the values of M they used:

| M              | 2  | 3  | 4   | 5   | 6   | 7   |
|----------------|----|----|-----|-----|-----|-----|
| exact          | 23 | 88 | 187 | 313 | N/A | N/A |
| our simulation | 23 | 87 | 188 | 314 | 457 | 616 |

We’re in the right ball park. We could probably improve our results a bit by using more than 1000 trials.

It might also be interesting to record the variance across the trials during our simulation.

This, as they say, will be left as an exercise for the reader. The next time you’re at a birthday party with more than 23 people present, be sure to sing “Haaaaappy birthday to youse!”; the pairs, trios, quartets, etc… will be grateful (the person organising the birthday party, not so much, oddly enough. There’s no pleasing some people anyway).

Do you feel lucky?


A coin flipped 100 times comes up heads 60 times. How likely is it that the coin is fair?

This is a basic example of statistical inference, trying to infer an estimate of some quantity (here the probability for heads) from some data (observation of 100 coin flips). This type of problem is invariably described in every first year stats course. To solve it, we first need to represent the underlying process generating our observations in a mathematical form. For coin flips, the binomial distribution is the natural choice. Let's affectionately call our random variable representing the number of heads $X$.

The probability of observing k heads in a run of N coin flips is given by the probability mass function (pmf):

$$P(X=k) = \left(\begin{array}{c}N \\ k \end{array} \right)p^k\, (1-p)^{N-k} = \dfrac{N!}{k!(N-k)!}\,p^k\, (1-p)^{N-k}$$

But where does this formula come from? It's actually quite simple. If we denote by p the probability of heads (outcome H) for one coin flip, the probability of tails (outcome T) is $1-p$ because we only have 2 mutually exclusive outcomes (we assume the coin doesn't land sideways).

If we flip the coin 2 times, since each coin flip is independent, we multiply the probabilities of the outcomes of each flip. For instance, the probability of observing HH (two heads) is $p^2$. The probability associated with TT is $(1-p)^2$, while the one associated with HT is $p(1-p)$, etc...

So if we flip the coin $N$ times, what is the probability of observing $k$ heads? Well, it's the probability of observing any one particular configuration with $k$ heads and $N-k$ tails, which is $p^k(1-p)^{N-k}$, times the number of such configurations. To count the configurations we simply need to formalise the fact that, for this problem, the ordering doesn't matter. As far as we are concerned, HTH, HHT, and THH are equivalent, they all represent 2 heads in 3 coin flips. Instead of thinking about one coin flipped $N$ times, let's consider the equivalent problem of $N$ identical coins flipped once each.

How many ways can we arrange the $N$ coins? We've got $N$ options when picking the first coin. Once that's done, we've got $N-1$ ways of choosing the second coin, and so on, all the way to the last coin for which there's only one choice left to make, namely that very coin... This means that there are $N\times(N-1)\times\cdots\times 2\times 1$ ways of arranging the $N$ coins, which is the factorial of $N$, or $N!$.

Now suppose that $k\leq N$ of these coins have landed heads up. This means of course that $N-k$ coins show tails. If we were to reorder the $k$ heads-up coins among themselves, this would not change our result. We'd still have $k$ heads out of $N$ flips. Same with the $N-k$ tails. Therefore, when it comes to counting the number of configurations for which we have $k$ heads, we should divide the total number of ways to arrange $N$ coins by the number of ways we can rearrange the $k$ heads among themselves, and the $N-k$ tails among themselves. This ends up being the binomial coefficient called "N choose k":

$$\left(\begin{array}{c}N \\ k \end{array}\right) = \frac{N!}{k!(N-k)!}.$$

Hence the formula above.
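
As a quick sanity check of the counting argument, take the 2-heads-in-3-flips example above:

$$\left(\begin{array}{c}3 \\ 2 \end{array}\right) = \frac{3!}{2!\,1!} = 3,$$

which matches the three configurations HHT, HTH, and THH.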

One minor detail is that we don't know the value of $p$, the probability that a coin flip lands heads up. For a fair coin, we of course expect that $p=0.5$. Using the formula above we can also compute the probability of 60 heads out of 100 coin flips for a fair coin.

This will be our starting point in figuring out how to compute an estimate of $p$.

Let’s import all the modules we need.

# usual stuff
from math import *
import numpy as np

# suppress warning messages.
import warnings
warnings.filterwarnings('ignore')

# import the scipy module which contains scientific and stats functions.
import scipy as sp

# import matplotlib and redirect plots to jupyter notebook. (cm is for color maps)
import matplotlib.pyplot as plt
from matplotlib import cm
%matplotlib inline

# sets the default figure size 
import matplotlib as mpl
mpl.rcParams['figure.figsize'] = (10, 6)

# sets some default plotting styles.
import seaborn as sns
sns.set_style("darkgrid")

from IPython.core.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

Let’s write a Python function that computes the formula above.

N = 100
k = 60
p = 0.5

def choose(N, k):
    return int(factorial(N)/(factorial(k) * factorial(N-k) ))

def binom(N, k, p):
    return choose(N, k) * p**k * (1-p)**(N-k)

binom(N, k, p)
0.010843866711637987

So the pmf tells us that the probability of getting 60 heads out of 100 coin flips is a hair above 1%, if the coin is fair.

Note that this does not tell us if the coin is fair. So let's determine how likely we are to observe 60 heads or more in a sequence of 100 flips from a fair coin.

There are typically two ways this problem is solved in introductory stats courses.

Method 1: Binomial Distribution

One-tailed binomial test.

Under the hypothesis that the coin is fair, the probability of observing at least 60 heads in a run of 100 coin flips can be computed using a one-tailed binomial test. One-tailed means we only care about deviations in one direction, either above or below the result expected under the hypothesis that the coin is fair. Here we only care about the probability of observing 60 or more heads. The scipy module has an implementation of the two-tailed binomial test whose result we can divide by 2 to get the one-tailed value (note that dividing by 2 is only valid here because we are testing the hypothesis that the coin is fair, and for p=0.5 the distribution is symmetric).

sp.stats.binom_test(k, n=N, p=0.5)/2
0.02844396682049039

That's around 2.8%. This number is nothing more than the sum of the probabilities for all the values of k between 60 and 100.

$$\sum_{k = 60}^{100}\left(\begin{array}{c}N\\k \end{array} \right)p^k\, (1-p)^{N-k}$$

Here's a check using our binom function if you don't believe me.

sum(binom(N, k, p) for k in range(60, 101))
0.028443966820490392

Note:

The sum version has the advantage of stating explicitly what the computation is doing. We'll use this approach henceforth whenever we need to perform a binomial test.

Method 2: Normal approximation to the binomial distribution

One-tailed Z-test

When we have a large enough number of coin flips, we can approximate the discrete binomial distribution by a suitably defined continuous normal (gaussian) distribution with mean 50 ($=Np$) and standard deviation $\sqrt{Np(1-p)}$, courtesy of the famous central limit theorem.

This allows us to use the beloved Z test of introductory stats courses to determine the probability of observing at least 60 heads, under the hypothesis that the coin is fair. The Z test is just a way to compute how likely it is to observe a value of the random variable that is beyond a given distance from its assumed mean, using units of the standard deviation.

Once again, scipy can help us by supplying the cumulative distribution function (cdf) for our normal distribution. The cdf is a function that tells us the probability of getting a value lower than, or equal to, the value we are considering; it increases monotonically towards 1. By subtracting the cdf at a given value z_0 from 1, we get the probability of observing a result at least as large as z_0.

The probability we want is:

N = 100
k = 60
p = 0.5

1 - sp.stats.norm.cdf(k-0.5, loc=50, scale = sqrt(N*p*(1-p)))  # notice the -0.5?
0.028716559816001852

We could have computed this directly by first computing the Z score for k, which is nothing more than the observed count of heads minus its average for a fair coin (50), divided by its standard deviation.

mean = 50
sd = sqrt(N*p*(1-p)) # standard deviation (the mean is 50 for a fair coin)
zscore = (k-0.5 - mean)/sd  # notice the -0.5?

1 - sp.stats.norm.cdf(zscore)
0.028716559816001852

Note:

Why did we subtract 0.5 from k? It's called a continuity correction. We need it to compensate for the fact that we are substituting a continuous distribution for a discrete one.
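
To see what the correction buys us, here's a small comparison sketch (reusing the fair-coin parameters from above; binom_pmf below is just a copy of our earlier binom function) of the exact binomial tail probability against the normal approximation, with and without the 0.5 shift. The corrected value should land noticeably closer to the exact result.

from math import sqrt, factorial
import scipy.stats as stats

N, k, p = 100, 60, 0.5

def binom_pmf(N, j, p):
    # same formula as our binom function above
    return factorial(N) // (factorial(j) * factorial(N - j)) * p**j * (1 - p)**(N - j)

exact = sum(binom_pmf(N, j, p) for j in range(k, N + 1))  # exact binomial tail
sd = sqrt(N * p * (1 - p))

no_correction = 1 - stats.norm.cdf(k, loc=N * p, scale=sd)          # uses k directly
with_correction = 1 - stats.norm.cdf(k - 0.5, loc=N * p, scale=sd)  # continuity correction

print(f"exact binomial tail:     {exact:.6f}")
print(f"normal, no correction:   {no_correction:.6f}")
print(f"normal, with correction: {with_correction:.6f}")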

Method 3: Simulation

This is all nice and peachy, but deep down, do we really understand everything we’ve done above? Coin flips are a fairly trivial concept, so our reasoning should be straightforward, however, what would we have done if we hadn’t been provided with the pmf of the binomial distribution? Since we derived it from “first principles” above, it wouldn’t have been much of a problem. However, could we have worked out the normal approximation to the binomial distribution (a tad trickier perhaps, with that continuity-correction business)? There’s got to be a better way! Not really, but there’s a different way: simulation.

The question is how likely is it for a fair coin to land 60 or more times heads up in a sequence of 100 flips?

One way to find out is to flip a fair coin 100 times and record the number of heads. Then we flip the coin another 100 times and record the number of heads. Then we flip the coin another 100 times and record the number of heads. Then we flip the coin another 100 times and record the number of heads. Then we flip the coin another 100 times and record the number of heads. Then we flip the coin another 100 times and record the number of heads. Then we flip the coin another 100 times and record the number of heads. Then we flip the coin another 100 times and record the number of heads. Then we flip the coin another 100 times and record the number of heads… You get the gist.

If we repeat this a large enough number of times, we should get an estimate of how likely it is for a fair coin to land heads up 60 or more times in a run of 100 flips simply by computing the proportion of trials where we got 60 or more heads.

Let the computer flip some coins for us!

numpy has a very useful function called random.randint that picks integers uniformly at random from a given range (note that the upper bound is excluded).

For instance, let 1 correspond to heads and 0 to tails, we can simulate 10 flips of a fair coin with:

np.random.randint(0, 2, size=10)
array([1, 1, 1, 1, 0, 0, 1, 0, 0, 1])

Note that we can simulate multiple trials of 10 flips by passing a tuple to the size parameter of the function.

Here’s for instance a series of 3 trials with 10 flips each.

np.random.randint(0, 2, size=(3, 10))
array([[1, 0, 1, 1, 0, 0, 0, 1, 0, 0],
       [1, 0, 1, 0, 1, 0, 0, 0, 0, 0],
       [1, 1, 1, 0, 0, 1, 0, 1, 0, 0]])

Let’s flip our fair coin 200 million times.

We’ll simulate 2 million trials of 100 coin flips. Since we get 1 for heads and 0 for tails, the number of heads in a trial is simply the sum of all the values observed in that trial.

We set the random seed so that we all get the same result, not that it matters that much here.

np.random.seed(0)

N = 100  # number of flips per trial
repeats = 2000000   # number of trials

# we flip the "coin" N times and compute the sum of 1s (heads).
# the whole thing is repeated "repeats" times.
coin_flips = np.random.randint(0, 2, size=(repeats, N)).sum(axis=1)

coin_flips.shape
(2000000,)

Let’s compute the normal approximation to the binomial distribution for our case.

(this is so that we can plot it alongside our histogram)

p = 0.5  # probability of heads for a fair coin
x = np.linspace(coin_flips.min(), coin_flips.max(), 100)   # 100 evenly spaced values between the min and max counts
ndist = sp.stats.norm.pdf(x, loc=p * N, scale = sqrt(N*p*(1-p))) # we use scipy to compute the normal approximation

Let’s plot the histogram of our simulation, as well as the normal approximation.

The red vertical line marks the observed result (number of heads).

plt.hist(coin_flips, bins=50, density=True, align='mid', edgecolor='w') # plot the histogram
plt.axvline(x=k, ymin=0, ymax=1, color='r', label="observed # heads")  # plot a vertical line at x = k
plt.xlabel('k', fontsize=16) # label for the x axis
plt.ylabel('$P(k)$', fontsize=16) # label for the y axis
plt.title('{} coin flips repeated {:,d} times'.format(N, repeats), fontsize=16) # title of the figure
plt.plot(x, ndist, 'c', lw=3, label="normal approx.") # plot normal approximation curve
plt.legend(fontsize=14, loc=0)  # add legend
plt.show() # display the result

png

Let’s extract the proportion of our 100 coin flip trials which have 60 or more heads from our simulation.

np.mean(coin_flips >= k)
0.0285285

Note:

Computing the mean is the same as adding all the values for which the condition is true, which gives us the number of trials where we got at least k heads, and dividing by the number of trials, which gives us the proportion we're after.

np.sum(coin_flips >= k) / repeats
0.0285285

Note:

In the context of statistical inference (the frequentist kind), this probability is called a p-value. It gives us the probability of observing a result as or more extreme than the one observed ($X\geqslant 60$) if the null hypothesis ($p=0.5$) is true. It also sometimes gives statistics a bad name...

Not a bad estimate of the probability of having $X\geqslant 60$ under the (null) hypothesis that the coin is fair.

So it looks like our simulation works for 60 or more heads. It would be nice to check that it works for other values.

We could either change the value of k above, and re-run our code, or we could copy and paste the code and change the value of k, but neither approach is good programming practice. Whenever we realise that we’d like to run the same piece of code multiple times, for different values of parameters, we should wrap our code in a function.

Let’s do that.

def compute_pvalue(k=50):
    print("p-value",
          "\nOur simulation:         ", np.mean(coin_flips >= k),
          "\nScipy (Binomial test):  ", sum(binom(N, kvalue, 0.5) for kvalue in range(k, 101)),
          "\nScipy (Normal Approx):  ", 1 - sp.stats.norm.cdf(k-0.5, loc=50, scale = sqrt(N*p*(1-p))))
    
    plt.hist(coin_flips, bins=50, density=True, edgecolor='w')
    plt.plot(x, ndist, 'c', lw=3)
    
    plt.axvline(x=k, ymin=0, ymax=1, color='r')
    
    plt.xlabel('k', fontsize=16)
    plt.ylabel('$P(k)$', fontsize=16)
    plt.title('{} coin flips repeated {:,d} times'.format(N, repeats), fontsize=16)

    plt.show()

compute_pvalue(60)
p-value 
Our simulation:          0.0285285 
Scipy (Binomial test):   0.028443966820490392 
Scipy (Normal Approx):   0.028716559816001852

png

We've considered the probability that the number of heads is at least k in a run of 100 coin flips.

Let's write a version of the code above that will compute, and visualise the probability that the number of heads is exactly k in a run of 100 coin flips, under the hypothesis that the coin is fair.

def compute_pvalue_B(k=50):
    print("p-value",
          "\nOur simulation:         ", np.mean(coin_flips == k),
          "\nScipy (Binomial test):  ", binom(N, k, 0.5),
          "\nScipy (Normal Approx):  ", sp.stats.norm.cdf(k+0.5, loc=50, scale = sqrt(N*p*(1-p))) - sp.stats.norm.cdf(k-0.5, loc=50, scale = sqrt(N*p*(1-p))))
    
    plt.hist(coin_flips, bins=50, edgecolor='w', density=True)
    plt.plot(x, ndist, 'c', lw=3)
    
    plt.axvline(x=k, ymin=0, ymax=1, color='r')
    
    plt.xlabel('k', fontsize=16)
    plt.ylabel('$P(k)$', fontsize=16)
    plt.title('{} coin flips repeated {:,d} times'.format(N, repeats), fontsize=16)

    plt.show()
    
compute_pvalue_B(60)
p-value 
Our simulation:          0.0108845 
Scipy (Binomial test):   0.010843866711637987 
Scipy (Normal Approx):   0.010852139253185289

png

Let's now write a version of the code above that will compute, and visualise the probability that the number of heads is between two values k_1 and k_2 in a run of 100 coin flips, under the hypothesis that the coin is fair (assume that $0\leqslant k_1 \leqslant 50$ and $50 \leqslant k_2 \leqslant 100$).

def compute_pvalue(k1=40, k2=50):
    k1, k2 = min(k1, k2), max(k1, k2)
    print("p-value",
          "\nOur simulation:         ", np.mean(np.logical_and(coin_flips >= k1, coin_flips <= k2)),
          "\nScipy (Binomial test):  ", np.sum([binom(N, k, 0.5) for k in range(k1, k2+1)]),
          "\nScipy (Normal Approx):  ", sp.stats.norm.cdf(k2+0.5, loc=50, scale = sqrt(N*p*(1-p)))-
                                        sp.stats.norm.cdf(k1-0.5, loc=50, scale = sqrt(N*p*(1-p))) )
    
    plt.hist(coin_flips, bins=50, density=True, edgecolor='w')
    plt.plot(x, ndist, 'c', lw=3)
    
    plt.axvline(x=k1, ymin=0, ymax=1, color='g')
    plt.axvline(x=k2, ymin=0, ymax=1, color='r')
    
    plt.xlabel('k', fontsize=16)
    plt.ylabel('$P(k)$', fontsize=16)
    plt.title('{} coin flips repeated {:,d} times'.format(N, repeats), fontsize=16)

    plt.show()
    
compute_pvalue(k1=45, k2=60)
p-value 
Our simulation:          0.847064 
Scipy (Binomial test):   0.8467733878542303 
Scipy (Normal Approx):   0.8464695184908008

png


Dice roll

Aside from its usefulness in statistical inference, simulation is a nice way to explore problems which involve counting occurrences of a certain event. Since this is the basic concept behind probability theory, simulation can be a convenient way to tackle probabilistic problems.

Let's look at a simple example.

Assume that we have a regular, fair, six-sided die with sides numbered 1 to 6.

What is the probability of getting:

  • 6 when rolling the die once?
  • no 6 when rolling the die once?

What about the probability of getting:

  • no 6 when rolling the die twice?
  • a single 6 when rolling the die twice?
  • two 6s when rolling the die twice?

We can of course enumerate (count) all the ways you can obtain the required outcome, and normalise (divide) this by all the possible outcomes, but it’s not a great general strategy. A better way is to work out the exact mathematical formula for the probability we’re interested in. We’ll do that shortly, but first let’s just simulate the process of rolling dice.

Let's create a function roll_dice that takes two integers n_rolls and n_trials as arguments, and returns an array with the number of 6s obtained in each of the n_trials trials when we roll n_rolls dice at the same time.

For instance, roll_dice(2, 10) would return an array of 10 values representing the number of 6s observed when rolling 2 dice in each of the 10 trials.

We'll fix the random seed for reproducibility.

def roll_dice(n_rolls, n_trials):
    np.random.seed(42)
    die_rolls = (np.random.randint(1, 7, size = (n_trials, n_rolls)) == 6).sum(axis=1)
    return die_rolls
roll_dice(2, 10)
array([0, 0, 0, 0, 0, 0, 1, 0, 2, 0])

Note that the values in the array represent the number of 6s observed in each of the 10 trials.

The function below will visualise the result of roll_dice as a histogram.

It uses the roll_dice function to compute the counts of 6s across n_trials trials by default.

It will also output the numerical value for the proportion of 6s in each case.

def plot_histogram(n_rolls = 3, n_trials = 100000):
    np.random.seed(12345)
    rolls = roll_dice(n_rolls=n_rolls, n_trials=n_trials)
    values = plt.hist(rolls, 
                      bins=np.arange(n_rolls+2), 
                      density=True, 
                      align='left', 
                      alpha = 0.6, 
                      edgecolor='k'
                    )
    plt.xticks(np.arange(n_rolls+1), fontsize=13)
    plt.yticks(np.linspace(0, 1, 11), fontsize=13)
    plt.title(f'Number of 6s in {n_rolls} rolls over {n_trials:,} trials',
              fontsize=14)
    print("\n".join([f"Number of 6s: {i:3d} | Proportion: {values[0][i]:.6f}" 
                     for i in range(n_rolls+1)]))
    plt.show()
    return
plot_histogram(n_rolls=2)
Number of 6s:   0 | Proportion: 0.695630
Number of 6s:   1 | Proportion: 0.276400
Number of 6s:   2 | Proportion: 0.027970

png
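
As a quick check, the exact probabilities for two rolls are $\left(\frac{5}{6}\right)^2 = \frac{25}{36} \approx 0.694$ for no 6, $2\times\frac{1}{6}\times\frac{5}{6} = \frac{10}{36} \approx 0.278$ for a single 6, and $\frac{1}{36} \approx 0.028$ for two 6s, in good agreement with the simulated proportions above.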

plot_histogram(n_rolls=3)
Number of 6s:   0 | Proportion: 0.580000
Number of 6s:   1 | Proportion: 0.345960
Number of 6s:   2 | Proportion: 0.069400
Number of 6s:   3 | Proportion: 0.004640

png

Assuming that we have regular, fair, six-sided dice, what is the probability of getting exactly five 6s when we roll 10 dice at the same time (or the same die 10 times in a row)?

Hints:

The choose function we introduced earlier gives us the number of ways you can pick k interchangeable items from a collection of N elements.

Using the choose function we introduced earlier, we can write down the proportion of rolls where we get exactly 5 sixes. In 10 rolls, we have $6^{10}$ possible outcomes. If we observe 5 sixes, this means that we also have 5 values other than 6. These 5 values come in $5^5$ combinations (5 ways to choose something other than 6 for each of the 5 dice), and there are "10 choose 5" ways of picking which 5 dice don't show a 6. Hence:

choose(10, 5) * 5**5 / 6 ** 10
0.013023810204237159

Meaning that the probability of observing five 6s in 10 rolls is about 1.3%.

We can check that against our dice roll simulation code.

plot_histogram(n_rolls=10, n_trials=1000000)
Number of 6s:   0 | Proportion: 0.161252
Number of 6s:   1 | Proportion: 0.323586
Number of 6s:   2 | Proportion: 0.290073
Number of 6s:   3 | Proportion: 0.155074
Number of 6s:   4 | Proportion: 0.054466
Number of 6s:   5 | Proportion: 0.013067
Number of 6s:   6 | Proportion: 0.002203
Number of 6s:   7 | Proportion: 0.000253
Number of 6s:   8 | Proportion: 0.000026
Number of 6s:   9 | Proportion: 0.000000
Number of 6s:  10 | Proportion: 0.000000

png

The exact result.

Let's implement the formula for an arbitrary number k of 6s when rolling N fair dice ($k\leqslant N$).

We'll call this function number_of_sixes(N, k).

We’ll use this function to:

  1. check the simulation results in the histogram above.
  2. compute the probability of getting at least five 6s in 10 rolls of a die.
  3. compute the probability of getting at most five 6s in 10 rolls of a die.

def number_of_sixes(N, k):
    return choose(N, k)*5**(N-k)/6**N

for i in range(11):
    print("Probability of {:2d} 6s in 10 rolls: {:.6f}".format(i, number_of_sixes(10, i)))
    
print("\nProbability of at least five 6s in 10 rolls: {:.6f}".format(sum(number_of_sixes(10, k) for k in range(5, 11))))

print("\nProbability of at most five 6s in 10 rolls: {:.6f}".format(sum(number_of_sixes(10, k) for k in range(6))))
Probability of  0 6s in 10 rolls: 0.161506
Probability of  1 6s in 10 rolls: 0.323011
Probability of  2 6s in 10 rolls: 0.290710
Probability of  3 6s in 10 rolls: 0.155045
Probability of  4 6s in 10 rolls: 0.054266
Probability of  5 6s in 10 rolls: 0.013024
Probability of  6 6s in 10 rolls: 0.002171
Probability of  7 6s in 10 rolls: 0.000248
Probability of  8 6s in 10 rolls: 0.000019
Probability of  9 6s in 10 rolls: 0.000001
Probability of 10 6s in 10 rolls: 0.000000

Probability of at least five 6s in 10 rolls: 0.015462

Probability of at most five 6s in 10 rolls: 0.997562

Our simulation results above are in good agreement with these exact values.

Seinfeld's Magnificent Seven

In this post I want to take a look at a classic data set: the US baby names.

While the origin of many names can be found in popular culture and (what's the alternative to popular culture? unpopular culture?) elsewhere, we shall focus on how the TV show Seinfeld may have influenced the naming of hundreds of innocent babies.

We'll roughly follow the steps outlined by Wes McKinney in his excellent book Python for Data Analysis to load the data.

Let's first start by importing a couple of modules.

from math import *
import numpy as np
import pandas as pd

import os

from pathlib import Path

%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib as mpl
plt.style.use('seaborn-white')

from IPython.core.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

A. Loading the data

The dataset can be downloaded from http://www.ssa.gov/oact/babynames/limits.html.

(Follow the link, download the dataset into a directory called data/names, and unzip it).

The dataset contains, for each year since 1880, every name that was given to at least five babies of a given sex.

The data for each year is in its own text file. For instance, the data for 1880 is in the file yob1880.txt (which should be located in the data/names/ folder).

The first thing to do is merge the data for all the different years into a single dataframe, however, as a warm up, let’s look at the data for a single year.

input_path = Path('data/names/yob1880.txt')
!head {input_path}
Mary,F,7065
Anna,F,2604
Emma,F,2003
Elizabeth,F,1939
Minnie,F,1746
Margaret,F,1578
Ida,F,1472
Alice,F,1414
Bertha,F,1320
Sarah,F,1288
names1880 = pd.read_csv(input_path, names=['name', 'sex', 'births'])
names1880.head()
name sex births
0 Mary F 7065
1 Anna F 2604
2 Emma F 2003
3 Elizabeth F 1939
4 Minnie F 1746

Let’s try to automate the loading of all the files.

First we need a list of all the files in our directory.

A Path object has a glob method that returns a generator with the elements within a given directory.

dataset_dir = Path('data/names')
files = [item.name for item in sorted(dataset_dir.glob("*.txt"))]
# Extract a list of years
years = [item[3:7] for item in files]

We gather the data into a single dataset in two steps:

  1. We create a list of dataframes where each element corresponds to a particular year.
  2. We concatenate all the data (why can’t we use merge here?) into a single dataframe.
L = []
for file, year in zip(files, years):
    path = Path(dataset_dir, file)
    df = pd.read_csv(path, names=['name', 'sex', 'births'])
    df['year'] = year
    L.append(df)
    
names = pd.concat(L)    
names.head()
name sex births year
0 Mary F 7065 1880
1 Anna F 2604 1880
2 Emma F 2003 1880
3 Elizabeth F 1939 1880
4 Minnie F 1746 1880

It's always a good idea to check the data types of our features (columns). Don't assume that the data you downloaded is in the right format.

If we fix any issue early, it'll make the analysis a lot easier.

names.dtypes
name      object
sex       object
births     int64
year      object
dtype: object

The year feature is of type object, which is pandas’ way of telling us the column holds generic Python objects (here, strings).

We’d like it to be a datetime type so let’s fix that.

Pandas has several features that make manipulating date and timestamp objects easy.

(We’ll also change the type of sex to category while we’re at it)

names['year'] = pd.to_datetime(names['year'])
names['sex'] = names.sex.astype('category')
names.head()
name sex births year
0 Mary F 7065 1880-01-01
1 Anna F 2604 1880-01-01
2 Emma F 2003 1880-01-01
3 Elizabeth F 1939 1880-01-01
4 Minnie F 1746 1880-01-01

Notice how our year column has been changed into a datetime object.

By default, since we only specified the year, the datetime object is set at time 00:00:00 on the 1st of January.
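
For instance, converting just a year gives a timestamp pinned to midnight on January 1st of that year:

pd.to_datetime('1880')
Timestamp('1880-01-01 00:00:00')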

names.dtypes
name              object
sex             category
births             int64
year      datetime64[ns]
dtype: object

B. Exploring the data

So how big is our dataset?

names.count()
name      1924665
sex       1924665
births    1924665
year      1924665
dtype: int64

How is the data split on the sex of the baby?

names.sex.value_counts()
F    1138293
M     786372
Name: sex, dtype: int64

Let’s get percentages.

names.sex.value_counts(normalize=True).apply(lambda x: f'{x*100:.0f}%')
F    59%
M    41%
Name: sex, dtype: object

Let’s find the total number of births for each year, for both sexes.

total_births_per_year_per_sex = names.pivot_table('births', index='year', columns='sex', aggfunc=sum)
total_births_per_year_per_sex.head()
sex F M
year
1880-01-01 90993 110491
1881-01-01 91953 100743
1882-01-01 107847 113686
1883-01-01 112319 104627
1884-01-01 129020 114442

Note:

The pivot_table method is just a convenient shortcut for a groupby operation.

names.groupby(['year', 'sex']).births.sum().unstack().head()
sex F M
year
1880-01-01 90993 110491
1881-01-01 91953 100743
1882-01-01 107847 113686
1883-01-01 112319 104627
1884-01-01 129020 114442
total_births_per_year_per_sex.plot(figsize=(12, 6), 
                                   title="Total births by sex and year", 
                                   lw=3, 
                                   color=('r', 'g'), 
                                   alpha=0.5);

png

How many unique names are there for each year and gender?

unique_names = names.groupby(['year', 'sex'])['name'].nunique().unstack()
unique_names.plot(figsize=(12, 6), title="Unique names by sex and year", lw=3, color=('r', 'g'), alpha=0.5);

png

How many names from 1880 were still used in the most recent year we have, and which ones have fallen into desuetude?

last_year = names.year.dt.year.max()

s1880 = set(names[names.year.dt.year == 1880]['name'].unique())
slast = set(names[names.year.dt.year == last_year]['name'].unique())

print(f"""
There were {len(s1880)} distinct names recorded in 1880
There were {len(slast)} distinct names recorded in {last_year}\n
{len(slast.intersection(s1880))} names were found in both years
{len(s1880.difference(slast))} names found in 1880 were not recorded in {last_year}
{len(slast.difference(s1880))} names found in {last_year} were not recorded in 1880
""")

print(f"The names recorded in 1880 but no longer in use in {last_year} are:\n\n", 
      sorted([item for item in s1880.difference(slast)]))
There were 1889 distinct names recorded in 1880
There were 29910 distinct names recorded in 2017

1501 names were found in both years
388 names found in 1880 were not recorded in 2017
28409 names found in 2017 were not recorded in 1880

The names recorded in 1880 but no longer in use in 2017 are:

 ['Ab', 'Adelbert', 'Adline', 'Adolf', 'Adolphus', 'Agustus', 'Albertina', 'Albertine', 'Albertus', 'Alcide', 'Alda', 'Alf', 'Algie', 'Almeda', 'Almer', 'Almon', 'Alonza', 'Alpheus', 'Altha', 'Alvah', 'Alvena', 'Alverta', 'Alvia', 'Anner', 'Annis', 'Araminta', 'Arch', 'Ardelia', 'Ardella', 'Arminta', 'Arther', 'Arvid', 'Arvilla', 'Asberry', 'Asbury', 'Aurthur', 'Auther', 'Author', 'Authur', 'Babe', 'Ballard', 'Bama', 'Barnett', 'Barney', 'Bart', 'Bedford', 'Bena', 'Benjaman', 'Benjamine', 'Bertie', 'Berton', 'Bertram', 'Bess', 'Besse', 'Bettie', 'Bird', 'Birt', 'Birtha', 'Birtie', 'Blanch', 'Budd', 'Buford', 'Bulah', 'Burley', 'Burr', 'Butler', 'Byrd', 'Caddie', 'Celie', 'Ceylon', 'Chalmers', 'Chancy', 'Charlotta', 'Chas', 'Chin', 'Christena', 'Cicero', 'Cinda', 'Clarance', 'Clarinda', 'Classie', 'Claud', 'Claudie', 'Claudine', 'Claus', 'Clemie', 'Clemmie', 'Cloyd', 'Clyda', 'Colonel', 'Commodore', 'Concepcion', 'Corda', 'Cordella', 'Cordia', 'Cordie', 'Corine', 'Cornelious', 'Creola', 'Dee', 'Dellar', 'Dewitt', 'Dicie', 'Dick', 'Dillard', 'Dillie', 'Docia', 'Dock', 'Doctor', 'Dolphus', 'Donie', 'Dosha', 'Doshia', 'Doshie', 'Dow', 'Drucilla', 'Drusilla', 'Duff', 'Earle', 'Early', 'Edd', 'Edmonia', 'Ednah', 'Edyth', 'Effa', 'Eldora', 'Electa', 'Elick', 'Eliga', 'Eligah', 'Elige', 'Elizebeth', 'Ellar', 'Elmire', 'Elmo', 'Elmore', 'Elzie', 'Emmer', 'Eola', 'Ephriam', 'Erasmus', 'Erastus', 'Ernst', 'Estell', 'Estie', 'Etha', 'Ethelyn', 'Ethyl', 'Etna', 'Etter', 'Fayette', 'Ferd', 'Finis', 'Firman', 'Flem', 'Fleming', 'Florance', 'Florida', 'Flossie', 'Floy', 'Freda', 'Friend', 'Frona', 'Fronie', 'Fronnie', 'Garfield', 'Gee', 'Gorge', 'Gottlieb', 'Green', 'Guss', 'Gussie', 'Gust', 'Gustavus', 'Hamp', 'Handy', 'Hardie', 'Harl', 'Harrie', 'Harriette', 'Harve', 'Hassie', 'Hedwig', 'Helma', 'Hence', 'Hermann', 'Hermine', 'Hessie', 'Hester', 'Hettie', 'Hillard', 'Hilliard', 'Hilma', 'Holmes', 'Hortense', 'Hosteen', 'Hoy', 'Hulda', 'Huldah', 'Hyman', 'Icy', 'Idell', 'Irven', 'Isham', 'Izora', 'Jennette', 'Josiephine', 'Junious', 'Kittie', 'Kizzie', 'Knute', 'Lafe', 'Lavenia', 'Lavonia', 'Lawyer', 'Leatha', 'Leitha', 'Lem', 'Lemma', 'Lessie', 'Letha', 'Letitia', 'Letta', 'Lew', 'Lewie', 'Liddie', 'Lidie', 'Lige', 'Liller', 'Lillis', 'Lissie', 'Littie', 'Lollie', 'Loma', 'Lon', 'Lonie', 'Lotta', 'Louetta', 'Loula', 'Louvenia', 'Lovisa', 'Ludie', 'Lue', 'Lugenia', 'Lular', 'Lulie', 'Lum', 'Lutie', 'Luvenia', 'Madge', 'Madie', 'Madora', 'Mal', 'Malvina', 'Mammie', 'Manda', 'Manerva', 'Manervia', 'Manford', 'Manie', 'Manley', 'Manly', 'Margaretta', 'Margeret', 'Mart', 'Mat', 'Math', 'Matie', 'Maymie', 'Media', 'Melville', 'Mertie', 'Merton', 'Meta', 'Mettie', 'Mignon', 'Milford', 'Mima', 'Minda', 'Miner', 'Minervia', 'Minor', 'Minta', 'Mintie', 'Missouri', 'Mittie', 'Mont', 'Mortimer', 'Morton', 'Mozella', 'Myrta', 'Myrtie', 'Myrtle', 'Nan', 'Nannie', 'Nat', 'Nealie', 'Neppie', 'Nimrod', 'Nolie', 'Nonie', 'Norval', 'Ocie', 'Octave', 'Oda', 'Olie', 'Omie', 'Onie', 'Oral', 'Orie', 'Orilla', 'Orley', 'Orpha', 'Orrie', 'Orval', 'Osie', 'Ossie', 'Otelia', 'Otha', 'Otho', 'Ottie', 'Pansy', 'Paralee', 'Pat', 'Pattie', 'Pearlie', 'Perley', 'Permelia', 'Pink', 'Pinkey', 'Pinkie', 'Pinkney', 'Pleas', 'Pleasant', 'Ples', 'Pollie', 'Press', 'Redden', 'Rella', 'Reta', 'Rice', 'Rillie', 'Roena', 'Rolla', 'Rosia', 'Rube', 'Sannie', 'Sim', 'Squire', 'Sudie', 'Sybilla', 'Sylvanus', 'Tella', 'Tempie', 'Tennie', 'Teressa', 'Theda', 'Theophile', 'Thursa', 'Tilmon', 'Tishie', 'Tobe', 'Ula', 'Vannie', 'Venie', 'Verdie', 
'Vern', 'Verne', 'Vernie', 'Vertie', 'Viney', 'Virgie', 'Volney', 'Wash', 'Watt', 'Weaver', 'Wilburn', 'Wilda', 'Wilhelmine', 'Willam', 'Wm', 'Wong', 'Wood', 'Woodie', 'Worthy', 'Yee', 'Yetta', 'Zilpha']

C. Names given to both sexes

Let's extract the names that were given to both male and female babies and put them in a list named both.

s1 = set(names[names.sex == 'F']['name'])
s2 = set(names[names.sex == 'M']['name'])

both = list(s1.intersection(s2))
print(f"{len(both)} names in our dataset were given to both boys and girls. \n\n Here's a sample:\n")

print(" | ".join(list(both)[:100]))
10663 names in our dataset were given to both boys and girls. 

 Here's a sample:

Ovis | Honor | Honore | Brandyn | Dorcus | Michel | Million | Donavan | Kemonie | Alie | Shalom | Sagan | Maeson | Kharis | Tenny | Daron | Madeline | Torey | Dandy | Nasir | Lavorn | Tracey | Ali | Kanya | Huey | Hershel | Kiren | Unnamed | Mose | Dakari | Navid | Krystian | Koi | Giavonni | Justine | Kamar | Derrell | Rajdeep | Leyton | Mong | Seattle | Vader | Anias | Camaree | Zimri | Khamauri | Jireh | Choyce | Cleo | Marki | Keontay | Shakari | Amarie | Jrue | Refugia | Luster | Levorn | Dillon | Joie | Dalas | Son | Graham | Samie | Larson | Kylynn | Kearney | Dejah | Teegan | Jahzel | Brunell | Delijah | Asiyah | Taelin | Rafa | Edward | Essie | Kenzie | Kwan | Honour | Britney | Shirl | Dara | Berthal | Keymoni | Madelyn | Dorey | Trinell | Maury | Skyylar | Courtnay | Lyncoln | Modie | Octavious | Jaice | Snow | Domique | Eva | Jeremy | Kobee | Sae

Upon a more careful examination of the whole output, some interesting observations can be made.

For instance, I would expect a name like Lucille to be given only to girls (or to B.B. King's guitar), but according to our result, it was also given to boys.

On the other hand, while to me the name Basil may evoke a cantankerous hotel manager in Torquay, apparently, some parents thought it the perfect moniker for their baby girl. Similarly, the name Sylvester was given to both boys and girls.

Let's have a closer look.

names[names.name == 'Lucille'].head()
name sex births year
248 Lucille F 40 1880-01-01
223 Lucille F 48 1881-01-01
170 Lucille F 85 1882-01-01
220 Lucille F 66 1883-01-01
188 Lucille F 94 1884-01-01

In how many years does the name Lucille appear for either boys or girls?

names[names.name == 'Lucille']['sex'].value_counts()
F    138
M     49
Name: sex, dtype: int64

Note:

Note that this is not the number of babies of each sex with that name.

Each row of the data set corresponds to a (name, sex, year) combination, so this count is the number of years in which at least five babies of a given sex were named Lucille (5 being the yearly threshold below which a name isn't recorded in the data set).
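
A quick way to convince ourselves of that threshold is to look at the smallest birth count recorded anywhere in the data set (if the description above is right, this should come out as 5).

names.births.min()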

We can compute the same information for other names.

names[names.name == 'Basil']['sex'].value_counts()
M    138
F     13
Name: sex, dtype: int64
names[names.name == 'Sylvester']['sex'].value_counts()
M    138
F     62
Name: sex, dtype: int64
names[names.name == 'Hannah']['sex'].value_counts()
F    138
M     38
Name: sex, dtype: int64

Let’s get the total number of babies named Basil for each year.

basil = names[names.name == 'Basil'].pivot_table('births', 
                                                 index='year', 
                                                 columns='sex', 
                                                 fill_value=0, 
                                                 aggfunc=np.sum )
basil.head()
sex F M
year
1880-01-01 0 11
1881-01-01 0 9
1882-01-01 0 12
1883-01-01 0 8
1884-01-01 0 11

Let’s filter out the years where no girl was named Basil.

basil[basil.F != 0]
sex F M
year
1919-01-01 7 262
1927-01-01 5 249
2006-01-01 7 64
2008-01-01 5 31
2009-01-01 7 56
2010-01-01 11 47
2011-01-01 7 44
2012-01-01 20 45
2013-01-01 21 56
2014-01-01 17 44
2015-01-01 16 52
2016-01-01 22 60
2017-01-01 24 58

Note that we have 13 years, in agreement with the output of value_counts above.

Let’s visualise this for a couple of names. I’m using a log scale because otherwise only the data for the dominant gender would be visible.

basil.plot(figsize=(12, 6), logy=True, marker='o', lw=0, 
           color=('red', 'green'), title='Basil', 
           alpha=0.5);

png

We can daisy chain the name selection and the plotting.

(names[names.name == 'Christina']
 .pivot_table('births', 
              index='year', 
              columns='sex', 
              fill_value=0, 
              aggfunc='sum')
 .plot(figsize=(12, 6),
       logy=True, marker='o', lw=0,
       color=('r', 'g'), 
       title='Christina',
       alpha=0.5)
);

png

Let’s put that code into a function to make it more convenient and follow the DRY principle.

def timeline_name(name='Basil', use_log=True):
    ax = (names[names.name == name]
            .pivot_table('births', 
                         index='year', 
                         columns='sex', 
                         fill_value=0, 
                         aggfunc=np.sum)
            .plot(figsize=(10, 6),
                  logy=use_log, 
                  marker='o', 
                  lw=1,
                  color=('r', 'g'), 
                  alpha=0.5)
    )
    ax.set_xlabel('year', fontsize=14)
    ax.set_ylabel('count', fontsize=14)
    ax.set_title(name, fontsize=16)
timeline_name('Lucille')

png

timeline_name('Hannah')

png

D. The Seinfeld effect?

Note that a plausible explanation of why the dominant gender for a given name may have changed over the years can sometimes be found relatively easily, as in the case of the name Ashley...

For other names, a change in popularity may be due to a less conventional reason...

In an episode of Seinfeld George mentions to his friends that should he ever have a kid, he'd name him (or her) Seven, after the jersey number of Yankee baseball player Mickey Mantle.

First, let’s check whether “Seven” is in our data set.

'Seven' in names.name.unique()
True

Giddy up!

The Seinfeld episode was aired in 1996.

Let’s see if there’s any indication that it may have influenced the popularity of the name “Seven”.

First let’s create a date range over which to plot the data.

seinfeldYear = '1996'
start = '1991'
end   = '2018'

# Let's create a range of dates from start to end.
date_range = pd.date_range(start, end, freq='A-JAN')
date_range
DatetimeIndex(['1991-01-31', '1992-01-31', '1993-01-31', '1994-01-31',
               '1995-01-31', '1996-01-31', '1997-01-31', '1998-01-31',
               '1999-01-31', '2000-01-31', '2001-01-31', '2002-01-31',
               '2003-01-31', '2004-01-31', '2005-01-31', '2006-01-31',
               '2007-01-31', '2008-01-31', '2009-01-31', '2010-01-31',
               '2011-01-31', '2012-01-31', '2013-01-31', '2014-01-31',
               '2015-01-31', '2016-01-31', '2017-01-31'],
              dtype='datetime64[ns]', freq='A-JAN')

Let’s visualise the trend for “Seven”.

# The data
data = (names[names.name == 'Seven']
            .pivot_table('births', 
                         index='year', 
                         columns='sex', 
                         fill_value=0, 
                         aggfunc=np.sum)
       )

# The base plot
ax = data.plot(figsize=(12, 8), 
          logy=False, 
          marker='o', 
          color=('r', 'g'), 
          alpha=0.5, 
          title='Seven', 
          grid=False)

# The vertical strip
ax.axvline(seinfeldYear, ymin=0, ymax=1, lw=15, color='orange', alpha=0.5)  # ymin/ymax are axes fractions
ax.set_xticks(date_range)
ax.set_xlim([start, end])

# Ensure that the labels on the x axis match the years.
ax.set_xticklabels(date_range.year, rotation=0, ha='center' )

# Annotate the figure with some text by specifying location using xy parameter
ax.annotate('Episode "The Seven" is aired', 
            xy=(pd.to_datetime(seinfeldYear, format="%Y"), 100),
            xycoords='data',
            rotation=90,
            horizontalalignment='center'
            )

plt.show()

png

Note that this does not prove that the show "created" the name fad (correlation ≠ causation and all that; there could be another reason behind both the increase in popularity of the name, and the use of the name in the show), but it does seem to indicate that it enhanced it...

Poor kids! Serenity now!

E. Mom & Pop culture

While we’re on this topic, can we find other possible influences of pop culture in the data set?

Let’s create a general function that we can use to plot various trends.

def plot_name_event(data = names, name="Seven",
                    year='1996', start='1968', end='2013', 
                    event = '', freq='A-JAN'):    
    
    date_range   = pd.date_range(start, end, freq=freq)
    
    data = (data[data.name == name]
            .pivot_table('births', 
                         index='year', 
                         columns='sex', 
                         fill_value=0, 
                         aggfunc='sum')
           )
    ax = data.plot(figsize=(14, 6), 
                   logy=False, 
                   marker='o', 
                   color=('r', 'g'), 
                   alpha=0.5, 
                   title = f"Name: {name} | {event} in {year}", 
                   grid=False)
    
    ax.axvline(year, ymin=0, ymax=1, lw=15, color='orange', alpha=0.5)  # ymin/ymax are axes fractions
    ax.set_xticks(date_range)
    ax.set_xlim([start, end])
    ax.set_ylim([0, data.loc[data.index <= date_range[-1]].max().max()*1.1])

    ax.set_xticklabels(date_range.year, rotation=90, ha='center' )
    plt.show()

First, let’s look at a few movie characters/actors.

plot_name_event(name="Neo", year='1999', start='1990', 
                event='The movie The Matrix is released')

png

plot_name_event(name="Trinity", year='1999', start='1990', 
                event='The movie The Matrix is released')

png

How about further back in time?

Errol Flynn got his big break in Captain Blood back in 1935. Let’s see if all that swashbuckling influenced parents when it came to naming their progeny.

plot_name_event(name="Errol",  year='1935', start='1931', end='1961', 
                event='The movie Captain Blood is released')

png

Another possible Hollywood influence around the same time.

plot_name_event(name="Hedy",  year='1938', start='1915', end='1971', 
                event="The actress Hedy Lamarr made her Hollywood debut")

png

Earlier still

plot_name_event(name="Greta",  year='1925', start='1910', end='1981', 
                event="The actress Greta Garbo made her Hollywood debut")

png

Of course, we can’t talk about movies without talking about Star Wars.

Let’s see if we can track the names of some of the characters.

plot_name_event(name="Leia",  year='1977', start='1970', end='1995', 
                event='The movie Star Wars is released')

png

plot_name_event(name="Han",  year='1977', start='1970', end='2015', 
                event='The movie Star Wars is released')

png

plot_name_event(name="Lando",  year='1977', start='1970', end='2015', 
                event='The movie Star Wars is released')

png

Hmmmm… A bit surprising.

What about other characters from the trilogy?

Fortunately (for the kids) there doesn’t seem to be any Jabba or Jar Jar in our list… (TBH, I’m mildly disappointed).

characters = {"jabba", "yoda", "chewbacca", "jar jar"}

set(names.name.str.lower().unique()) & characters
set()

Let’s look at the main character of another popular movie series.

plot_name_event(name="Indiana",  year='1981', start='1970', end='2015', 
                event="The movie Raiders Of The Lost Ark was released")

png

While on this topic…

plot_name_event(name="Harrison",  year='1981', start='1975', end='2015', 
                event="The movie Raiders Of The Lost Ark was released")

png

In a different genre we also have:

plot_name_event(name="Clint",  year='1966', start='1950', end='1981', 
                event="The movie The Good, the Bad, and the Ugly was released")

png

We’ve focused on movies, but of course, popular culture’s influence on baby names isn’t limited to movies or television.

Songs or singers can also contribute to the popularity of a given name.

Here are a couple of examples.

plot_name_event(name="Elvis",  year='1955', start='1945', end='1991', 
                event="Radios start playing Elvis Presley's songs")

png

plot_name_event(name="Michelle",  year='1965', start='1960', end='1981', 
                event='The song Michelle is released by The Beatles')

png

plot_name_event(name="Jermaine",  year='1970', start='1960', end='1990', 
                event="The Jackson 5 top the charts")

png

… And a couple of more recent ones

plot_name_event(name="Beyonce",  year='1998', start='1995', end='2015', 
                event="The group Destiny's Child released its debut album")

png

Same with sport.

plot_name_event(name="Kobe",  year='1996', start='1988', end='2015',
                event="NBA player Kobe Bryant made his debut in the league")

png

I’ll stop here, but I’m sure many other interesting patterns can be found in this data set…

As I mentioned earlier, the fact that a name’s increase in popularity coincides with a particular event isn’t enough to demonstrate a causal relationship, however for some of these examples the coincidence is certainly interesting.

In any case, I’d like to think that hundreds, if not thousands, of people owe their moniker to George Costanza.

A sobering thought.

That’s ok. It could have been worse…

"Mulva" in names.name.unique()
False

I had to check!