Polyglot Data Wrangling

SQL vs Python vs R

In this post we will look at how basic data wrangling operations can be performed in three languages commonly used in data science:

  • Structured Query Language (SQL)
  • Python
  • R

Note that while Python and R have obvious similarities in their approach to data wrangling, SQL is designed to work with relational data, where most of the time-consuming operations are (ideally) performed on the database side rather than in the client’s memory. Both Python and R can process data in chunks, or even work with huge, distributed data sets through libraries like Dask (Python only) or Apache Spark (Python, R, Scala, Java). However, we’re mostly interested in illustrating some basic data wrangling pipelines in all three languages, so we’ll be using a small database that can easily fit in memory.

The Jupyter notebook allows us to run all our examples in one place. We’ll run an IPython kernel as our base platform, use an IPython magic command to run our R code within the same environment, and use sqlalchemy to run queries on a PostgreSQL database.

We won’t be covering the basics of each language in this post. There are many excellent resources online. For data wrangling the Python Data Science Handbook is an outstanding book. For R it’s hard to get past the wonderful R for Data Science by Garrett Grolemund and Hadley Wickham. For SQL there are more resources than we can mention, especially once we factor in the various vendor-specific flavours of SQL. One of my personal favourites is SQL Fundamentals for BI by Jeffrey James. Our SQL examples are based on some of the contents of that course.

We start by loading the required Python libraries (we’ll do all our data wrangling using the awesome pandas module) and create a connection to the PostgreSQL database using sqlalchemy. For our examples we’ll be using the well-known dvdrental database, which contains 15 tables representing various aspects of a DVD rental business.

# Connect to a SQL (PostgreSQL) database
from sqlalchemy import create_engine
db = 'postgresql:///dvdrental'
con = create_engine(db, echo=False)

# Import pandas for data wrangling in python
import pandas as pd
import warnings
warnings.simplefilter('ignore')

# use an IPython magic command to load R support into our environment.
%load_ext rpy2.ipython

from IPython.core.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

All our R code will be preceded by the %%R magic command, which runs it on the R interpreter from within our Jupyter notebook. We start by loading the tidyverse library (which is in fact a collection of libraries), and setting an option to improve how some outputs are rendered in the notebook.

%%R
library(tidyverse)
options(crayon.enabled = FALSE)

Because we’re running on top of an IPython kernel, we’ll write our SQL queries as strings and use pandas to run them on our database via sqlalchemy.
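For example, a one-off query can be run like this (a minimal sketch of the pattern used throughout, relying on the con connection created above):

q = 'SELECT COUNT(*) FROM payment;'
pd.read_sql_query(q, con)   # the result comes back as a pandas dataframe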


The data

First, let’s see what tables are in our DVD rental database.

q = '''
SELECT * FROM pg_catalog.pg_tables;
'''
df = pd.read_sql_query(q, con)
tables = df[~(df.tablename.str.startswith('pg_')|df.tablename.str.startswith('sql_'))].tablename.values
print(tables)
['category' 'country' 'film_category' 'language' 'inventory' 'actor'
 'staff' 'payment' 'rental' 'store' 'city' 'film' 'address' 'film_actor'
 'customer']

For convenience, let’s create a Python dictionary with all the tables as dataframes.

Obviously, this means that we’re loading all the tables into memory, which is not what one would typically do with a SQL database. However, since our database is small and we’re only interested in comparing the wrangling syntax across languages, this is fine for our purpose.

tbl = {table: pd.read_sql_query(f'SELECT * FROM {table};', 
                                con, parse_dates=['payment_date']) 
       for table in tables}

Let’s look at a few rows of the payment table.

payment = tbl['payment']
payment.head()
payment_id customer_id staff_id rental_id amount payment_date
0 17503 341 2 1520 7.99 2007-02-15 22:25:46.996577
1 17504 341 1 1778 1.99 2007-02-16 17:23:14.996577
2 17505 341 1 1849 7.99 2007-02-16 22:41:45.996577
3 17506 341 2 2829 2.99 2007-02-19 19:39:56.996577
4 17507 341 2 3130 7.99 2007-02-20 17:31:48.996577

We can “pass” the payment dataframe to R in the following way:

%%R -i payment
head(payment)
  payment_id customer_id staff_id rental_id amount        payment_date
0      17503         341        2      1520   7.99 2007-02-15 22:25:46
1      17504         341        1      1778   1.99 2007-02-16 17:23:14
2      17505         341        1      1849   7.99 2007-02-16 22:41:45
3      17506         341        2      2829   2.99 2007-02-19 19:39:56
4      17507         341        2      3130   7.99 2007-02-20 17:31:48
5      17508         341        1      3382   5.99 2007-02-21 12:33:49

Now that we’ve got all the tables loaded, let’s run a few queries.

Let’s start with something simple.


Which movie rentals were below a dollar for customers with an id above 500?

▶ SQL

There’s nothing fancy about this query. We’re selecting a few features (columns) from the payment table and filtering on the dollar amount and customer id before ordering the rows.

q = '''
SELECT 
    p.payment_id, p.customer_id, p.amount 
FROM 
    payment p
WHERE 
    p.amount < 1 AND p.customer_id > 500
ORDER BY 1 ASC, 2 DESC, 3 ASC;
'''
df_sql = pd.read_sql_query(q, con)     # run the query on the database
print(df_sql.shape)                    # output the shape of the resulting table
df_sql.head()                          # display the first few rows
(413, 3)
payment_id customer_id amount
0 18121 502 0.99
1 18122 502 0.99
2 18125 503 0.99
3 18134 506 0.99
4 18144 510 0.99

▶ Python

Pandas allows us to chain methods on a dataframe; however, writing everything on a single line is cumbersome for long pipelines.

We can write each step of the pipeline on its own line by wrapping the pipeline in parentheses.

df = (payment[(payment.amount < 1) & (payment.customer_id > 500)]  # select rows based on conditions
         .loc[:, ['payment_id','customer_id','amount']]            # select 3 columns
         .sort_values(by=['payment_id', 'customer_id', 'amount'],  # sort the rows according to the values  
                      ascending=[True, False, True])               #      in the columns
         .reset_index(drop=True)                                   # not really required but looks nicer
        )
print(df.shape)
df.head()
(413, 3)
payment_id customer_id amount
0 18121 502 0.99
1 18122 502 0.99
2 18125 503 0.99
3 18134 506 0.99
4 18144 510 0.99

▶ R

For this simple query the R code is similar to the Python one. The pipe symbol %>% plays the same role as the dot in pandas: it lets us chain operations together.

The %%R -o df_r part of the code is a way to transfer the R dataframe back to our Python interpreter.

%%R -o df_r
df_r = payment %>% 
        filter(amount < 1, customer_id > 500) %>% 
        select(payment_id, customer_id, amount) %>% 
        arrange(payment_id, desc(customer_id), amount)

Back in Python, we can inspect the exported dataframe.

df_r.shape
df_r.head()
payment_id customer_id amount
0 18121 502 0.99
1 18122 502 0.99
2 18125 503 0.99
3 18134 506 0.99
4 18144 510 0.99

Sanity check

Just to make sure that we’ve run our query correctly, we can compare the result of the SQL query with the results of the Python and R pipelines. Because R indexing starts at 1 instead of 0, we need to reset the index of the R dataframe after we export it to Python.

all(df==df_sql), all(df_r==df_sql)
(True, True)

Any other trivial query could be handled in a similar way…

Let’s look at a query where we need to extract a feature that doesn’t appear explicitly in the original data.


Which transactions occurred in the month of March only?

The payment table has a payment_date feature from which we will need to extract the month.

▶ SQL

We can extract the month value out of the payment_date column using the EXTRACT function.

q = '''
SELECT 
    p.payment_id, p.customer_id cust_id, p.amount, p.payment_date
FROM 
    payment p
WHERE 
    EXTRACT(month FROM p.payment_date) = 3
    AND p.amount < 1
ORDER BY 
    cust_id DESC, 3 ASC;
'''
df_sql = pd.read_sql_query(q, con)
print(df_sql.shape)
df_sql.head()
(1021, 4)
payment_id cust_id amount payment_date
0 22611 598 0.99 2007-03-02 15:51:25.996577
1 22613 598 0.99 2007-03-21 07:09:41.996577
2 22606 597 0.99 2007-03-19 05:51:32.996577
3 22589 596 0.99 2007-03-01 20:50:37.996577
4 22598 596 0.99 2007-03-22 21:49:07.996577

▶ Python

When we loaded the payment table from the database using sqlalchemy, the payment_date feature was correctly parsed as a datetime object.

This allows us to use the pandas dt accessor to extract a variety of date and time related features, including the month.
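The month is just one of many components exposed by this accessor; a few illustrative examples (not needed for the query below):

payment.payment_date.dt.year.head()       # calendar year
payment.payment_date.dt.day.head()        # day of the month
payment.payment_date.dt.dayofweek.head()  # day of the week, Monday=0 ... Sunday=6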

df = (payment[(payment.amount < 1) & (payment.payment_date.dt.month==3)]
       .rename({'customer_id': 'cust_id'}, axis=1)
       .reindex(['payment_id','cust_id','amount','payment_date'], axis=1)
       .sort_values(['cust_id', 'amount'], ascending=[False, True])
       .reset_index(drop=True)
     )
      
print(df.shape)
df.head()
(1021, 4)
payment_id cust_id amount payment_date
0 22611 598 0.99 2007-03-02 15:51:25.996577
1 22613 598 0.99 2007-03-21 07:09:41.996577
2 22606 597 0.99 2007-03-19 05:51:32.996577
3 22589 596 0.99 2007-03-01 20:50:37.996577
4 22592 596 0.99 2007-03-17 15:08:26.996577

▶ R

A great way to handle datetime objects in R is via the lubridate library. To make it clear when we’re using lubridate, we won’t load the library; instead we’ll call its functions using the lubridate:: prefix.

%%R -o df_r
df_r = payment %>% 
        filter(amount < 1, lubridate::month(payment$payment_date)==3) %>% 
        rename(cust_id=customer_id) %>% 
        select(-rental_id, -staff_id) %>% 
        arrange(desc(cust_id), amount)

Back in Python:

print(df_r.shape)
df_r.head()
(1021, 4)
payment_id cust_id amount payment_date
0 22611 598 0.99 2007-03-02 15:51:25+11:00
1 22613 598 0.99 2007-03-21 07:09:41+11:00
2 22606 597 0.99 2007-03-19 05:51:32+11:00
3 22589 596 0.99 2007-03-01 20:50:37+11:00
4 22592 596 0.99 2007-03-17 15:08:26+11:00

Here we get a glimpse into the infinite joy of working with datetime objects.

During the transfer from Python to R and back to Python, our payment_date format has changed slightly.

Instead of worrying about fixing this we’ll just exclude that column when comparing the R result with the SQL/Python ones.

Henceforth, we won’t worry about this issue. We’ll simply check that the dataframes’ shapes match and that the first few rows are consistent.

all(df==df_sql), all(df.iloc[:, :-1]==df_r.iloc[:, :-1])
(True, True)

Which day saw the most business (largest number of transactions)?

This is a common operation. We need to group our payments by date, count how many transactions occurred on each date, and order the results accordingly.

▶ SQL

q = '''
SELECT 
    p.payment_date::date, COUNT(*)
FROM 
    payment p
GROUP BY 1
ORDER BY 2 DESC
'''
df_sql = pd.read_sql_query(q, con)
print(df_sql.shape)
df_sql
(32, 2)
payment_date count
0 2007-04-30 1311
1 2007-03-01 676
2 2007-03-21 673
3 2007-04-27 643
4 2007-04-29 640
5 2007-03-19 631
6 2007-04-28 627
7 2007-03-18 624
8 2007-03-22 621
9 2007-03-20 611
10 2007-03-02 595
11 2007-03-17 584
12 2007-03-23 557
13 2007-04-08 516
14 2007-04-09 514
15 2007-04-06 486
16 2007-04-10 482
17 2007-04-07 472
18 2007-04-11 468
19 2007-04-12 452
20 2007-02-19 310
21 2007-02-15 308
22 2007-02-18 302
23 2007-02-20 291
24 2007-02-17 283
25 2007-02-16 282
26 2007-02-21 213
27 2007-05-14 182
28 2007-04-26 79
29 2007-03-16 72
30 2007-04-05 64
31 2007-02-14 27

▶ Python

The Python pipeline mimics the SQL query rather closely.

df = (payment
       .groupby(payment.payment_date.dt.date)['amount']
       .count().rename('count')
       .sort_values(ascending=False)
       .reset_index()
     )
df
payment_date count
0 2007-04-30 1311
1 2007-03-01 676
2 2007-03-21 673
3 2007-04-27 643
4 2007-04-29 640
5 2007-03-19 631
6 2007-04-28 627
7 2007-03-18 624
8 2007-03-22 621
9 2007-03-20 611
10 2007-03-02 595
11 2007-03-17 584
12 2007-03-23 557
13 2007-04-08 516
14 2007-04-09 514
15 2007-04-06 486
16 2007-04-10 482
17 2007-04-07 472
18 2007-04-11 468
19 2007-04-12 452
20 2007-02-19 310
21 2007-02-15 308
22 2007-02-18 302
23 2007-02-20 291
24 2007-02-17 283
25 2007-02-16 282
26 2007-02-21 213
27 2007-05-14 182
28 2007-04-26 79
29 2007-03-16 72
30 2007-04-05 64
31 2007-02-14 27

▶ R

We follow the same pattern for our R code. The conversion to a dataframe at the end (using as.data.frame) is simply to get a nicer formatting of the output.

%%R 
df_r = payment %>% 
        group_by(payment_date=lubridate::date(payment$payment_date)) %>% 
        summarise(n = n()) %>% 
        arrange(desc(n)) %>%
        as.data.frame()
df_r
   payment_date    n
1    2007-04-30 1311
2    2007-03-01  676
3    2007-03-21  673
4    2007-04-27  643
5    2007-04-29  640
6    2007-03-19  631
7    2007-04-28  627
8    2007-03-18  624
9    2007-03-22  621
10   2007-03-20  611
11   2007-03-02  595
12   2007-03-17  584
13   2007-03-23  557
14   2007-04-08  516
15   2007-04-09  514
16   2007-04-06  486
17   2007-04-10  482
18   2007-04-07  472
19   2007-04-11  468
20   2007-04-12  452
21   2007-02-19  310
22   2007-02-15  308
23   2007-02-18  302
24   2007-02-20  291
25   2007-02-17  283
26   2007-02-16  282
27   2007-02-21  213
28   2007-05-14  182
29   2007-04-26   79
30   2007-03-16   72
31   2007-04-05   64
32   2007-02-14   27

How much did each customer spend and when?

We can answer this question in a number of ways. Let’s return a sequence of transaction dates for each customer. In SQL we can do that with an array.

▶ SQL

q = '''
SELECT 
    p.customer_id, 
    SUM(p.amount) total, 
    ARRAY_AGG(p.payment_date::date) dates
FROM 
    payment p
GROUP BY 1
ORDER BY 2 DESC
'''

df_sql = pd.read_sql_query(q, con)
print(df_sql.shape)
df_sql.head()
(599, 3)
customer_id total dates
0 148 211.55 [2007-02-15, 2007-02-15, 2007-02-19, 2007-02-1...
1 526 208.58 [2007-02-15, 2007-02-16, 2007-02-17, 2007-02-1...
2 178 194.61 [2007-02-15, 2007-02-15, 2007-02-16, 2007-02-1...
3 137 191.62 [2007-02-18, 2007-02-19, 2007-02-20, 2007-02-2...
4 144 189.60 [2007-02-16, 2007-02-17, 2007-02-19, 2007-02-2...

▶ Python

In Python we’ll gather the transaction dates for each customer as a list.

We can build that list using a lambda (anonymous) function during the aggregation step.

df = (payment
       .groupby('customer_id')[['amount', 'payment_date']]
       .agg({
               'amount':'sum',
               'payment_date': lambda x: list(x.dt.date)
            })
       .rename(columns={'amount':'total', 'payment_date': 'date'})
       .sort_values('total', ascending=False)
       .reset_index()
      )
print(df.shape)
df.head()
(599, 3)
customer_id total date
0 148 211.55 [2007-02-15, 2007-02-15, 2007-02-19, 2007-02-1...
1 526 208.58 [2007-02-15, 2007-02-16, 2007-02-17, 2007-02-1...
2 178 194.61 [2007-02-15, 2007-02-15, 2007-02-16, 2007-02-1...
3 137 191.62 [2007-02-18, 2007-02-19, 2007-02-20, 2007-02-2...
4 144 189.60 [2007-02-16, 2007-02-17, 2007-02-19, 2007-02-2...

▶ R

For this query the R code is similar to the Python one (syntactic idiosyncrasies aside).

%%R 
df_r = payment %>% 
      mutate(payment_date = as.Date(payment_date)) %>% 
      group_by(customer_id) %>% 
      summarise(total=sum(amount), dates=list(payment_date)) %>% 
      arrange(desc(total))

print(dim(df_r))
print(head(df_r, 5))
[1] 599   3
# A tibble: 5 x 3
  customer_id total dates      
        <int> <dbl> <list>     
1         148  212. <date [45]>
2         526  209. <date [42]>
3         178  195. <date [39]>
4         137  192. <date [38]>
5         144  190. <date [40]>

At first glance it looks like the results differ from the SQL/Python ones, but this is purely cosmetic. The R data structure (actually a tibble) rounds the floating-point values and abbreviates the lists when displaying the results. We could convert the tibble to an R dataframe, but the output would be unwieldy. Instead, we can unpack the first row to show that the data has been parsed correctly.

%%R
print(df_r$total[1]) 
print(df_r$dates[1])
[1] 211.55
[[1]]
 [1] "2007-02-15" "2007-02-15" "2007-02-19" "2007-02-19" "2007-02-19"
 [6] "2007-03-01" "2007-03-02" "2007-03-17" "2007-03-17" "2007-03-18"
[11] "2007-03-18" "2007-03-18" "2007-03-20" "2007-03-20" "2007-03-20"
[16] "2007-03-21" "2007-03-21" "2007-03-21" "2007-03-21" "2007-03-22"
[21] "2007-03-22" "2007-03-22" "2007-03-22" "2007-04-05" "2007-04-06"
[26] "2007-04-08" "2007-04-08" "2007-04-08" "2007-04-09" "2007-04-10"
[31] "2007-04-11" "2007-04-12" "2007-04-27" "2007-04-27" "2007-04-27"
[36] "2007-04-28" "2007-04-28" "2007-04-28" "2007-04-29" "2007-04-29"
[41] "2007-04-29" "2007-04-29" "2007-04-30" "2007-04-29" "2007-04-30"

Which days of March 2007 recorded no sale?

This may seem like the sort of query we’ve already looked at; however, there is a catch. The database only records transactions that have occurred (it seems obvious when said like that!). This means that dates with no sales aren’t in the database. Therefore we need to generate a series of dates (with a daily frequency) and join our payment data onto it to find the days with no sales.

In PostgreSQL we can use generate_series to do that. Note that we select from it as if it were a table (which we name gs). By joining on it we can “expand” our payment data to include the dates for which no transaction took place. These would of course contain NULL values for the transaction data. By identifying where those NULL rows are we can answer the original question.

▶ SQL

q = '''
SELECT 
    gs::date, p.*
FROM 
    generate_series('2007-03-01', '2007-03-31', INTERVAL '1 Day') gs
LEFT JOIN 
    payment p 
ON 
    p.payment_date::date = gs::date
WHERE 
    payment_date is NULL
'''
df_sql = pd.read_sql_query(q, con).fillna('')['gs']
print(df_sql.shape)
df_sql
(21,)
0     2007-03-03
1     2007-03-04
2     2007-03-05
3     2007-03-06
4     2007-03-07
5     2007-03-08
6     2007-03-09
7     2007-03-10
8     2007-03-11
9     2007-03-12
10    2007-03-13
11    2007-03-14
12    2007-03-15
13    2007-03-24
14    2007-03-25
15    2007-03-26
16    2007-03-27
17    2007-03-28
18    2007-03-29
19    2007-03-30
20    2007-03-31
Name: gs, dtype: object

▶ Python

We could do something similar in Python; however, let’s try a more Pythonic approach and use set operations to extract the dates we want. We simply construct the set of all dates in March 2007 and take the difference with the dates that are already in the database. The moral of the story is that although the Python (and R) syntax can often look SQLish, this is not always the most concise, or even the most suitable, way to get to the answer.

gs = pd.date_range(start='20070301', end='20070331', freq='D')
df = pd.DataFrame(index = set(gs.date) - set(payment.payment_date.dt.date.values)).sort_index()
print(df.shape)
df
(21, 0)
2007-03-03
2007-03-04
2007-03-05
2007-03-06
2007-03-07
2007-03-08
2007-03-09
2007-03-10
2007-03-11
2007-03-12
2007-03-13
2007-03-14
2007-03-15
2007-03-24
2007-03-25
2007-03-26
2007-03-27
2007-03-28
2007-03-29
2007-03-30
2007-03-31

▶ R

Set operations also work in R. The output is a bit strange: we get a list in which each element holds a single date. We could flatten it, but that would only obfuscate the code, so we’ll leave it as is.

%%R
gs = seq(lubridate::ymd('2007-03-01'), lubridate::ymd('2007-03-31'), by = '1 day')
lapply(sort(setdiff(gs, lubridate::date(payment$payment_date))), as.Date, origin='1970-01-01')
[[1]]
[1] "2007-03-03"

[[2]]
[1] "2007-03-04"

[[3]]
[1] "2007-03-05"

[[4]]
[1] "2007-03-06"

[[5]]
[1] "2007-03-07"

[[6]]
[1] "2007-03-08"

[[7]]
[1] "2007-03-09"

[[8]]
[1] "2007-03-10"

[[9]]
[1] "2007-03-11"

[[10]]
[1] "2007-03-12"

[[11]]
[1] "2007-03-13"

[[12]]
[1] "2007-03-14"

[[13]]
[1] "2007-03-15"

[[14]]
[1] "2007-03-24"

[[15]]
[1] "2007-03-25"

[[16]]
[1] "2007-03-26"

[[17]]
[1] "2007-03-27"

[[18]]
[1] "2007-03-28"

[[19]]
[1] "2007-03-29"

[[20]]
[1] "2007-03-30"

[[21]]
[1] "2007-03-31"

How many transactions were there for each day of March 2007?

This is a simple question; we just need to keep in mind that some days had no transactions. We should still output those days with a count of zero. The basic idea is similar to the previous example: we create a time series to “pad out” the missing days in the payment_date feature.

▶ SQL

q = '''
SELECT 
    gs::date date, 
    COUNT(p.*) number_of_transactions
FROM 
    generate_series('2007-03-01', '2007-03-31', INTERVAL '1 Day') gs
LEFT JOIN 
    payment p 
ON 
    p.payment_date::date = gs::date 
GROUP BY 
    gs::date 
'''
df_sql = pd.read_sql_query(q, con)
print(df_sql.shape)
df_sql
(31, 2)
date number_of_transactions
0 2007-03-01 676
1 2007-03-02 595
2 2007-03-03 0
3 2007-03-04 0
4 2007-03-05 0
5 2007-03-06 0
6 2007-03-07 0
7 2007-03-08 0
8 2007-03-09 0
9 2007-03-10 0
10 2007-03-11 0
11 2007-03-12 0
12 2007-03-13 0
13 2007-03-14 0
14 2007-03-15 0
15 2007-03-16 72
16 2007-03-17 584
17 2007-03-18 624
18 2007-03-19 631
19 2007-03-20 611
20 2007-03-21 673
21 2007-03-22 621
22 2007-03-23 557
23 2007-03-24 0
24 2007-03-25 0
25 2007-03-26 0
26 2007-03-27 0
27 2007-03-28 0
28 2007-03-29 0
29 2007-03-30 0
30 2007-03-31 0

▶ Python

Similarly, in Python we can create the series of dates for March 2007 using the pandas date_range function.

gs = pd.date_range(start='20070301', end='20070331', freq='D')
df = (payment
       .assign(payment_date=pd.to_datetime(payment.payment_date).dt.date)
       .groupby('payment_date')
       .agg({'amount':'count'})
       .reindex(gs, fill_value=0)
       .reset_index()
       .rename(columns={'index':'date','amount':'number_of_transactions'})
)

print(df.shape)
df
(31, 2)
date number_of_transactions
0 2007-03-01 676
1 2007-03-02 595
2 2007-03-03 0
3 2007-03-04 0
4 2007-03-05 0
5 2007-03-06 0
6 2007-03-07 0
7 2007-03-08 0
8 2007-03-09 0
9 2007-03-10 0
10 2007-03-11 0
11 2007-03-12 0
12 2007-03-13 0
13 2007-03-14 0
14 2007-03-15 0
15 2007-03-16 72
16 2007-03-17 584
17 2007-03-18 624
18 2007-03-19 631
19 2007-03-20 611
20 2007-03-21 673
21 2007-03-22 621
22 2007-03-23 557
23 2007-03-24 0
24 2007-03-25 0
25 2007-03-26 0
26 2007-03-27 0
27 2007-03-28 0
28 2007-03-29 0
29 2007-03-30 0
30 2007-03-31 0

▶ R

Once again, the pattern for R follows the same principles.

%%R 
gs = data_frame(payment_date=seq(lubridate::ymd('2007-03-01'), 
                                 lubridate::ymd('2007-03-31'), 
                                 by = '1 day'))

df_r = payment %>% 
      mutate(payment_date=lubridate::date(payment_date)) %>% 
      group_by(payment_date) %>% 
      summarise(number_of_transactions=n()) %>% 
      right_join(gs, by='payment_date') %>%
      replace_na(list(number_of_transactions=0))

as.data.frame(df_r)
   payment_date number_of_transactions
1    2007-03-01                    676
2    2007-03-02                    595
3    2007-03-03                      0
4    2007-03-04                      0
5    2007-03-05                      0
6    2007-03-06                      0
7    2007-03-07                      0
8    2007-03-08                      0
9    2007-03-09                      0
10   2007-03-10                      0
11   2007-03-11                      0
12   2007-03-12                      0
13   2007-03-13                      0
14   2007-03-14                      0
15   2007-03-15                      0
16   2007-03-16                     72
17   2007-03-17                    584
18   2007-03-18                    624
19   2007-03-19                    631
20   2007-03-20                    611
21   2007-03-21                    673
22   2007-03-22                    621
23   2007-03-23                    557
24   2007-03-24                      0
25   2007-03-25                      0
26   2007-03-26                      0
27   2007-03-27                      0
28   2007-03-28                      0
29   2007-03-29                      0
30   2007-03-30                      0
31   2007-03-31                      0

Days with no transactions in March 2007 - Alternate version

Let’s revisit the problem we dealt with previously using set operations in Python and R, and illustrate how we could answer the question with a more SQL-like pipeline. We can reuse the code from the previous example; we just need to add a filtering step.

▶ SQL

q = '''
SELECT 
    gs::date date, COUNT(p.*) number_of_transactions
FROM 
    generate_series('2007-03-01', '2007-03-31', INTERVAL '1 Day') gs
LEFT JOIN 
    payment p 
ON 
    p.payment_date::date = gs::date 
GROUP BY 
    gs::date 
HAVING 
    COUNT(p.*) = 0
'''
df_sql = pd.read_sql_query(q, con)
print(df_sql.shape)
df_sql
(21, 2)
date number_of_transactions
0 2007-03-03 0
1 2007-03-04 0
2 2007-03-05 0
3 2007-03-06 0
4 2007-03-07 0
5 2007-03-08 0
6 2007-03-09 0
7 2007-03-10 0
8 2007-03-11 0
9 2007-03-12 0
10 2007-03-13 0
11 2007-03-14 0
12 2007-03-15 0
13 2007-03-24 0
14 2007-03-25 0
15 2007-03-26 0
16 2007-03-27 0
17 2007-03-28 0
18 2007-03-29 0
19 2007-03-30 0
20 2007-03-31 0

▶ Python

gs = pd.date_range(start='20070301', end='20070331', freq='D')
df = (payment
       .assign(payment_date=pd.to_datetime(payment.payment_date).dt.date)
       .groupby('payment_date')
       .agg({'amount':'count'})
       .rename(columns={'amount':'number_of_transactions'})
       .reindex(gs, fill_value=0)
       .query('number_of_transactions == 0')  
)

print(df.shape)
df
(21, 1)
number_of_transactions
2007-03-03 0
2007-03-04 0
2007-03-05 0
2007-03-06 0
2007-03-07 0
2007-03-08 0
2007-03-09 0
2007-03-10 0
2007-03-11 0
2007-03-12 0
2007-03-13 0
2007-03-14 0
2007-03-15 0
2007-03-24 0
2007-03-25 0
2007-03-26 0
2007-03-27 0
2007-03-28 0
2007-03-29 0
2007-03-30 0
2007-03-31 0

▶ R

%%R
gs = data_frame(payment_date=seq(lubridate::ymd('2007-03-01'), 
                                 lubridate::ymd('2007-03-31'), 
                                 by = '1 day'))

df_r = payment %>% 
      mutate(payment_date=lubridate::date(payment_date)) %>% 
      group_by(payment_date) %>% 
      summarise(number_of_transactions=n()) %>% 
      right_join(gs, by="payment_date") %>%
      replace_na(list(number_of_transactions=0)) %>%
      subset(number_of_transactions==0)

print(dim(df_r))
as.data.frame(df_r)
[1] 21  2
   payment_date number_of_transactions
1    2007-03-03                      0
2    2007-03-04                      0
3    2007-03-05                      0
4    2007-03-06                      0
5    2007-03-07                      0
6    2007-03-08                      0
7    2007-03-09                      0
8    2007-03-10                      0
9    2007-03-11                      0
10   2007-03-12                      0
11   2007-03-13                      0
12   2007-03-14                      0
13   2007-03-15                      0
14   2007-03-24                      0
15   2007-03-25                      0
16   2007-03-26                      0
17   2007-03-27                      0
18   2007-03-28                      0
19   2007-03-29                      0
20   2007-03-30                      0
21   2007-03-31                      0

Which movies have never been rented?

So far we’ve only worked with a single table out of the 15 contained in the database. To match movies to rental transactions we will need three new tables: film contains information about the movies, inventory links each movie to an inventory id, and rental uses that inventory id to connect movies to actual rentals.

Let’s have a look at the structures of these tables.

for t in ['inventory', 'payment', 'rental', 'film']:
    q = f'''
SELECT * FROM {t}
LIMIT 1
'''
    df = pd.read_sql_query(q, con)
    print(t)
    display(df)
inventory
inventory_id film_id store_id last_update
0 1 1 1 2006-02-15 10:09:17
payment
payment_id customer_id staff_id rental_id amount payment_date
0 17503 341 2 1520 7.99 2007-02-15 22:25:46.996577
rental
rental_id rental_date inventory_id customer_id return_date staff_id last_update
0 2 2005-05-24 22:54:33 1525 459 2005-05-28 19:40:33 1 2006-02-16 02:30:53
film
film_id title description release_year language_id rental_duration rental_rate length replacement_cost rating last_update special_features fulltext
0 133 Chamber Italian A Fateful Reflection of a Moose And a Husband ... 2006 1 7 4.99 117 14.99 NC-17 2013-05-26 14:50:58.951 [Trailers] 'chamber':1 'fate':4 'husband':11 'italian':2 ...

▶ SQL

Let’s join the tables, count how many times each movie has been rented and select those for which the count is zero.

q = '''
SELECT 
    t.film_id, t.title, t.rentals 
FROM 
    (SELECT 
        f.film_id, f.title, 
        COUNT(distinct r.rental_id) as rentals
    FROM film f
        LEFT JOIN inventory i ON i.film_id = f.film_id
        LEFT JOIN rental r ON r.inventory_id = i.inventory_id
    GROUP BY 1,2
    HAVING 
        COUNT(distinct r.rental_id) = 0
        ) t
'''
df_sql = pd.read_sql_query(q, con)
print(df_sql.shape)
df_sql
(42, 3)
film_id title rentals
0 14 Alice Fantasia 0
1 33 Apollo Teen 0
2 36 Argonauts Town 0
3 38 Ark Ridgemont 0
4 41 Arsenic Independence 0
5 87 Boondock Ballroom 0
6 108 Butch Panther 0
7 128 Catch Amistad 0
8 144 Chinatown Gladiator 0
9 148 Chocolate Duck 0
10 171 Commandments Express 0
11 192 Crossing Divorce 0
12 195 Crowds Telemark 0
13 198 Crystal Breaking 0
14 217 Dazed Punk 0
15 221 Deliverance Mulholland 0
16 318 Firehouse Vietnam 0
17 325 Floats Garden 0
18 332 Frankenstein Stranger 0
19 359 Gladiator Westward 0
20 386 Gump Date 0
21 404 Hate Handicap 0
22 419 Hocus Frida 0
23 495 Kentuckian Giant 0
24 497 Kill Brotherhood 0
25 607 Muppet Mile 0
26 642 Order Betrayed 0
27 669 Pearl Destiny 0
28 671 Perdition Fargo 0
29 701 Psycho Shrunk 0
30 712 Raiders Antitrust 0
31 713 Rainbow Shock 0
32 742 Roof Champion 0
33 801 Sister Freddy 0
34 802 Sky Miracle 0
35 860 Suicides Silence 0
36 874 Tadpole Park 0
37 909 Treasure Command 0
38 943 Villain Desperate 0
39 950 Volume House 0
40 954 Wake Jaws 0
41 955 Walls Artist 0

▶ Python

The go-to approach for join operations in pandas is the merge method. The type of join can be specified via the optional how parameter.

film, inventory, rental = tbl['film'], tbl['inventory'], tbl['rental']

df = (film
        .merge(inventory, on='film_id', how='left')
        .merge(rental, on='inventory_id', how='left')
        .groupby(['film_id','title'])[['rental_id']]
        .count()
        .rename(columns={'rental_id':'rentals'})
        .query('rentals == 0')
        .reset_index()
     )
print(df.shape)
df
(42, 3)
film_id title rentals
0 14 Alice Fantasia 0
1 33 Apollo Teen 0
2 36 Argonauts Town 0
3 38 Ark Ridgemont 0
4 41 Arsenic Independence 0
5 87 Boondock Ballroom 0
6 108 Butch Panther 0
7 128 Catch Amistad 0
8 144 Chinatown Gladiator 0
9 148 Chocolate Duck 0
10 171 Commandments Express 0
11 192 Crossing Divorce 0
12 195 Crowds Telemark 0
13 198 Crystal Breaking 0
14 217 Dazed Punk 0
15 221 Deliverance Mulholland 0
16 318 Firehouse Vietnam 0
17 325 Floats Garden 0
18 332 Frankenstein Stranger 0
19 359 Gladiator Westward 0
20 386 Gump Date 0
21 404 Hate Handicap 0
22 419 Hocus Frida 0
23 495 Kentuckian Giant 0
24 497 Kill Brotherhood 0
25 607 Muppet Mile 0
26 642 Order Betrayed 0
27 669 Pearl Destiny 0
28 671 Perdition Fargo 0
29 701 Psycho Shrunk 0
30 712 Raiders Antitrust 0
31 713 Rainbow Shock 0
32 742 Roof Champion 0
33 801 Sister Freddy 0
34 802 Sky Miracle 0
35 860 Suicides Silence 0
36 874 Tadpole Park 0
37 909 Treasure Command 0
38 943 Villain Desperate 0
39 950 Volume House 0
40 954 Wake Jaws 0
41 955 Walls Artist 0

▶ R

We start by “transferring” the new dataframes to R. The rest of the pipeline is very much like the one used in Python.

%%R -i rental -i inventory -i film -o df_r
df_r = film %>% 
        left_join(inventory, by="film_id") %>% 
        left_join(rental, by="inventory_id") %>% 
        group_by(film_id, title) %>% 
        summarise(rentals=n_distinct(rental_id, na.rm=TRUE)) %>% 
        filter(rentals==0) %>%
        as.data.frame()

df_r
   film_id                  title rentals
1       14         Alice Fantasia       0
2       33            Apollo Teen       0
3       36         Argonauts Town       0
4       38          Ark Ridgemont       0
5       41   Arsenic Independence       0
6       87      Boondock Ballroom       0
7      108          Butch Panther       0
8      128          Catch Amistad       0
9      144    Chinatown Gladiator       0
10     148         Chocolate Duck       0
11     171   Commandments Express       0
12     192       Crossing Divorce       0
13     195        Crowds Telemark       0
14     198       Crystal Breaking       0
15     217             Dazed Punk       0
16     221 Deliverance Mulholland       0
17     318      Firehouse Vietnam       0
18     325          Floats Garden       0
19     332  Frankenstein Stranger       0
20     359     Gladiator Westward       0
21     386              Gump Date       0
22     404          Hate Handicap       0
23     419            Hocus Frida       0
24     495       Kentuckian Giant       0
25     497       Kill Brotherhood       0
26     607            Muppet Mile       0
27     642         Order Betrayed       0
28     669          Pearl Destiny       0
29     671        Perdition Fargo       0
30     701          Psycho Shrunk       0
31     712      Raiders Antitrust       0
32     713          Rainbow Shock       0
33     742          Roof Champion       0
34     801          Sister Freddy       0
35     802            Sky Miracle       0
36     860       Suicides Silence       0
37     874           Tadpole Park       0
38     909       Treasure Command       0
39     943      Villain Desperate       0
40     950           Volume House       0
41     954              Wake Jaws       0
42     955           Walls Artist       0
all(df==df_sql), all(df==df_r)
(True, True)

Find each customer’s first order.

To solve the problem in SQL we need a correlated subquery, whereby the inner query gets executed for every row of the outer query.

In our case this is necessary because min(rental_id) changes with each customer.

▶ SQL

q = '''
SELECT 
    r.customer_id, 
    min(r.rental_id) first_order_id, 
    (    
        SELECT r2.rental_date::date FROM rental r2 
        WHERE r2.rental_id = min(r.rental_id)
    ) first_order_date
FROM 
    rental r
GROUP BY 1
ORDER BY 1
'''

df_sql = pd.read_sql_query(q, con)
print(df_sql.shape)
display(df_sql.head())
(599, 3)
customer_id first_order_id first_order_date
0 1 76 2005-05-25
1 2 320 2005-05-27
2 3 435 2005-05-27
3 4 1297 2005-06-15
4 5 731 2005-05-29

▶ Python

In this case the Python syntax is a bit simpler. We group by the customer_id and pick the smallest rental_id.

df = (rental
        .assign(rental_date=rental.rental_date.dt.date)
        .groupby('customer_id')[['rental_id', 'rental_date']]
        .min()
        .reset_index()
        .rename(columns={'rental_id':'first_order_id', 'rental_date':'first_order_date'})
      )

print(df.shape)
display(df.head())
(599, 3)
customer_id first_order_id first_order_date
0 1 76 2005-05-25
1 2 320 2005-05-27
2 3 435 2005-05-27
3 4 1297 2005-06-15
4 5 731 2005-05-29
all(df==df_sql)
True

▶ R

The R version of the code is similar to the Python one.

%%R 

df_r = rental %>% 
        mutate(rental_date=lubridate::date(rental_date)) %>% 
        group_by(customer_id) %>% 
        summarise_at(vars(rental_id, rental_date), funs(min)) %>%
        as.data.frame()

print(dim(df_r))
print(head(df_r, 5))
[1] 599   3
  customer_id rental_id rental_date
1           1        76  2005-05-25
2           2       320  2005-05-27
3           3       435  2005-05-27
4           4      1297  2005-06-15
5           5       731  2005-05-29

Handling multiple conditions.

Let’s look at a more complex query. We’d like to return all the rental transactions handled by the staff member with id 1, for which the rental id is larger than 15000, and for which the payment occurred between 4am and 5am or between 11pm and midnight in March 2007.

▶ SQL

Although more complex than our previous examples, this query can be built from the pieces we’ve used before. To keep things somewhat simple, we’ll take “between 4am and 5am” to mean between 4:00 and 4:59, which lets us concern ourselves only with the hour number when filtering our data. The same goes for the second time period.

q = '''
WITH base_table AS (
    SELECT gs::date AS date,  p.*
    FROM 
        generate_series('2007-03-01', '2007-03-31', INTERVAL '1 day') as gs
    LEFT JOIN payment p ON p.payment_date::date=gs::date AND p.staff_id = 1
    ORDER BY 1 NULLS FIRST
) 
SELECT 
    bt.date, bt.payment_id, bt.customer_id, bt.staff_id, bt.rental_id, bt.amount, 
    EXTRACT(hour FROM bt.payment_date)::int AS hour
FROM 
    base_table bt
WHERE 
    bt.rental_id > 15000 
    AND         
    EXTRACT(hour FROM bt.payment_date) IN (4,23)
ORDER BY bt.payment_date, bt.rental_id
'''
df_sql = pd.read_sql_query(q, con)
print(df_sql.shape)
df_sql
(36, 7)
date payment_id customer_id staff_id rental_id amount hour
0 2007-03-22 23134 46 1 15438 2.99 23
1 2007-03-22 23958 136 1 15439 4.99 23
2 2007-03-22 23095 42 1 15442 2.99 23
3 2007-03-22 22615 598 1 15443 7.99 23
4 2007-03-22 22960 28 1 15445 4.99 23
5 2007-03-22 20574 379 1 15446 4.99 23
6 2007-03-22 21651 490 1 15448 2.99 23
7 2007-03-22 20027 322 1 15450 0.99 23
8 2007-03-22 21866 514 1 15451 2.99 23
9 2007-03-22 20370 359 1 15453 1.99 23
10 2007-03-22 24882 238 1 15455 0.99 23
11 2007-03-22 25113 262 1 15456 0.99 23
12 2007-03-22 19906 306 1 15457 2.99 23
13 2007-03-22 22876 20 1 15460 2.99 23
14 2007-03-22 23646 103 1 15461 5.99 23
15 2007-03-22 20675 389 1 15462 5.99 23
16 2007-03-22 23869 127 1 15463 5.99 23
17 2007-03-22 23283 62 1 15464 6.99 23
18 2007-03-22 21921 520 1 15465 0.99 23
19 2007-03-22 20970 418 1 15466 4.99 23
20 2007-03-22 23647 103 1 15467 3.99 23
21 2007-03-22 20768 399 1 15468 4.99 23
22 2007-03-22 22610 597 1 15469 4.99 23
23 2007-03-23 24692 215 1 15583 2.99 4
24 2007-03-23 21731 500 1 15584 2.99 4
25 2007-03-23 22149 545 1 15585 0.99 4
26 2007-03-23 21723 499 1 15587 7.99 4
27 2007-03-23 21764 503 1 15588 3.99 4
28 2007-03-23 22901 22 1 15589 6.99 4
29 2007-03-23 23284 62 1 15591 0.99 4
30 2007-03-23 24163 153 1 15593 1.99 4
31 2007-03-23 25135 264 1 15595 4.99 4
32 2007-03-23 24478 186 1 15596 0.99 4
33 2007-03-23 23323 66 1 15598 8.99 4
34 2007-03-23 23648 103 1 15599 5.99 4
35 2007-03-23 23729 113 1 15600 1.99 4

▶ Python

Just for some variety, we use the between method of a pandas Series to select the March 2007 data, passing pandas Timestamp objects that mark the first and last day of the month. The rest of the pipeline should be self-explanatory. The last line is added for convenience: it orders the columns of the pandas dataframe to match the ordering of the columns in our SQL result.

df = (payment
  .assign(date=payment.payment_date.dt.date)
  .loc[payment.payment_date.between(
              pd.Timestamp(year=2007, month=3, day=1), 
              pd.Timestamp(year=2007, month=3, day=31))
      ]
  .assign(hour=payment.payment_date.dt.hour)
  .query('(hour in (4, 23)) & (rental_id > 15000) & (staff_id == 1)')
  .sort_values('payment_date')
  .reset_index()
  .loc[:, df_sql.columns]
) 

print(df.shape)
df
(36, 7)
date payment_id customer_id staff_id rental_id amount hour
0 2007-03-22 23134 46 1 15438 2.99 23
1 2007-03-22 23958 136 1 15439 4.99 23
2 2007-03-22 23095 42 1 15442 2.99 23
3 2007-03-22 22615 598 1 15443 7.99 23
4 2007-03-22 22960 28 1 15445 4.99 23
5 2007-03-22 20574 379 1 15446 4.99 23
6 2007-03-22 21651 490 1 15448 2.99 23
7 2007-03-22 20027 322 1 15450 0.99 23
8 2007-03-22 21866 514 1 15451 2.99 23
9 2007-03-22 20370 359 1 15453 1.99 23
10 2007-03-22 24882 238 1 15455 0.99 23
11 2007-03-22 25113 262 1 15456 0.99 23
12 2007-03-22 19906 306 1 15457 2.99 23
13 2007-03-22 22876 20 1 15460 2.99 23
14 2007-03-22 23646 103 1 15461 5.99 23
15 2007-03-22 20675 389 1 15462 5.99 23
16 2007-03-22 23869 127 1 15463 5.99 23
17 2007-03-22 23283 62 1 15464 6.99 23
18 2007-03-22 21921 520 1 15465 0.99 23
19 2007-03-22 20970 418 1 15466 4.99 23
20 2007-03-22 23647 103 1 15467 3.99 23
21 2007-03-22 20768 399 1 15468 4.99 23
22 2007-03-22 22610 597 1 15469 4.99 23
23 2007-03-23 24692 215 1 15583 2.99 4
24 2007-03-23 21731 500 1 15584 2.99 4
25 2007-03-23 22149 545 1 15585 0.99 4
26 2007-03-23 21723 499 1 15587 7.99 4
27 2007-03-23 21764 503 1 15588 3.99 4
28 2007-03-23 22901 22 1 15589 6.99 4
29 2007-03-23 23284 62 1 15591 0.99 4
30 2007-03-23 24163 153 1 15593 1.99 4
31 2007-03-23 25135 264 1 15595 4.99 4
32 2007-03-23 24478 186 1 15596 0.99 4
33 2007-03-23 23323 66 1 15598 8.99 4
34 2007-03-23 23648 103 1 15599 5.99 4
35 2007-03-23 23729 113 1 15600 1.99 4
all(df==df_sql)
True

▶ R

Once again, a similar pipeline can be built in R.

%%R -o df_r

df_r = payment %>% 
  mutate(date=lubridate::date(payment_date)) %>% 
  filter(date >= as.Date('2007-03-01') &  date <= as.Date('2007-03-31')) %>% 
  mutate(hour=lubridate::hour(payment_date)) %>% 
  filter(hour %in% c(4, 23) & rental_id > 15000 & staff_id == 1) %>% 
  arrange(payment_date) %>% 
  select(date, payment_id, customer_id, staff_id, rental_id, amount, hour) %>% 
  as.data.frame()

print(dim(df_r))
df_r
[1] 36  7
         date payment_id customer_id staff_id rental_id amount hour
1  2007-03-22      23134          46        1     15438   2.99   23
2  2007-03-22      23958         136        1     15439   4.99   23
3  2007-03-22      23095          42        1     15442   2.99   23
4  2007-03-22      22615         598        1     15443   7.99   23
5  2007-03-22      22960          28        1     15445   4.99   23
6  2007-03-22      20574         379        1     15446   4.99   23
7  2007-03-22      21651         490        1     15448   2.99   23
8  2007-03-22      20027         322        1     15450   0.99   23
9  2007-03-22      21866         514        1     15451   2.99   23
10 2007-03-22      20370         359        1     15453   1.99   23
11 2007-03-22      24882         238        1     15455   0.99   23
12 2007-03-22      25113         262        1     15456   0.99   23
13 2007-03-22      19906         306        1     15457   2.99   23
14 2007-03-22      22876          20        1     15460   2.99   23
15 2007-03-22      23646         103        1     15461   5.99   23
16 2007-03-22      20675         389        1     15462   5.99   23
17 2007-03-22      23869         127        1     15463   5.99   23
18 2007-03-22      23283          62        1     15464   6.99   23
19 2007-03-22      21921         520        1     15465   0.99   23
20 2007-03-22      20970         418        1     15466   4.99   23
21 2007-03-22      23647         103        1     15467   3.99   23
22 2007-03-22      20768         399        1     15468   4.99   23
23 2007-03-22      22610         597        1     15469   4.99   23
24 2007-03-23      24692         215        1     15583   2.99    4
25 2007-03-23      21731         500        1     15584   2.99    4
26 2007-03-23      22149         545        1     15585   0.99    4
27 2007-03-23      21723         499        1     15587   7.99    4
28 2007-03-23      21764         503        1     15588   3.99    4
29 2007-03-23      22901          22        1     15589   6.99    4
30 2007-03-23      23284          62        1     15591   0.99    4
31 2007-03-23      24163         153        1     15593   1.99    4
32 2007-03-23      25135         264        1     15595   4.99    4
33 2007-03-23      24478         186        1     15596   0.99    4
34 2007-03-23      23323          66        1     15598   8.99    4
35 2007-03-23      23648         103        1     15599   5.99    4
36 2007-03-23      23729         113        1     15600   1.99    4

Customer lifetime value

Which customers made their first order on a weekend, paid more than $5, and have a customer lifetime value (total amount spent) which exceeds $100?

▶ SQL

For this query we need to extract the day of the week for each transaction. This can be done with EXTRACT(dow FROM ...).

Note that in PostgreSQL Sunday is coded as 0, and Saturday as 6.

q = '''
SELECT t.* FROM (
    SELECT 
        p.*, 
        EXTRACT(dow FROM p.payment_date)::int  dow,
        (
            SELECT SUM(p3.amount) 
            FROM payment p3
            WHERE p3.customer_id = p.customer_id   
        ) as CLV
    FROM 
        payment p
    WHERE 
        p.payment_id = (
            SELECT MIN(p2.payment_id)
            FROM payment p2
            WHERE p.customer_id = p2.customer_id
        ) 
        AND 
            EXTRACT(dow FROM p.payment_date) IN (0, 6)
        AND 
            p.amount > 5     

    GROUP BY 1
)t WHERE t.CLV > 100
ORDER BY t.CLV DESC
'''
df_sql = pd.read_sql_query(q, con)
print(df_sql.shape)
df_sql
(17, 8)
payment_id customer_id staff_id rental_id amount payment_date dow clv
0 19029 137 1 2469 6.99 2007-02-18 18:52:49.996577 0 191.62
1 18572 21 2 2235 7.99 2007-02-18 02:37:16.996577 0 146.68
2 17526 346 1 1994 5.99 2007-02-17 09:35:32.996577 6 145.70
3 19502 265 2 2027 7.99 2007-02-17 11:35:22.996577 6 132.72
4 17509 342 2 2190 5.99 2007-02-17 23:58:17.996577 6 130.68
5 17866 436 1 2291 9.99 2007-02-18 06:05:12.996577 0 126.73
6 18099 497 2 2180 8.99 2007-02-17 23:16:09.996577 6 121.73
7 18995 128 2 2519 7.99 2007-02-18 22:47:47.996577 0 118.70
8 19496 263 2 2126 8.99 2007-02-17 19:23:02.996577 6 116.73
9 18636 32 2 1887 6.99 2007-02-17 02:21:44.996577 6 112.74
10 19345 225 2 2226 7.99 2007-02-18 02:08:22.996577 0 111.76
11 18395 579 2 2425 5.99 2007-02-18 16:06:11.996577 0 111.73
12 18554 16 2 1934 6.99 2007-02-17 05:33:23.996577 6 109.75
13 18666 40 2 2470 7.99 2007-02-18 18:56:57.996577 0 105.74
14 18446 593 2 2055 5.99 2007-02-17 13:55:29.996577 6 101.76
15 18367 572 2 1889 10.99 2007-02-17 02:33:38.996577 6 100.76
16 19441 251 1 2238 6.99 2007-02-18 02:50:32.996577 0 100.75

▶ Python

Let’s break the pipeline into two parts. First, we compute the subset of customers who placed their first order on a weekend.

Note that unlike PostgreSQL, pandas codes Saturday as 5 and Sunday as 6.
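A quick check of that encoding (an illustrative snippet, not part of the pipeline):

pd.Timestamp('2007-02-17').day_name(), pd.Timestamp('2007-02-17').weekday()   # ('Saturday', 5)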

subset_fo = (payment
               .assign(dow=lambda x: x.payment_date.dt.weekday) # extract the day of the week as an integer
               .groupby('customer_id')[['payment_date', 'payment_id', 'amount', 'rental_id', 'staff_id', 'dow']]
               .first()                                         # pick the first row for each group
               .query('(amount > 5) & (dow in [5, 6])')
             )
subset_fo.head()
payment_date payment_id amount rental_id staff_id dow
customer_id
16 2007-02-17 05:33:23.996577 18554 6.99 1934 2 5
17 2007-02-17 22:46:24.996577 18558 5.99 2175 2 5
21 2007-02-18 02:37:16.996577 18572 7.99 2235 2 6
32 2007-02-17 02:21:44.996577 18636 6.99 1887 2 5
40 2007-02-18 18:56:57.996577 18666 7.99 2470 2 6

We then join the payment table on this subset to get the information we want.

df = (payment
       .loc[payment.customer_id.isin(subset_fo.index)]
       .groupby('customer_id')[['amount']].sum()
       .rename(columns={'amount':'clv'})
       .query('clv >= 100')
       .join(subset_fo, how='left')
       .sort_values('clv', ascending=False)
       .reset_index()
       .loc[:, df_sql.columns]
    )

print(df.shape)
df
(17, 8)
payment_id customer_id staff_id rental_id amount payment_date dow clv
0 19029 137 1 2469 6.99 2007-02-18 18:52:49.996577 6 191.62
1 18572 21 2 2235 7.99 2007-02-18 02:37:16.996577 6 146.68
2 17526 346 1 1994 5.99 2007-02-17 09:35:32.996577 5 145.70
3 19502 265 2 2027 7.99 2007-02-17 11:35:22.996577 5 132.72
4 17509 342 2 2190 5.99 2007-02-17 23:58:17.996577 5 130.68
5 17866 436 1 2291 9.99 2007-02-18 06:05:12.996577 6 126.73
6 18099 497 2 2180 8.99 2007-02-17 23:16:09.996577 5 121.73
7 18995 128 2 2519 7.99 2007-02-18 22:47:47.996577 6 118.70
8 19496 263 2 2126 8.99 2007-02-17 19:23:02.996577 5 116.73
9 18636 32 2 1887 6.99 2007-02-17 02:21:44.996577 5 112.74
10 19345 225 2 2226 7.99 2007-02-18 02:08:22.996577 6 111.76
11 18395 579 2 2425 5.99 2007-02-18 16:06:11.996577 6 111.73
12 18554 16 2 1934 6.99 2007-02-17 05:33:23.996577 5 109.75
13 18666 40 2 2470 7.99 2007-02-18 18:56:57.996577 6 105.74
14 18446 593 2 2055 5.99 2007-02-17 13:55:29.996577 5 101.76
15 18367 572 2 1889 10.99 2007-02-17 02:33:38.996577 5 100.76
16 19441 251 1 2238 6.99 2007-02-18 02:50:32.996577 6 100.75
all(df==df_sql)
True

▶ R

Let’s do the same with R. We’ll drop the payment_date column at the end so that we can fit all the remaining columns horizontally.

Of course, R (lubridate) uses yet another encoding for the days of the week. Saturday is represented as 7 and Sunday as 1.

%%R

subset_fo = payment %>% 
              group_by(customer_id) %>% 
              mutate(dow=lubridate::wday(payment_date)) %>% 
              filter(row_number()==1 & dow%in% c(1, 7) & amount>5) 

df_r = payment %>% 
        right_join(subset_fo, by="customer_id") %>% 
        group_by(customer_id) %>% 
        summarise(clv=sum(amount.x)) %>% 
        filter(clv > 100) %>% 
        left_join(subset_fo, by='customer_id') %>% 
        select(payment_id, customer_id, staff_id, rental_id, amount, payment_date, dow, clv) %>% 
        arrange(desc(clv)) %>%
        select(-payment_date)

print(dim(df_r))
as.data.frame(df_r)
[1] 17  7
   payment_id customer_id staff_id rental_id amount dow    clv
1       19029         137        1      2469   6.99   1 191.62
2       18572          21        2      2235   7.99   1 146.68
3       17526         346        1      1994   5.99   7 145.70
4       19502         265        2      2027   7.99   7 132.72
5       17509         342        2      2190   5.99   7 130.68
6       17866         436        1      2291   9.99   1 126.73
7       18099         497        2      2180   8.99   7 121.73
8       18995         128        2      2519   7.99   1 118.70
9       19496         263        2      2126   8.99   7 116.73
10      18636          32        2      1887   6.99   7 112.74
11      19345         225        2      2226   7.99   1 111.76
12      18395         579        2      2425   5.99   1 111.73
13      18554          16        2      1934   6.99   7 109.75
14      18666          40        2      2470   7.99   1 105.74
15      18446         593        2      2055   5.99   7 101.76
16      18367         572        2      1889  10.99   7 100.76
17      19441         251        1      2238   6.99   1 100.75

How many movies have a replacement cost above or below the average replacement cost?

One way to answer this is to compute how many movies have a replacement cost above the average, how many have a replacement cost at or below the average, and take the union of the two.

▶ SQL

q = '''
SELECT t.grouping, COUNT(*) FROM (
    SELECT f.*, 'above' as grouping
    FROM film f
    WHERE f.replacement_cost > (SELECT AVG(f2.replacement_cost) FROM film f2) 

    UNION 

    SELECT f.*, 'below_eq' as grouping
    FROM film f
    WHERE f.replacement_cost <= (SELECT AVG(f2.replacement_cost) FROM film f2) 
    )t 
GROUP BY 1
'''
df_sql = pd.read_sql_query(q, con)
print(df_sql.shape)
df_sql.head()
(2, 2)
grouping count
0 below_eq 464
1 above 536

▶ Python

The Python syntax is definitely simpler here because we can create a Boolean mask telling us whether the replacement cost is above average, and then use value_counts to tally up the True and False counts.

(film
  .assign(above=film.replacement_cost > film.replacement_cost.mean())
  .loc[:, 'above']
  .value_counts()
  .rename(index={True: 'above', False: 'below_eq'})
)
above       536
below_eq    464
Name: above, dtype: int64
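If we wanted to stay closer to the SQL UNION, a concat-based sketch along these lines should also work (assuming the film dataframe loaded earlier):

avg_cost = film.replacement_cost.mean()
above = film[film.replacement_cost > avg_cost].assign(grouping='above')
below_eq = film[film.replacement_cost <= avg_cost].assign(grouping='below_eq')
pd.concat([above, below_eq]).groupby('grouping').size()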

▶ R

The R code is similar to the Python one.

%%R
film %>% 
  mutate(above=replacement_cost>mean(film$replacement_cost)) %>% 
  count(above) %>% 
  mutate(above=if_else(above, 'above', 'below_eq'))
# A tibble: 2 x 2
  above        n
  <chr>    <int>
1 below_eq   464
2 above      536

Which movie generated the most revenue in each viewer rating category?

▶ SQL

The grouping by rating part is straightforward; however, we want the top-performing movie within each group. We can use a window function (the OVER clause) to do this.

q = '''
WITH some_table AS (
    SELECT 
        f.film_id, 
        f.title, 
        f.rating, 
        SUM(p.amount),
        ROW_NUMBER() OVER(PARTITION BY f.rating ORDER BY SUM(p.amount) DESC)

    FROM film f
    JOIN inventory i ON f.film_id = i.film_id
    JOIN rental r ON r.inventory_id = i.inventory_id
    JOIN payment p ON p.rental_id = r.rental_id
    GROUP BY 1, 2, 3
    ORDER BY 3) 
    
SELECT st.* FROM some_table st 
WHERE st.row_number = 1
ORDER BY 4 DESC
'''
df_sql = pd.read_sql_query(q, con)
df_sql
film_id title rating sum row_number
0 879 Telegraph Voyage PG 215.75 1
1 1000 Zorro Ark NC-17 199.72 1
2 460 Innocent Usual PG-13 191.74 1
3 764 Saturday Lambs G 190.74 1
4 938 Velvet Terminator R 152.77 1

▶ Python

The pattern below is a common one. We group by some features to compute an aggregate measure, and then ungroup (reset_index) to group the data again at a different level of granularity.

df = (film
       .merge(inventory, on='film_id', how='left')    
       .merge(rental, on='inventory_id', how='left')
       .merge(payment, on='rental_id', how='left')
       .groupby(['rating','film_id'])[['amount']]
       .sum()
       .sort_values('amount', ascending=False)
       .rename(columns={'amount': 'sum'})
       .reset_index()
       .groupby('rating')
       .first()
       .merge(film, on='film_id', how='left')
       .sort_values('sum', ascending=False)
       .reset_index()
       .loc[:, ['film_id','title','rating','sum']]
      )

df
film_id title rating sum
0 879 Telegraph Voyage PG 215.75
1 1000 Zorro Ark NC-17 199.72
2 460 Innocent Usual PG-13 191.74
3 764 Saturday Lambs G 190.74
4 938 Velvet Terminator R 152.77
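For a version closer in spirit to the SQL window function, we could rank films by revenue within each rating and keep the top-ranked row. A sketch, assuming the same dataframes as above:

revenue = (film
            .merge(inventory, on='film_id', how='left')
            .merge(rental, on='inventory_id', how='left')
            .merge(payment, on='rental_id', how='left')
            .groupby(['rating', 'film_id', 'title'], as_index=False)['amount']
            .sum()
            .rename(columns={'amount': 'sum'})
          )
revenue[revenue.groupby('rating')['sum'].rank(method='first', ascending=False) == 1]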

▶ R

This is essentially the same approach as in Python.

%%R
df_r = film %>% 
  left_join(inventory, by='film_id') %>% 
  left_join(rental, by='inventory_id') %>% 
  left_join(payment, by='rental_id') %>% 
  group_by(rating, film_id) %>% 
  summarise(sum=sum(amount, na.rm=TRUE)) %>% 
  ungroup() %>% 
  arrange(desc(sum)) %>% 
  group_by(rating) %>% 
  filter(row_number()==1) %>% 
  left_join(film, by='film_id') %>% 
  select(film_id, title, rating.x, sum) 

as.data.frame(df_r)
  film_id             title rating.x    sum
1     879  Telegraph Voyage       PG 215.75
2    1000         Zorro Ark    NC-17 199.72
3     460    Innocent Usual    PG-13 191.74
4     764    Saturday Lambs        G 190.74
5     938 Velvet Terminator        R 152.77

Hopefully, the examples above show that a significant part of the syntax used to wrangle structured data (data arranged in tables), in both Python (pandas) and R (tidyverse), has been influenced by SQL.

Many of the pipelines we built consist of the same steps as the corresponding SQL query. This is especially true for basic queries where we do simple things like group by columns or build a simple aggregate measure. Similarly, syntactic differences aside, joining operations work in a comparable way across the three languages.

As we consider more complex operations, however, a direct port of the SQL query to the other languages may not be the best approach. We saw a small example of this when we computed which days in March had no rentals. The good news is that these languages aren’t mutually exclusive. After all, we ran all our SQL via Python (pandas). Rather, these languages complement each other and can be used together to build pipelines that perform complex manipulations of the data efficiently. Very often, SQL is used first to extract a subset of the data from a large database, which is then manipulated and transformed further in Python or R.

To steal a line from Alexandre Dumas’ The Three Musketeers, this is truly a case of All for one and one for all!

Pandas (Blood)groupby

from math import *
import numpy as np
import pandas as pd
from pathlib import Path

%matplotlib inline
import matplotlib.pyplot as plt

from IPython.core.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

1. Pandas GroupBy

To illustrate how the pandas groupby operation works, let’s use some fake data. With numpy it’s trivial to generate random numerical data; however, it’s usually a lot more tedious to generate random people data that looks realistic.

A very useful tool for this is the Fake Name Generator. That’s what I used to generate this dataset.

Let’s load the dataset into a dataframe.

input_path = Path('data/people.csv')

data = pd.read_csv(input_path, encoding='utf-8')
data.head()
GivenName Surname Gender StreetAddress City Country Birthday BloodType
0 Stepanida Sukhorukova female 62 Ockham Road EASTER WHYNTIE United Kingdom 8/25/1968 A+
1 Hiệu Lương male 4 Iolaire Road NEW CROSS United Kingdom 1/31/1962 A+
2 Petra Neudorf female 56 Victoria Road LISTON United Kingdom 1/10/1964 B+
3 Eho Amano female 83 Stroud Rd OGMORE United Kingdom 4/12/1933 O-
4 Noah Niland male 61 Wrexham Rd FACEBY United Kingdom 11/20/1946 A+

The data contains information about fictitious people.

data.count()
GivenName        5000
Surname          5000
Gender           5000
StreetAddress    5000
City             5000
Country          5000
Birthday         5000
BloodType        5000
dtype: int64

Remark:

We've got different types of variables here. For instance, Gender is a categorical variable, as are BloodType and Country.

Birthday can be thought of as an interval variable, although we'll convert it to an Age variable, which is more convenient for our purpose, and will be treated as a continuous numeric variable.

Notice, however, that by default pandas will identify non-numeric data as a generic object, which is not the most efficient way of handling categorical data. Let's fix that.

data.dtypes
GivenName        object
Surname          object
Gender           object
StreetAddress    object
City             object
Country          object
Birthday         object
BloodType        object
dtype: object
data['Gender']  = data['Gender'].astype('category')
data['Country'] = data['Country'].astype('category')
data['BloodType']  = data['BloodType'].astype('category')
data.dtypes
GivenName          object
Surname            object
Gender           category
StreetAddress      object
City               object
Country          category
Birthday           object
BloodType        category
dtype: object
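
The payoff of the category dtype can be checked with memory_usage (the exact numbers will vary with the data and the pandas version; this is only illustrative):

# Memory footprint of the same column stored as a generic object vs as a category.
print(data['Gender'].astype('object').memory_usage(deep=True))
print(data['Gender'].memory_usage(deep=True))  # typically much smaller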

A. Dealing with dates.

Method 1 - datetime module.

One of the features (columns) in our dataset is Birthday. While this information may be useful if you wanted to find out, for instance, how many people share a birthday, most of the time we mainly care about their age.

One way to convert the Birthday into an Age feature would be to extract the year and compute the current year minus the birthday year. The split method of string objects could be used for that; however, there’s a more elegant and general way of handling dates in Python using the datetime.strptime function.

Remark:

This isn't necessarily the fastest, or best way to handle date and time objects in pandas.

from datetime import datetime

t1 = datetime.strptime("21/01/2019", "%d/%m/%Y")
print(t1.year)
print(datetime.today()-t1)
2019
62 days, 20:14:58.495945

The datetime module allows you to manipulate date objects and perform basic operations on dates. It also allows you to format dates in a consistent way.
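
Going the other way, strftime turns a datetime object back into a string in whatever layout we choose (the formats below are just examples):

from datetime import timedelta

# strftime is the inverse of strptime: datetime object -> formatted string.
print(t1.strftime("%Y-%m-%d"))                         # 2019-01-21
print((t1 + timedelta(days=30)).strftime("%d %B %Y"))  # 20 February 2019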

We can apply this to our Birthday feature.

data.Birthday[0], data.Birthday[1]
('8/25/1968', '1/31/1962')
datetime.strptime(data.Birthday[0],"%m/%d/%Y")
datetime.datetime(1968, 8, 25, 0, 0)
datetime.strptime(data.Birthday[1],"%m/%d/%Y")
datetime.datetime(1962, 1, 31, 0, 0)

Let’s use the pandas apply method to parse the dates in a consistent way.

data.Birthday = data.apply(lambda row: datetime.strptime(row['Birthday'], "%m/%d/%Y"), axis='columns')
data.Birthday.head()
0   1968-08-25
1   1962-01-31
2   1964-01-10
3   1933-04-12
4   1946-11-20
Name: Birthday, dtype: datetime64[ns]

Method 2 - using the to_datetime function of pandas.

original_data = pd.read_csv(input_path, encoding='utf-8')
original_data.Birthday = pd.to_datetime(original_data.Birthday)
original_data.Birthday.head()
0   1968-08-25
1   1962-01-31
2   1964-01-10
3   1933-04-12
4   1946-11-20
Name: Birthday, dtype: datetime64[ns]

Method 3 - convert at reading time.

original_data = pd.read_csv(input_path, encoding='utf-8', parse_dates=['Birthday'])
original_data.Birthday.head()
0   1968-08-25
1   1962-01-31
2   1964-01-10
3   1933-04-12
4   1946-11-20
Name: Birthday, dtype: datetime64[ns]

We can now define an Age feature by subtracting the year of birth from the current year.

Remark:

Because we are not hard-coding (typing by hand) the current date, if we come back to this code in a year's time and run it again, the Age will be updated automatically.

We could compute the age using apply to iterate through the rows of the dataframe…

data.apply(lambda row: datetime.now().year - row['Birthday'].year, axis='columns').head()
0    51
1    57
2    55
3    86
4    73
dtype: int64

However, it is usually faster to operate on the whole dataframe at once.

Datetime methods on a pandas series (such as a column of a dataframe) can be accessed via the dt accessor.

(datetime.now().year - data.Birthday.dt.year).head()
0    51
1    57
2    55
3    86
4    73
Name: Birthday, dtype: int64
data['Age'] = datetime.now().year - data.Birthday.dt.year
data.head()
GivenName Surname Gender StreetAddress City Country Birthday BloodType Age
0 Stepanida Sukhorukova female 62 Ockham Road EASTER WHYNTIE United Kingdom 1968-08-25 A+ 51
1 Hiệu Lương male 4 Iolaire Road NEW CROSS United Kingdom 1962-01-31 A+ 57
2 Petra Neudorf female 56 Victoria Road LISTON United Kingdom 1964-01-10 B+ 55
3 Eho Amano female 83 Stroud Rd OGMORE United Kingdom 1933-04-12 O- 86
4 Noah Niland male 61 Wrexham Rd FACEBY United Kingdom 1946-11-20 A+ 73
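
To back up the claim that the vectorised version is faster, here is a rough, machine-dependent timing sketch (the absolute numbers are not meaningful, only the ratio):

import time

# Row-wise apply vs vectorised dt accessor: same result, very different speed.
start = time.perf_counter()
data.apply(lambda row: datetime.now().year - row['Birthday'].year, axis='columns')
print(f'apply:      {time.perf_counter() - start:.4f}s')

start = time.perf_counter()
datetime.now().year - data.Birthday.dt.year
print(f'vectorised: {time.perf_counter() - start:.4f}s')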

B. GroupBy feature.

Given the data, common questions to answer would be how certain features are distributed in each country, or for each gender.

These could range from something as simple as What is the average age in each country? to much more complex questions such as How many people of each blood type are there in each country, for each gender, for a given age group?

Fortunately, Pandas’ GroupBy method allows us to organise the data in ways only limited by our sagacity.

Let's look at the country distribution first.

data.Country.value_counts()
Australia         1800
Italy              800
United States      700
United Kingdom     500
New Zealand        500
France             500
Iceland            200
Name: Country, dtype: int64
data.Country.value_counts(ascending=True).plot(kind='barh', color='g', alpha=0.5);

[Figure: horizontal bar chart of the number of people per country]

The groupby method creates a GroupBy object.

The groupby object is a recipe for how to perform operations on the data.

To use a groupby object, we need to perform some operation on it.

grouped_by_country = data.groupby('Country')
grouped_by_country
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x106ca5710>
grouped_by_country.size()
Country
Australia         1800
France             500
Iceland            200
Italy              800
New Zealand        500
United Kingdom     500
United States      700
dtype: int64
grouped_by_country.ngroups
7
for group in grouped_by_country.groups:
    print(group)
Australia
France
Iceland
Italy
New Zealand
United Kingdom
United States
grouped_by_country['Age'].mean()
Country
Australia         57.817778
France            57.944000
Iceland           57.660000
Italy             56.607500
New Zealand       56.630000
United Kingdom    57.740000
United States     57.415714
Name: Age, dtype: float64

Remark:

Recall that the apply method lets you apply an arbitrary function to your data (row by row on a dataframe, or group by group on a GroupBy object). Therefore, as long as you can express your operations as a function (lambda or otherwise), you can include it in the apply, even if your function returns multiple values, provided you wrap them in a tuple.

grouped_by_country['Age'].apply(lambda x: (np.min(x), 
                                           f'{x.mean():0.2f}', 
                                           np.max(x), 
                                           f'{x.std():0.2f}')
                                )
Country
Australia         (24, 57.82, 91, 19.49)
France            (24, 57.94, 91, 19.54)
Iceland           (24, 57.66, 91, 19.46)
Italy             (24, 56.61, 91, 18.98)
New Zealand       (24, 56.63, 91, 19.05)
United Kingdom    (24, 57.74, 91, 19.53)
United States     (24, 57.42, 91, 19.29)
Name: Age, dtype: object

A different and nicer output can be obtained using the agg method on a groupby object.

grouped_by_country['Age'].agg(['min','mean','max','std'])
min mean max std
Country
Australia 24 57.817778 91 19.488031
France 24 57.944000 91 19.541869
Iceland 24 57.660000 91 19.461194
Italy 24 56.607500 91 18.981275
New Zealand 24 56.630000 91 19.049905
United Kingdom 24 57.740000 91 19.527905
United States 24 57.415714 91 19.285003

For categorical variables, such as BloodType, basic information can be extracted using the describe method.

grouped_by_country['BloodType'].describe()
count unique top freq
Country
Australia 1800 8 O+ 648
France 500 8 O+ 184
Iceland 200 7 O+ 86
Italy 800 8 O+ 299
New Zealand 500 8 O+ 185
United Kingdom 500 8 O+ 188
United States 700 8 O+ 273
grouped_by_country['GivenName'].describe()
count unique top freq
Country
Australia 1800 1368 Michael 9
France 500 444 Anna 4
Iceland 200 193 Dieter 2
Italy 800 677 Jennifer 6
New Zealand 500 447 James 4
United Kingdom 500 450 Ellis 3
United States 700 620 Lily 5

describe only tells us about the most frequent blood type. To get a count of all the blood types, let’s use a Counter object.

Which item appears most frequently?

Counting the number of objects of a given type is such a common operation that Python has a very useful Counter object in the collections module. What a Counter object does to a sequence of items is group the elements of the sequence into bins according to their value, and count how many elements each bin has. A Counter object also has several useful methods, for instance, to determine the most common object in the sequence.
from collections import Counter

L = ['a', 'b', 'c', 'a', 'a', 'c', 'b', 'b', 'b', 'b']
c = Counter(L)
c
Counter({'a': 3, 'b': 5, 'c': 2})
print(c.most_common())
print(c.most_common(2))
print(c.most_common(1))
[('b', 5), ('a', 3), ('c', 2)]
[('b', 5), ('a', 3)]
[('b', 5)]
grouped_by_country['BloodType'].apply(Counter)
Country            
Australia       A+     468.0
                A-      59.0
                AB+    100.0
                AB-      8.0
                B+     419.0
                B-      24.0
                O+     648.0
                O-      74.0
France          A+     145.0
                A-      17.0
                AB+     19.0
                AB-      1.0
                B+     106.0
                B-       4.0
                O+     184.0
                O-      24.0
Iceland         A+      45.0
                A-       6.0
                AB+      5.0
                B+      44.0
                B-       5.0
                O+      86.0
                O-       9.0
Italy           A+     207.0
                A-      31.0
                AB+     37.0
                AB-      5.0
                B+     185.0
                B-       4.0
                O+     299.0
                O-      32.0
New Zealand     A+     154.0
                A-       9.0
                AB+     21.0
                AB-      5.0
                B+     100.0
                B-       5.0
                O+     185.0
                O-      21.0
United Kingdom  A+     132.0
                A-      23.0
                AB+     25.0
                AB-      4.0
                B+     102.0
                B-       7.0
                O+     188.0
                O-      19.0
United States   A+     157.0
                A-      25.0
                AB+     35.0
                AB-      4.0
                B+     157.0
                B-      15.0
                O+     273.0
                O-      34.0
Name: BloodType, dtype: float64

Remark:

Note how the result is one long column of information. This result is actually a Pandas Series object (recall that a Series object is essentially an array with an index).

It may look a bit like a dataframe, but remember that in a dataframe each column corresponds to a unique feature. Here, what looks like the second column is actually a second level for the index. We'll talk about multi-index shortly.

The result as it stands may not be the easiest to read. It would be better if we could transform it back into a dataframe so that we could compare the results for two countries more easily. In Pandas, you can do that with one command: unstack.

grouped_by_country['BloodType'].apply(Counter).unstack()
A+ A- AB+ AB- B+ B- O+ O-
Country
Australia 468.0 59.0 100.0 8.0 419.0 24.0 648.0 74.0
France 145.0 17.0 19.0 1.0 106.0 4.0 184.0 24.0
Iceland 45.0 6.0 5.0 NaN 44.0 5.0 86.0 9.0
Italy 207.0 31.0 37.0 5.0 185.0 4.0 299.0 32.0
New Zealand 154.0 9.0 21.0 5.0 100.0 5.0 185.0 21.0
United Kingdom 132.0 23.0 25.0 4.0 102.0 7.0 188.0 19.0
United States 157.0 25.0 35.0 4.0 157.0 15.0 273.0 34.0

If we want to switch the index and the columns around, we need to specify the level parameter in unstack() which, by default, is -1, that is the last (innermost) level of the index.

grouped_by_country['BloodType'].apply(Counter).unstack(level=0)
Country Australia France Iceland Italy New Zealand United Kingdom United States
A+ 468.0 145.0 45.0 207.0 154.0 132.0 157.0
A- 59.0 17.0 6.0 31.0 9.0 23.0 25.0
AB+ 100.0 19.0 5.0 37.0 21.0 25.0 35.0
AB- 8.0 1.0 NaN 5.0 5.0 4.0 4.0
B+ 419.0 106.0 44.0 185.0 100.0 102.0 157.0
B- 24.0 4.0 5.0 4.0 5.0 7.0 15.0
O+ 648.0 184.0 86.0 299.0 185.0 188.0 273.0
O- 74.0 24.0 9.0 32.0 21.0 19.0 34.0

The level can also be specified by name

grouped_by_country['BloodType'].apply(Counter).unstack(level='Country')
Country Australia France Iceland Italy New Zealand United Kingdom United States
A+ 468.0 145.0 45.0 207.0 154.0 132.0 157.0
A- 59.0 17.0 6.0 31.0 9.0 23.0 25.0
AB+ 100.0 19.0 5.0 37.0 21.0 25.0 35.0
AB- 8.0 1.0 NaN 5.0 5.0 4.0 4.0
B+ 419.0 106.0 44.0 185.0 100.0 102.0 157.0
B- 24.0 4.0 5.0 4.0 5.0 7.0 15.0
O+ 648.0 184.0 86.0 299.0 185.0 188.0 273.0
O- 74.0 24.0 9.0 32.0 21.0 19.0 34.0

Note that we have a bunch of NaN in our dataframe. This indicates that no one from Iceland has blood type AB- in our dataset.

Since there was no such data in our dataset, it appears as a missing value.

However, here, a value of 0 would be more appropriate. We can tell the dataframe to replace its missing values by 0 using the fillna method.

Since we’re dealing with count data, it’s also a good idea to convert the type to int.

grouped_by_country['BloodType'].apply(Counter).unstack().fillna(0).astype(int)
A+ A- AB+ AB- B+ B- O+ O-
Country
Australia 468 59 100 8 419 24 648 74
France 145 17 19 1 106 4 184 24
Iceland 45 6 5 0 44 5 86 9
Italy 207 31 37 5 185 4 299 32
New Zealand 154 9 21 5 100 5 185 21
United Kingdom 132 23 25 4 102 7 188 19
United States 157 25 35 4 157 15 273 34

Remark:

We are using fake data so don't give too much credence to these results.

This being said, blood type frequencies do vary across countries and you can read more about it here.
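
As an aside, a similar table can be built without Counter by calling value_counts on the grouped series (a sketch; whether zero counts appear automatically depends on BloodType being categorical and on your pandas version):

# groupby + value_counts yields one count per (Country, BloodType) pair,
# which unstack then pivots into the same wide layout.
(grouped_by_country['BloodType']
     .value_counts()
     .unstack(fill_value=0))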

C. GroupBy on multiple features.

Of course, GroupBy operations aren’t limited to single features.

If we want to see the data grouped by country and blood type, we just need to specify both features when constructing the GroupBy object.

Note that with a groupby on multiple features we can obtain the previous result in a different way.

data.groupby(['Country','BloodType']).count().iloc[:, 0].unstack().fillna(0).astype(int)
BloodType A+ A- AB+ AB- B+ B- O+ O-
Country
Australia 468 59 100 8 419 24 648 74
France 145 17 19 1 106 4 184 24
Iceland 45 6 5 0 44 5 86 9
Italy 207 31 37 5 185 4 299 32
New Zealand 154 9 21 5 100 5 185 21
United Kingdom 132 23 25 4 102 7 188 19
United States 157 25 35 4 157 15 273 34

Let’s group our data based on country and gender.

grouped_by_country_and_gender = data.groupby(['Country', 'Gender'])

grouped_by_country_and_gender['BloodType'].apply(Counter).unstack()
A+ A- AB+ AB- B+ B- O+ O-
Country Gender
Australia female 229.0 29.0 54.0 4.0 205.0 13.0 328.0 41.0
male 239.0 30.0 46.0 4.0 214.0 11.0 320.0 33.0
France female 76.0 11.0 10.0 NaN 47.0 2.0 79.0 16.0
male 69.0 6.0 9.0 1.0 59.0 2.0 105.0 8.0
Iceland female 22.0 5.0 2.0 NaN 17.0 4.0 34.0 8.0
male 23.0 1.0 3.0 NaN 27.0 1.0 52.0 1.0
Italy female 102.0 14.0 22.0 1.0 89.0 1.0 160.0 17.0
male 105.0 17.0 15.0 4.0 96.0 3.0 139.0 15.0
New Zealand female 75.0 5.0 8.0 1.0 53.0 2.0 91.0 9.0
male 79.0 4.0 13.0 4.0 47.0 3.0 94.0 12.0
United Kingdom female 65.0 6.0 14.0 1.0 55.0 4.0 99.0 13.0
male 67.0 17.0 11.0 3.0 47.0 3.0 89.0 6.0
United States female 89.0 13.0 18.0 3.0 77.0 7.0 138.0 12.0
male 68.0 12.0 17.0 1.0 80.0 8.0 135.0 22.0

Once again let’s replace missing values by 0.

grouped_by_country_and_gender['BloodType'].apply(Counter).unstack().fillna(0).astype(int)
A+ A- AB+ AB- B+ B- O+ O-
Country Gender
Australia female 229 29 54 4 205 13 328 41
male 239 30 46 4 214 11 320 33
France female 76 11 10 0 47 2 79 16
male 69 6 9 1 59 2 105 8
Iceland female 22 5 2 0 17 4 34 8
male 23 1 3 0 27 1 52 1
Italy female 102 14 22 1 89 1 160 17
male 105 17 15 4 96 3 139 15
New Zealand female 75 5 8 1 53 2 91 9
male 79 4 13 4 47 3 94 12
United Kingdom female 65 6 14 1 55 4 99 13
male 67 17 11 3 47 3 89 6
United States female 89 13 18 3 77 7 138 12
male 68 12 17 1 80 8 135 22
grouped_by_country_and_gender.mean()
Age
Country Gender
Australia female 58.101883
male 57.531773
France female 59.207469
male 56.768340
Iceland female 57.315217
male 57.953704
Italy female 57.300493
male 55.893401
New Zealand female 55.823770
male 57.398438
United Kingdom female 57.284047
male 58.222222
United States female 57.299720
male 57.536443

Notice how we didn’t specify the Age feature. Pandas automatically outputs results for which the mean function makes sense. Here only Age fits the bill.
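
A caveat: this silent dropping of non-numeric columns has been deprecated, and in recent pandas releases you need to be explicit (a version-dependent sketch):

# In newer pandas versions, restrict the aggregation to numeric columns explicitly...
grouped_by_country_and_gender.mean(numeric_only=True)

# ...or select the column(s) of interest up front.
grouped_by_country_and_gender[['Age']].mean()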

As with any dataframe, you can also use apply to map a function to the grouped data; however, for anything more complex, like applying multiple functions at once, the agg method is more convenient.

grouped_by_country_and_gender['Age'].agg(['min','max','mean','std'])
min max mean std
Country Gender
Australia female 24 91 58.101883 19.587278
male 24 91 57.531773 19.394326
France female 25 91 59.207469 19.041143
male 24 91 56.768340 19.961407
Iceland female 24 91 57.315217 19.527343
male 24 91 57.953704 19.490896
Italy female 24 91 57.300493 19.116494
male 24 90 55.893401 18.838508
New Zealand female 24 91 55.823770 19.546233
male 24 91 57.398438 18.570202
United Kingdom female 24 91 57.284047 20.168651
male 24 91 58.222222 18.856132
United States female 24 91 57.299720 19.167134
male 24 91 57.536443 19.434197

For descriptive statistics of numerical data, the quantile function is also useful.

Remark:

The 0.5 quantile is the median of our data.

grouped_by_country_and_gender['Age'].quantile((0.10, 0.25, 0.5, 0.75, 0.90)).unstack()
0.1 0.25 0.5 0.75 0.9
Country Gender
Australia female 31.0 41.00 58.0 75.00 85.0
male 30.0 41.00 58.0 74.00 85.0
France female 31.0 44.00 61.0 76.00 84.0
male 30.0 37.00 57.0 75.00 83.0
Iceland female 32.1 40.75 59.0 74.00 83.9
male 32.0 42.00 56.5 74.00 85.0
Italy female 31.0 41.00 57.0 73.00 85.0
male 31.0 39.00 55.0 71.00 84.0
New Zealand female 29.3 39.00 54.0 72.25 83.0
male 32.0 41.00 58.0 72.00 83.5
United Kingdom female 29.6 39.00 57.0 75.00 85.0
male 30.2 43.00 59.0 73.00 83.0
United States female 31.0 41.00 57.0 74.00 84.0
male 30.0 40.50 58.0 74.00 83.0

How can we use the grouped_by_country_and_gender object to output the number of people who were born in a given month (numbered 1 to 12), for each country, and for each gender?

Hint: This can be done in one line using apply and value_counts, but you need to make sure you keep track of the type of object you are dealing with at each stage of the pipeline.

# Method 1 - one liner 
(grouped_by_country_and_gender['Birthday']
            .apply(lambda x: x.dt.month.value_counts())
            .unstack()
            .fillna(0))
1 2 3 4 5 6 7 8 9 10 11 12
Country Gender
Australia female 83 75 70 66 71 78 90 77 70 65 73 85
male 69 64 74 71 74 57 77 91 80 81 81 78
France female 26 15 18 13 30 22 22 19 11 21 21 23
male 22 15 22 28 19 22 26 23 26 22 16 18
Iceland female 7 11 3 6 5 10 10 12 9 4 7 8
male 7 11 15 2 8 9 11 6 12 10 9 8
Italy female 31 34 39 33 36 28 33 36 36 37 34 29
male 31 28 35 41 27 28 37 33 29 33 33 39
New Zealand female 18 20 19 21 22 26 26 15 14 23 23 17
male 21 22 16 22 30 19 16 25 18 25 23 19
United Kingdom female 17 21 19 22 14 20 31 22 19 24 34 14
male 21 16 22 26 20 21 23 18 17 21 19 19
United States female 38 32 35 24 37 25 25 25 32 31 21 32
male 44 30 23 25 23 29 31 29 26 28 26 29
# Method 2 - Starting from the original data
# create a month series first and use it to perform groupby

months = data.Birthday.dt.month
(data
   .groupby(['Country', 'Gender', months])['Age']
   .count()
   .unstack()
   .fillna(0))
Birthday 1 2 3 4 5 6 7 8 9 10 11 12
Country Gender
Australia female 83 75 70 66 71 78 90 77 70 65 73 85
male 69 64 74 71 74 57 77 91 80 81 81 78
France female 26 15 18 13 30 22 22 19 11 21 21 23
male 22 15 22 28 19 22 26 23 26 22 16 18
Iceland female 7 11 3 6 5 10 10 12 9 4 7 8
male 7 11 15 2 8 9 11 6 12 10 9 8
Italy female 31 34 39 33 36 28 33 36 36 37 34 29
male 31 28 35 41 27 28 37 33 29 33 33 39
New Zealand female 18 20 19 21 22 26 26 15 14 23 23 17
male 21 22 16 22 30 19 16 25 18 25 23 19
United Kingdom female 17 21 19 22 14 20 31 22 19 24 34 14
male 21 16 22 26 20 21 23 18 17 21 19 19
United States female 38 32 35 24 37 25 25 25 32 31 21 32
male 44 30 23 25 23 29 31 29 26 28 26 29
# Note that although we've used 'Age' here, any feature would have done. 
# Do you see why?

# Method 3 - Starting from the original data and assign a new month column
(data
   .assign(month=data.Birthday.dt.month)
   .groupby(['Country', 'Gender', 'month'])['Age']
   .count()
   .unstack()
   .fillna(0))
month 1 2 3 4 5 6 7 8 9 10 11 12
Country Gender
Australia female 83 75 70 66 71 78 90 77 70 65 73 85
male 69 64 74 71 74 57 77 91 80 81 81 78
France female 26 15 18 13 30 22 22 19 11 21 21 23
male 22 15 22 28 19 22 26 23 26 22 16 18
Iceland female 7 11 3 6 5 10 10 12 9 4 7 8
male 7 11 15 2 8 9 11 6 12 10 9 8
Italy female 31 34 39 33 36 28 33 36 36 37 34 29
male 31 28 35 41 27 28 37 33 29 33 33 39
New Zealand female 18 20 19 21 22 26 26 15 14 23 23 17
male 21 22 16 22 30 19 16 25 18 25 23 19
United Kingdom female 17 21 19 22 14 20 31 22 19 24 34 14
male 21 16 22 26 20 21 23 18 17 21 19 19
United States female 38 32 35 24 37 25 25 25 32 31 21 32
male 44 30 23 25 23 29 31 29 26 28 26 29

D. MultiIndex.

We can of course group the data by more than 2 features.

However, results might get a bit harder to take in.

s = data.groupby(['Country','Gender','BloodType'])['Age'].mean().round(2)
s
Country         Gender  BloodType
Australia       female  A+           59.36
                        A-           59.90
                        AB+          53.13
                        AB-          61.50
                        B+           56.62
                        B-           66.00
                        O+           58.46
                        O-           58.02
                male    A+           57.67
                        A-           51.70
                        AB+          59.98
                        AB-          62.00
                        B+           58.79
                        B-           44.73
                        O+           56.94
                        O-           59.76
France          female  A+           58.54
                        A-           53.09
                        AB+          63.40
                        B+           58.47
                        B-           61.50
                        O+           60.18
                        O-           61.06
                male    A+           60.81
                        A-           56.00
                        AB+          62.78
                        AB-          41.00
                        B+           55.03
                        B-           67.00
                        O+           53.90
                                     ...  
United Kingdom  female  AB+          49.79
                        AB-          63.00
                        B+           55.33
                        B-           65.50
                        O+           57.16
                        O-           63.23
                male    A+           57.39
                        A-           60.41
                        AB+          53.91
                        AB-          65.33
                        B+           59.68
                        B-           32.67
                        O+           58.37
                        O-           64.83
United States   female  A+           57.46
                        A-           71.54
                        AB+          58.44
                        AB-          53.67
                        B+           55.75
                        B-           65.57
                        O+           57.57
                        O-           41.83
                male    A+           55.07
                        A-           47.08
                        AB+          55.12
                        AB-          66.00
                        B+           54.34
                        B-           71.25
                        O+           61.33
                        O-           55.73
Name: Age, Length: 109, dtype: float64

Remark:

s is a Pandas Series object with a MultiIndex object as its index.

s.index
MultiIndex(levels=[['Australia', 'France', 'Iceland', 'Italy', 'New Zealand', 'United Kingdom', 'United States'], ['female', 'male'], ['A+', 'A-', 'AB+', 'AB-', 'B+', 'B-', 'O+', 'O-']],
           codes=[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6], [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1], [0, 1, 2, 3, 4, 5, 6, 7, 0, 1, 2, 3, 4, 5, 6, 7, 0, 1, 2, 4, 5, 6, 7, 0, 1, 2, 3, 4, 5, 6, 7, 0, 1, 2, 4, 5, 6, 7, 0, 1, 2, 4, 5, 6, 7, 0, 1, 2, 3, 4, 5, 6, 7, 0, 1, 2, 3, 4, 5, 6, 7, 0, 1, 2, 3, 4, 5, 6, 7, 0, 1, 2, 3, 4, 5, 6, 7, 0, 1, 2, 3, 4, 5, 6, 7, 0, 1, 2, 3, 4, 5, 6, 7, 0, 1, 2, 3, 4, 5, 6, 7, 0, 1, 2, 3, 4, 5, 6, 7]],
           names=['Country', 'Gender', 'BloodType'])

We can convert the MultiIndexed series to a MultiIndexed dataframe using unstack.

df = s.unstack().fillna(0)
df
BloodType A+ A- AB+ AB- B+ B- O+ O-
Country Gender
Australia female 59.36 59.90 53.13 61.50 56.62 66.00 58.46 58.02
male 57.67 51.70 59.98 62.00 58.79 44.73 56.94 59.76
France female 58.54 53.09 63.40 0.00 58.47 61.50 60.18 61.06
male 60.81 56.00 62.78 41.00 55.03 67.00 53.90 65.62
Iceland female 56.95 63.40 63.00 0.00 60.06 67.25 53.15 60.00
male 62.09 35.00 57.00 0.00 59.37 74.00 55.13 81.00
Italy female 55.68 55.36 68.50 46.00 56.07 57.00 57.96 55.12
male 54.45 54.47 62.60 44.75 54.82 66.00 57.32 55.47
New Zealand female 57.49 59.20 50.50 67.00 58.19 45.00 53.22 58.33
male 60.39 74.75 65.08 64.00 53.96 52.33 54.68 57.42
United Kingdom female 58.42 63.17 49.79 63.00 55.33 65.50 57.16 63.23
male 57.39 60.41 53.91 65.33 59.68 32.67 58.37 64.83
United States female 57.46 71.54 58.44 53.67 55.75 65.57 57.57 41.83
male 55.07 47.08 55.12 66.00 54.34 71.25 61.33 55.73

To access specific data using the MultiIndex for a Series, you can use the xs method and specify the level at which you are extracting the data, if you are not using the top level.

To access specific data using the MultiIndex for a DataFrame, you can use the loc method, but specifying the level is a bit more involved.

Top level selection

s['Australia']
Gender  BloodType
female  A+           59.36
        A-           59.90
        AB+          53.13
        AB-          61.50
        B+           56.62
        B-           66.00
        O+           58.46
        O-           58.02
male    A+           57.67
        A-           51.70
        AB+          59.98
        AB-          62.00
        B+           58.79
        B-           44.73
        O+           56.94
        O-           59.76
Name: Age, dtype: float64
df.loc['Australia', :]
BloodType A+ A- AB+ AB- B+ B- O+ O-
Gender
female 59.36 59.9 53.13 61.5 56.62 66.00 58.46 58.02
male 57.67 51.7 59.98 62.0 58.79 44.73 56.94 59.76

This is equivalent to:

s['Australia'].unstack()
BloodType A+ A- AB+ AB- B+ B- O+ O-
Gender
female 59.36 59.9 53.13 61.5 56.62 66.00 58.46 58.02
male 57.67 51.7 59.98 62.0 58.79 44.73 56.94 59.76

The levels can be exchanged either by unstacking the right level, or by transposing the resulting dataframe.

s['Australia'].unstack(level='Gender')
Gender female male
BloodType
A+ 59.36 57.67
A- 59.90 51.70
AB+ 53.13 59.98
AB- 61.50 62.00
B+ 56.62 58.79
B- 66.00 44.73
O+ 58.46 56.94
O- 58.02 59.76
s['Australia'].unstack().T
Gender female male
BloodType
A+ 59.36 57.67
A- 59.90 51.70
AB+ 53.13 59.98
AB- 61.50 62.00
B+ 56.62 58.79
B- 66.00 44.73
O+ 58.46 56.94
O- 58.02 59.76

To select data at a deeper level of the multiindex, you need to specify the level.

Here’s an example with Gender.

Notice the different syntax using xs.

s.xs('female', level='Gender').unstack()
BloodType A+ A- AB+ AB- B+ B- O+ O-
Country
Australia 59.36 59.90 53.13 61.50 56.62 66.00 58.46 58.02
France 58.54 53.09 63.40 NaN 58.47 61.50 60.18 61.06
Iceland 56.95 63.40 63.00 NaN 60.06 67.25 53.15 60.00
Italy 55.68 55.36 68.50 46.00 56.07 57.00 57.96 55.12
New Zealand 57.49 59.20 50.50 67.00 58.19 45.00 53.22 58.33
United Kingdom 58.42 63.17 49.79 63.00 55.33 65.50 57.16 63.23
United States 57.46 71.54 58.44 53.67 55.75 65.57 57.57 41.83

Selecting the data from a deeper level is trickier with a dataframe.

We need to specify an index slice.

df.loc[(slice(None), 'female'), :].droplevel('Gender')
BloodType A+ A- AB+ AB- B+ B- O+ O-
Country
Australia 59.36 59.90 53.13 61.50 56.62 66.00 58.46 58.02
France 58.54 53.09 63.40 0.00 58.47 61.50 60.18 61.06
Iceland 56.95 63.40 63.00 0.00 60.06 67.25 53.15 60.00
Italy 55.68 55.36 68.50 46.00 56.07 57.00 57.96 55.12
New Zealand 57.49 59.20 50.50 67.00 58.19 45.00 53.22 58.33
United Kingdom 58.42 63.17 49.79 63.00 55.33 65.50 57.16 63.23
United States 57.46 71.54 58.44 53.67 55.75 65.57 57.57 41.83

or alternatively, using a pandas IndexSlice object.

idx = pd.IndexSlice
df.loc[idx[:, 'female'], :].droplevel('Gender')
BloodType A+ A- AB+ AB- B+ B- O+ O-
Country
Australia 59.36 59.90 53.13 61.50 56.62 66.00 58.46 58.02
France 58.54 53.09 63.40 0.00 58.47 61.50 60.18 61.06
Iceland 56.95 63.40 63.00 0.00 60.06 67.25 53.15 60.00
Italy 55.68 55.36 68.50 46.00 56.07 57.00 57.96 55.12
New Zealand 57.49 59.20 50.50 67.00 58.19 45.00 53.22 58.33
United Kingdom 58.42 63.17 49.79 63.00 55.33 65.50 57.16 63.23
United States 57.46 71.54 58.44 53.67 55.75 65.57 57.57 41.83

The same applies if you want to select BloodType as the level.

s.xs('O+', level='BloodType').unstack().fillna(0)
Gender female male
Country
Australia 58.46 56.94
France 60.18 53.90
Iceland 53.15 55.13
Italy 57.96 57.32
New Zealand 53.22 54.68
United Kingdom 57.16 58.37
United States 57.57 61.33
df.loc[:, ['O+']].unstack().droplevel('BloodType', axis='columns')
Gender female male
Country
Australia 58.46 56.94
France 60.18 53.90
Iceland 53.15 55.13
Italy 57.96 57.32
New Zealand 53.22 54.68
United Kingdom 57.16 58.37
United States 57.57 61.33

Multiple levels can be specified at once

s.xs(('Australia','female'), level=['Country','Gender'])
BloodType
A+     59.36
A-     59.90
AB+    53.13
AB-    61.50
B+     56.62
B-     66.00
O+     58.46
O-     58.02
Name: Age, dtype: float64
df.loc[('Australia', 'female'), :]
BloodType
A+     59.36
A-     59.90
AB+    53.13
AB-    61.50
B+     56.62
B-     66.00
O+     58.46
O-     58.02
Name: (Australia, female), dtype: float64
s.xs(('Australia','O+'), level=['Country','BloodType'])
Gender
female    58.46
male      56.94
Name: Age, dtype: float64
df.loc['Australia', 'O+']
Gender
female    58.46
male      56.94
Name: O+, dtype: float64

Let’s visualise some of these results

# colormap from matplotlib
from matplotlib import cm

# Also see this stackoverflow post for the legend placement: 
# http://stackoverflow.com/questions/23556153/how-to-put-legend-outside-the-plot-with-pandas
ax = (s.xs('female', level='Gender')
       .unstack()
       .fillna(0)
       .plot(kind='bar', 
             figsize=(15,5), 
             width=0.8, 
             alpha=0.8, 
             ec='k',
             rot=0,
             colormap=cm.Paired_r)
     )
ax.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))
plt.show()

[Figure: mean age of females for each blood type, with countries on the x-axis]

Notice the difference between the previous plot and the one below.

ax = (s.xs('female', level='Gender')
       .unstack('Country')
       .fillna(0)
       .plot(kind='bar', 
             figsize=(15,5), 
             width=0.8, 
             alpha=0.8, 
             ec='k',
             rot=0,
             colormap=cm.Paired_r)
     )
ax.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))
plt.show()

[Figure: the same data with blood types on the x-axis and one bar per country]

E. Fine Tuning.

So far we’ve applied an operation on a single feature of a GroupBy object.

For instance, we’ve computed the mean for the Age feature, and then, in a separate computation we’ve computed a count of the blood types.

You can actually do everything at once by specifying a different operation for each feature using a dictionary, and using the agg method to aggregate the data across the dataframe.

group = data.groupby(['Country', 'Gender'])

def data_count(x):
    return Counter(x).items()

method = {'Age': [np.mean, np.std], 'BloodType': data_count}

group.agg(method).rename({'data_count':'group_counts'}, axis=1)
Age BloodType
mean std group_counts
Country Gender
Australia female 58.101883 19.587278 ((B+, 205), (O-, 41), (AB+, 54), (O+, 328), (A...
male 57.531773 19.394326 ((B+, 214), (AB+, 46), (A+, 239), (O+, 320), (...
France female 59.207469 19.041143 ((A-, 11), (B+, 47), (A+, 76), (O+, 79), (O-, ...
male 56.768340 19.961407 ((B+, 59), (A+, 69), (AB+, 9), (O+, 105), (O-,...
Iceland female 57.315217 19.527343 ((O+, 34), (A+, 22), (B-, 4), (B+, 17), (A-, 5...
male 57.953704 19.490896 ((O+, 52), (A+, 23), (AB+, 3), (B+, 27), (O-, ...
Italy female 57.300493 19.116494 ((A+, 102), (O+, 160), (B+, 89), (O-, 17), (AB...
male 55.893401 18.838508 ((A+, 105), (O+, 139), (B+, 96), (AB-, 4), (B-...
New Zealand female 55.823770 19.546233 ((O-, 9), (O+, 91), (A+, 75), (AB+, 8), (B+, 5...
male 57.398438 18.570202 ((A+, 79), (O+, 94), (AB+, 13), (B+, 47), (O-,...
United Kingdom female 57.284047 20.168651 ((A+, 65), (B+, 55), (O-, 13), (O+, 99), (AB+,...
male 58.222222 18.856132 ((A+, 67), (B+, 47), (O-, 6), (A-, 17), (O+, 8...
United States female 57.299720 19.167134 ((A+, 89), (B+, 77), (O+, 138), (AB+, 18), (A-...
male 57.536443 19.434197 ((A+, 68), (O+, 135), (B+, 80), (A-, 12), (AB+...

Accessing a particular portion of the result is just a matter of navigating the multilevel structure of the index and the columns.

group.agg(method).loc[[('Australia', 'male')]]
Age BloodType
mean std data_count
Country Gender
Australia male 57.531773 19.394326 ((B+, 214), (AB+, 46), (A+, 239), (O+, 320), (...

This is equivalent to

group.agg(method).xs(('Australia', 'male'), level=['Country', 'Gender'])
Age BloodType
mean std data_count
Country Gender
Australia male 57.531773 19.394326 ((B+, 214), (AB+, 46), (A+, 239), (O+, 320), (...

Notice the difference with

group.agg(method).xs(('Australia', 'male'))
Age        mean                                                    57.5318
           std                                                     19.3943
BloodType  data_count    ((B+, 214), (AB+, 46), (A+, 239), (O+, 320), (...
Name: (Australia, male), dtype: object

We can select data using levels of the columns.

group.agg(method).loc[:,[('Age', 'mean')]]
Age
mean
Country Gender
Australia female 58.101883
male 57.531773
France female 59.207469
male 56.768340
Iceland female 57.315217
male 57.953704
Italy female 57.300493
male 55.893401
New Zealand female 55.823770
male 57.398438
United Kingdom female 57.284047
male 58.222222
United States female 57.299720
male 57.536443

This is equivalent to

group.agg(method).xs([('Age', 'mean')], axis=1)
Age
mean
Country Gender
Australia female 58.101883
male 57.531773
France female 59.207469
male 56.768340
Iceland female 57.315217
male 57.953704
Italy female 57.300493
male 55.893401
New Zealand female 55.823770
male 57.398438
United Kingdom female 57.284047
male 58.222222
United States female 57.299720
male 57.536443

You can go as “deep” as you need.

group.agg(method).loc['Australia', 'female']['Age']['mean']
58.10188261351052

This is equivalent to

group.agg(method).xs(('Australia', 'female'))['Age']['mean']
58.10188261351052

To extract just the values (without the index), use the values attribute.

group.agg(method).loc['Australia', 'female']['BloodType'].values
array([dict_items([('B+', 205), ('O-', 41), ('AB+', 54), ('O+', 328), ('A+', 229), ('A-', 29), ('B-', 13), ('AB-', 4)])],
      dtype=object)

Cross-tabulation is an effective way to have a quick look at how the data is distributed across categories.

pd.crosstab(data.Gender, data.BloodType, margins=True)
BloodType A+ A- AB+ AB- B+ B- O+ O- All
Gender
female 658 83 128 10 543 33 929 116 2500
male 650 87 114 17 570 31 934 97 2500
All 1308 170 242 27 1113 64 1863 213 5000
pd.crosstab([data.Country, data.Gender], data.BloodType, margins=True)
BloodType A+ A- AB+ AB- B+ B- O+ O- All
Country Gender
Australia female 229 29 54 4 205 13 328 41 903
male 239 30 46 4 214 11 320 33 897
France female 76 11 10 0 47 2 79 16 241
male 69 6 9 1 59 2 105 8 259
Iceland female 22 5 2 0 17 4 34 8 92
male 23 1 3 0 27 1 52 1 108
Italy female 102 14 22 1 89 1 160 17 406
male 105 17 15 4 96 3 139 15 394
New Zealand female 75 5 8 1 53 2 91 9 244
male 79 4 13 4 47 3 94 12 256
United Kingdom female 65 6 14 1 55 4 99 13 257
male 67 17 11 3 47 3 89 6 243
United States female 89 13 18 3 77 7 138 12 357
male 68 12 17 1 80 8 135 22 343
All 1308 170 242 27 1113 64 1863 213 5000
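
crosstab can also normalise the counts into proportions, which makes groups of different sizes easier to compare (an illustrative option; see the pandas documentation for the other normalize modes):

# normalize='index' turns each row into proportions that sum to 1.
pd.crosstab(data.Gender, data.BloodType, normalize='index').round(3)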

F. GroupBy on continuous features.

We’ve seen how we can group the data according to the features that are in the original dataset.

For instance, grouping the data by country is trivial. Same for the blood type because these are categorical data.

But what about Age? If we do a groupby without thinking too much about it we get something rather useless.

g = data.groupby('Age')
g['BloodType'].describe().unstack()
       Age
count  24     37
       25     72
       26     73
       27     86
       28     69
       29     81
       30     71
       31     86
       32     73
       33     78
       34     54
       35     79
       36     70
       37     88
       38     70
       39     81
       40     62
       41     75
       42     66
       43     70
       44     67
       45     80
       46     85
       47     65
       48     77
       49     88
       50     77
       51     75
       52     83
       53     86
              ..
freq   62     28
       63     41
       64     26
       65     26
       66     38
       67     28
       68     35
       69     36
       70     25
       71     25
       72     39
       73     21
       74     32
       75     39
       76     21
       77     20
       78     33
       79     32
       80     30
       81     21
       82     34
       83     27
       84     25
       85     25
       86     28
       87     32
       88     23
       89     30
       90     24
       91     18
Length: 272, dtype: object

We get one entry for each year of Age, which isn’t what we want.

Ideally, we’d want to split the data into age groups and use that to group the data. Pandas has a cut function that allows us to do that.

data.Age.describe()
count    5000.000000
mean       57.447600
std        19.339422
min        24.000000
25%        41.000000
50%        57.000000
75%        74.000000
max        91.000000
Name: Age, dtype: float64
bins = np.arange(20, 100, 10)
bins
array([20, 30, 40, 50, 60, 70, 80, 90])
labels = pd.cut(data.Age, bins)
labels[:5]
0    (50, 60]
1    (50, 60]
2    (50, 60]
3    (80, 90]
4    (70, 80]
Name: Age, dtype: category
Categories (7, interval[int64]): [(20, 30] < (30, 40] < (40, 50] < (50, 60] < (60, 70] < (70, 80] < (80, 90]]
data.Age.head()
0    51
1    57
2    55
3    86
4    73
Name: Age, dtype: int64

We can use the newly created labels to group the data by age interval.

grouped = data.groupby(['Country', labels])

grouped.size().unstack().fillna(0)
Age             (20, 30]  (30, 40]  (40, 50]  (50, 60]  (60, 70]  (70, 80]  \
Country                                                                      
Australia            177       254       269       275       257       280   
France                48        76        68        71        75        76   
Iceland               16        32        36        24        30        26   
Italy                 74       131       120       134       128        96   
New Zealand           49        76        81        70        93        57   
United Kingdom        54        69        77        64        72        91   
United States         71       103        99       111       102       115   

Age             (80, 90]  
Country                   
Australia            272  
France                82  
Iceland               34  
Italy                111  
New Zealand           71  
United Kingdom        69  
United States         94  

If we want the Age groups to be our index we can use unstack(0) or, better still because it’s more explicit, unstack('Country').

grouped.size().unstack('Country').fillna(0)
Country Australia France Iceland Italy New Zealand United Kingdom United States
Age
(20, 30] 177 48 16 74 49 54 71
(30, 40] 254 76 32 131 76 69 103
(40, 50] 269 68 36 120 81 77 99
(50, 60] 275 71 24 134 70 64 111
(60, 70] 257 75 30 128 93 72 102
(70, 80] 280 76 26 96 57 91 115
(80, 90] 272 82 34 111 71 69 94

Just like before, we can pass more features to the groupby method.

grouped = data.groupby(['Country', 'BloodType', labels])
grouped.size().unstack('BloodType').fillna(0)
BloodType A+ A- AB+ AB- B+ B- O+ O-
Country Age
Australia (20, 30] 41.0 11.0 13.0 0.0 49.0 3.0 53.0 7.0
(30, 40] 63.0 8.0 14.0 2.0 57.0 4.0 97.0 9.0
(40, 50] 73.0 9.0 16.0 2.0 56.0 2.0 100.0 11.0
(50, 60] 74.0 5.0 16.0 0.0 67.0 2.0 100.0 11.0
(60, 70] 60.0 7.0 10.0 0.0 55.0 8.0 106.0 11.0
(70, 80] 85.0 7.0 15.0 1.0 57.0 4.0 101.0 10.0
(80, 90] 65.0 11.0 16.0 3.0 77.0 1.0 84.0 15.0
France (20, 30] 17.0 1.0 1.0 0.0 9.0 0.0 19.0 1.0
(30, 40] 17.0 5.0 1.0 0.0 22.0 0.0 28.0 3.0
(40, 50] 17.0 3.0 5.0 1.0 12.0 0.0 27.0 3.0
(50, 60] 18.0 1.0 1.0 0.0 14.0 2.0 32.0 3.0
(60, 70] 23.0 1.0 3.0 0.0 18.0 1.0 25.0 4.0
(70, 80] 22.0 5.0 4.0 0.0 14.0 0.0 26.0 5.0
(80, 90] 30.0 1.0 4.0 0.0 17.0 1.0 24.0 5.0
Iceland (20, 30] 4.0 0.0 0.0 0.0 2.0 0.0 10.0 0.0
(30, 40] 7.0 2.0 1.0 0.0 5.0 1.0 15.0 1.0
(40, 50] 5.0 1.0 1.0 0.0 10.0 0.0 17.0 2.0
(50, 60] 8.0 0.0 1.0 0.0 3.0 0.0 11.0 1.0
(60, 70] 5.0 1.0 0.0 0.0 10.0 0.0 12.0 2.0
(70, 80] 7.0 0.0 1.0 0.0 6.0 3.0 9.0 0.0
(80, 90] 9.0 2.0 1.0 0.0 8.0 1.0 10.0 3.0
Italy (20, 30] 19.0 5.0 0.0 1.0 23.0 0.0 22.0 4.0
(30, 40] 39.0 8.0 4.0 1.0 29.0 0.0 45.0 5.0
(40, 50] 36.0 4.0 4.0 2.0 27.0 1.0 40.0 6.0
(50, 60] 27.0 2.0 5.0 0.0 32.0 1.0 62.0 5.0
(60, 70] 33.0 2.0 8.0 1.0 27.0 1.0 52.0 4.0
(70, 80] 33.0 4.0 6.0 0.0 22.0 0.0 29.0 2.0
(80, 90] 19.0 5.0 8.0 0.0 25.0 1.0 47.0 6.0
New Zealand (20, 30] 13.0 0.0 0.0 0.0 11.0 1.0 22.0 2.0
(30, 40] 19.0 1.0 4.0 1.0 17.0 1.0 29.0 4.0
(40, 50] 17.0 0.0 2.0 0.0 18.0 1.0 39.0 4.0
(50, 60] 24.0 2.0 4.0 1.0 8.0 1.0 29.0 1.0
(60, 70] 33.0 3.0 6.0 1.0 21.0 0.0 26.0 3.0
(70, 80] 24.0 1.0 3.0 0.0 10.0 1.0 17.0 1.0
(80, 90] 23.0 2.0 2.0 2.0 15.0 0.0 21.0 6.0
United Kingdom (20, 30] 12.0 2.0 2.0 0.0 14.0 1.0 21.0 2.0
(30, 40] 21.0 2.0 7.0 0.0 13.0 3.0 23.0 0.0
(40, 50] 22.0 2.0 5.0 0.0 14.0 0.0 31.0 3.0
(50, 60] 14.0 3.0 3.0 1.0 14.0 1.0 24.0 4.0
(60, 70] 15.0 7.0 4.0 2.0 11.0 0.0 31.0 2.0
(70, 80] 28.0 3.0 2.0 1.0 24.0 1.0 28.0 4.0
(80, 90] 19.0 4.0 2.0 0.0 10.0 1.0 29.0 4.0
United States (20, 30] 21.0 3.0 4.0 0.0 16.0 1.0 21.0 5.0
(30, 40] 21.0 1.0 6.0 0.0 26.0 1.0 41.0 7.0
(40, 50] 23.0 3.0 3.0 1.0 30.0 1.0 31.0 7.0
(50, 60] 27.0 6.0 7.0 1.0 25.0 1.0 40.0 4.0
(60, 70] 17.0 6.0 5.0 2.0 26.0 2.0 39.0 5.0
(70, 80] 22.0 1.0 4.0 0.0 18.0 5.0 61.0 4.0
(80, 90] 24.0 5.0 6.0 0.0 15.0 3.0 39.0 2.0

The End…