Polyglot Data Wrangling

SQL vs Python vs R

In this post we will look at how basic data wrangling operations can be performed in three languages commonly used in data science:

  • Structured Query Language (SQL)
  • Python
  • R

Note that while Python and R have obvious similarities in their approach to data wrangling, SQL is designed to work with relational data, where most of the time-consuming operations are (ideally) performed on the database side rather than in the client’s memory. Both Python and R can process data in chunks, or even work with huge, distributed data sets, through libraries like Dask (Python only) or Apache Spark (Python, R, Scala, Java). However, we’re mostly interested in illustrating some basic data wrangling pipelines in all three languages, so we’ll be using a small database that can easily fit in memory.
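As an aside, here is a minimal sketch of what chunked processing looks like in pandas: read_sql_query accepts an optional chunksize argument, in which case it yields dataframes of that many rows instead of a single dataframe (this uses the con connection object we create below).

# Compute a running total over the payment table, one chunk at a time,
# without ever holding the full table in memory.
total = 0
for chunk in pd.read_sql_query('SELECT amount FROM payment;', con, chunksize=10000):
    total += chunk['amount'].sum()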

The Jupyter notebook allows us to run all our examples in one place. We’ll run an IPython kernel as our base platform, use an IPython magic command to run our R code within the same environment, and use sqlalchemy to run queries on a PostgreSQL database.

We won’t be covering the basics of each language in this post. There are many excellent resources online. For data wrangling the Python Data Science Handbook is an outstanding book. For R it’s hard to get past the wonderful R for Data Science by Garrett Grolemund and Hadley Wickham. For SQL there are more resources than we can mention, especially once we factor in the various vendor-specific flavours of SQL. One of my personal favourites is SQL Fundamentals for BI by Jeffrey James. Our SQL examples are based on some of the contents of that course.

We start by loading the required Python libraries (we’ll do all our data wrangling using the awesome pandas module) and creating a connection to the PostgreSQL database using sqlalchemy. For our examples we’ll be using the well-known dvdrental database, which contains 15 tables representing various aspects of a DVD rental business.

# Connect to a SQL (PostgreSQL) database
from sqlalchemy import create_engine
db = 'postgres:///dvdrental'
con = create_engine(db, echo=False)

# Import pandas for data wrangling in python
import pandas as pd
import warnings
warnings.simplefilter('ignore')

# use ipython magic command to load R into our environment.
%load_ext rpy2.ipython

from IPython.core.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

All our R code will be preceded by the %%R magic command, which runs it on the R interpreter from within our Jupyter notebook. We start by loading the tidyverse library (which is in fact a collection of libraries), and setting an option to improve how some outputs are rendered in the notebook.

%%R
library(tidyverse)
options(crayon.enabled = FALSE)

Because we’re running on top of an IPython kernel, we’ll write our SQL queries as strings and use pandas to run them on our database via sqlalchemy.


The data

First, let’s see what tables are in our DVD rental database.

q = '''
SELECT * FROM pg_catalog.pg_tables;
'''
df = pd.read_sql_query(q, con)
tables = df[~(df.tablename.str.startswith('pg_')|df.tablename.str.startswith('sql_'))].tablename.values
print(tables)
['category' 'country' 'film_category' 'language' 'inventory' 'actor'
 'staff' 'payment' 'rental' 'store' 'city' 'film' 'address' 'film_actor'
 'customer']

For convenience, let’s create a python dictionary with all the tables as dataframes.

Obviously, this means that we’re loading all the tables into memory, which is not what one would typically want to do with a SQL database. However, since our database is small and we’re only interested in comparing the wrangling syntax across languages, this is fine for our purpose.

tbl = {table: pd.read_sql_query(f'SELECT * FROM {table};', 
                                con, parse_dates=['payment_date']) 
       for table in tables}

Let’s look at a few rows of the payment table.

payment = tbl['payment']
payment.head()
payment_id customer_id staff_id rental_id amount payment_date
0 17503 341 2 1520 7.99 2007-02-15 22:25:46.996577
1 17504 341 1 1778 1.99 2007-02-16 17:23:14.996577
2 17505 341 1 1849 7.99 2007-02-16 22:41:45.996577
3 17506 341 2 2829 2.99 2007-02-19 19:39:56.996577
4 17507 341 2 3130 7.99 2007-02-20 17:31:48.996577

We can “pass” the payment dataframe to R in the following way:

%%R -i payment
head(payment)
  payment_id customer_id staff_id rental_id amount        payment_date
0      17503         341        2      1520   7.99 2007-02-15 22:25:46
1      17504         341        1      1778   1.99 2007-02-16 17:23:14
2      17505         341        1      1849   7.99 2007-02-16 22:41:45
3      17506         341        2      2829   2.99 2007-02-19 19:39:56
4      17507         341        2      3130   7.99 2007-02-20 17:31:48
5      17508         341        1      3382   5.99 2007-02-21 12:33:49

Now that we’ve got all the tables loaded, let’s run a few queries.

Let’s start with something simple.


Which movie rentals were below a dollar for customers with an id above 500?

▶ SQL

The query is nothing fancy here. We’re selecting a few features (columns) from the payment table, filtering on the dollar amount and customer id before ordering the rows.

q = '''
SELECT 
    p.payment_id, p.customer_id, p.amount 
FROM 
    payment p
WHERE 
    p.amount < 1 AND p.customer_id > 500
ORDER BY 1 ASC, 2 DESC, 3 ASC;
'''
df_sql = pd.read_sql_query(q, con)     # run the query on the database
print(df_sql.shape)                    # output the shape of the resulting table
df_sql.head()                          # display the first few rows
(413, 3)
payment_id customer_id amount
0 18121 502 0.99
1 18122 502 0.99
2 18125 503 0.99
3 18134 506 0.99
4 18144 510 0.99

▶ Python

Pandas allows us to chain methods on a dataframe; however, writing everything on a single line is cumbersome for long pipelines.

We can write each step of the pipeline on its own line by wrapping the pipeline in parentheses.

df = (payment[(payment.amount < 1) & (payment.customer_id > 500)]  # select rows based on conditions
         .loc[:, ['payment_id','customer_id','amount']]            # select 3 columns
         .sort_values(by=['payment_id', 'customer_id', 'amount'],  # sort the rows according to the values  
                      ascending=[True, False, True])               #      in the columns
         .reset_index(drop=True)                                   # not really required but looks nicer
        )
print(df.shape)
df.head()
(413, 3)
payment_id customer_id amount
0 18121 502 0.99
1 18122 502 0.99
2 18125 503 0.99
3 18134 506 0.99
4 18144 510 0.99

▶ R

For this simple query the R code is similar to the Python one. The pipe symbol %>% plays the role of the dot in pandas, allowing us to chain operations together.

The %%R -o df_r part of the code is a way to transfer the R dataframe back to our Python interpreter.

%%R -o df_r
df_r = payment %>% 
        filter(amount < 1, customer_id > 500) %>% 
        select(payment_id, customer_id, amount) %>% 
        arrange(payment_id, desc(customer_id), amount)

Back in Python, we can take a look at the exported dataframe.

df_r.shape
df_r.head()
payment_id customer_id amount
0 18121 502 0.99
1 18122 502 0.99
2 18125 503 0.99
3 18134 506 0.99
4 18144 510 0.99

Sanity check

Just to make sure that we’ve run our query correctly, we can compare the result of the SQL query with the results of the Python and R pipelines. Because R indexing starts at 1 instead of 0, we need to reset the index of the R dataframe after exporting it to Python. Note also that Python’s built-in all() iterates over a dataframe’s column labels rather than its values, so we use the dataframe’s own all method to compare element-wise.

(df == df_sql).all().all(), (df_r.reset_index(drop=True) == df_sql).all().all()
(True, True)
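As an alternative, stricter check, we could lean on pandas’ testing utilities. The sketch below raises an AssertionError if the two frames differ and stays silent otherwise (check_dtype=False relaxes the comparison, since values coming back from R or the database may not share the exact same dtypes):

# raises if the frames differ; silent otherwise
pd.testing.assert_frame_equal(df, df_sql, check_dtype=False)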

Any other trivial query could be handled in a similar way…

Let’s look at a query where we need to extract a feature that doesn’t appear explicitly in the original data.


Which transactions occurred in the month of March only?

The payment table has a payment_date feature from which we will need to extract the month.

▶ SQL

We can extract the month value out of the payment_date column using the EXTRACT function.

q = '''
SELECT 
    p.payment_id, p.customer_id cust_id, p.amount, p.payment_date
FROM 
    payment p
WHERE 
    EXTRACT(month FROM p.payment_date) = 3
    AND p.amount < 1
ORDER BY 
    cust_id DESC, 3 ASC;
'''
df_sql = pd.read_sql_query(q, con)
print(df_sql.shape)
df_sql.head()
(1021, 4)
payment_id cust_id amount payment_date
0 22611 598 0.99 2007-03-02 15:51:25.996577
1 22613 598 0.99 2007-03-21 07:09:41.996577
2 22606 597 0.99 2007-03-19 05:51:32.996577
3 22589 596 0.99 2007-03-01 20:50:37.996577
4 22598 596 0.99 2007-03-22 21:49:07.996577

▶ Python

When we loaded the payment table from the database using sqlalchemy, the payment_date feature was correctly parsed as a datetime object.

This allows us to use the pandas dt accessor to extract a variety of date and time related features, including the month.
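For instance, here is a quick sketch of a few of the components the dt accessor exposes on our payment_date column:

payment.payment_date.dt.month.head()     # month number (1-12)
payment.payment_date.dt.day.head()       # day of the month
payment.payment_date.dt.weekday.head()   # day of the week (Monday=0)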

df = (payment[(payment.amount < 1) & (payment.payment_date.dt.month==3)]
       .rename({'customer_id': 'cust_id'}, axis=1)
       .reindex(['payment_id','cust_id','amount','payment_date'], axis=1)
       .sort_values(['cust_id', 'amount'], ascending=[False, True])
       .reset_index(drop=True)
     )
      
print(df.shape)
df.head()
(1021, 4)
payment_id cust_id amount payment_date
0 22611 598 0.99 2007-03-02 15:51:25.996577
1 22613 598 0.99 2007-03-21 07:09:41.996577
2 22606 597 0.99 2007-03-19 05:51:32.996577
3 22589 596 0.99 2007-03-01 20:50:37.996577
4 22592 596 0.99 2007-03-17 15:08:26.996577

▶ R

A great way to handle datetime objects in R is via the lubridate library. To make it clear when we’re using lubridate, we won’t load the library but will instead call its functions using the lubridate:: prefix.

%%R -o df_r
df_r = payment %>% 
        filter(amount < 1, lubridate::month(payment_date) == 3) %>% 
        rename(cust_id=customer_id) %>% 
        select(-rental_id, -staff_id) %>% 
        arrange(desc(cust_id), amount)
print(df_r.shape)
df_r.head()
(1021, 4)
payment_id cust_id amount payment_date
0 22611 598 0.99 2007-03-02 15:51:25+11:00
1 22613 598 0.99 2007-03-21 07:09:41+11:00
2 22606 597 0.99 2007-03-19 05:51:32+11:00
3 22589 596 0.99 2007-03-01 20:50:37+11:00
4 22592 596 0.99 2007-03-17 15:08:26+11:00

Here we get a glimpse into the infinite joy of working with datetime objects.

During the transfer from Python to R and back to Python, our payment_date column has changed slightly: it picked up a timezone offset and lost its sub-second precision.

Instead of worrying about fixing this, we’ll just exclude that column when comparing the R result with the SQL/Python ones.

Henceforth, we won’t be worrying about this issue. We’ll simply check that the dataframes’ shapes match and that the first few rows are consistent.

(df == df_sql).all().all(), (df.iloc[:, :-1] == df_r.reset_index(drop=True).iloc[:, :-1]).all().all()
(True, True)

Which day saw the most business (largest number of transactions)?

This is a common operation. We need to group our payments by date, count how many transactions occurred on each date, and order the results accordingly.

▶ SQL

q = '''
SELECT 
    p.payment_date::date, COUNT(*)
FROM 
    payment p
GROUP BY 1
ORDER BY 2 DESC
'''
df_sql = pd.read_sql_query(q, con)
print(df_sql.shape)
df_sql
(32, 2)
payment_date count
0 2007-04-30 1311
1 2007-03-01 676
2 2007-03-21 673
3 2007-04-27 643
4 2007-04-29 640
5 2007-03-19 631
6 2007-04-28 627
7 2007-03-18 624
8 2007-03-22 621
9 2007-03-20 611
10 2007-03-02 595
11 2007-03-17 584
12 2007-03-23 557
13 2007-04-08 516
14 2007-04-09 514
15 2007-04-06 486
16 2007-04-10 482
17 2007-04-07 472
18 2007-04-11 468
19 2007-04-12 452
20 2007-02-19 310
21 2007-02-15 308
22 2007-02-18 302
23 2007-02-20 291
24 2007-02-17 283
25 2007-02-16 282
26 2007-02-21 213
27 2007-05-14 182
28 2007-04-26 79
29 2007-03-16 72
30 2007-04-05 64
31 2007-02-14 27

▶ Python

The Python pipeline mimics the SQL query rather closely.

df = (payment
       .groupby(payment.payment_date.dt.date)['amount']
       .count().rename('count')
       .sort_values(ascending=False)
       .reset_index()
     )
df
payment_date count
0 2007-04-30 1311
1 2007-03-01 676
2 2007-03-21 673
3 2007-04-27 643
4 2007-04-29 640
5 2007-03-19 631
6 2007-04-28 627
7 2007-03-18 624
8 2007-03-22 621
9 2007-03-20 611
10 2007-03-02 595
11 2007-03-17 584
12 2007-03-23 557
13 2007-04-08 516
14 2007-04-09 514
15 2007-04-06 486
16 2007-04-10 482
17 2007-04-07 472
18 2007-04-11 468
19 2007-04-12 452
20 2007-02-19 310
21 2007-02-15 308
22 2007-02-18 302
23 2007-02-20 291
24 2007-02-17 283
25 2007-02-16 282
26 2007-02-21 213
27 2007-05-14 182
28 2007-04-26 79
29 2007-03-16 72
30 2007-04-05 64
31 2007-02-14 27

▶ R

We follow the same pattern for our R code. The conversion to a dataframe at the end (using as.data.frame) is simply to get a nicer formatting of the output.

%%R 
df_r = payment %>% 
        group_by(payment_date=lubridate::date(payment_date)) %>% 
        summarise(n = n()) %>% 
        arrange(desc(n)) %>%
        as.data.frame()
df_r
   payment_date    n
1    2007-04-30 1311
2    2007-03-01  676
3    2007-03-21  673
4    2007-04-27  643
5    2007-04-29  640
6    2007-03-19  631
7    2007-04-28  627
8    2007-03-18  624
9    2007-03-22  621
10   2007-03-20  611
11   2007-03-02  595
12   2007-03-17  584
13   2007-03-23  557
14   2007-04-08  516
15   2007-04-09  514
16   2007-04-06  486
17   2007-04-10  482
18   2007-04-07  472
19   2007-04-11  468
20   2007-04-12  452
21   2007-02-19  310
22   2007-02-15  308
23   2007-02-18  302
24   2007-02-20  291
25   2007-02-17  283
26   2007-02-16  282
27   2007-02-21  213
28   2007-05-14  182
29   2007-04-26   79
30   2007-03-16   72
31   2007-04-05   64
32   2007-02-14   27

How much did each customer spend and when?

We can answer this question in a number of ways. Let’s return a sequence of transaction dates for each customer. In SQL we can do that with an array.

▶ SQL

q = '''
SELECT 
    p.customer_id, 
    SUM(p.amount) total, 
    ARRAY_AGG(p.payment_date::date) dates
FROM 
    payment p
GROUP BY 1
ORDER BY 2 DESC
'''

df_sql = pd.read_sql_query(q, con)
print(df_sql.shape)
df_sql.head()
(599, 3)
customer_id total dates
0 148 211.55 [2007-02-15, 2007-02-15, 2007-02-19, 2007-02-1...
1 526 208.58 [2007-02-15, 2007-02-16, 2007-02-17, 2007-02-1...
2 178 194.61 [2007-02-15, 2007-02-15, 2007-02-16, 2007-02-1...
3 137 191.62 [2007-02-18, 2007-02-19, 2007-02-20, 2007-02-2...
4 144 189.60 [2007-02-16, 2007-02-17, 2007-02-19, 2007-02-2...

▶ Python

In Python we’ll gather the transaction dates for each customer as a list.

We can build that list using a lambda (anonymous) function during the aggregation step.

df = (payment
       .groupby('customer_id')[['amount', 'payment_date']]
       .agg({
               'amount':'sum',
               'payment_date': lambda x: list(x.dt.date)
            })
       .rename(columns={'amount':'total', 'payment_date': 'date'})
       .sort_values('total', ascending=False)
       .reset_index()
      )
print(df.shape)
df.head()
(599, 3)
customer_id total date
0 148 211.55 [2007-02-15, 2007-02-15, 2007-02-19, 2007-02-1...
1 526 208.58 [2007-02-15, 2007-02-16, 2007-02-17, 2007-02-1...
2 178 194.61 [2007-02-15, 2007-02-15, 2007-02-16, 2007-02-1...
3 137 191.62 [2007-02-18, 2007-02-19, 2007-02-20, 2007-02-2...
4 144 189.60 [2007-02-16, 2007-02-17, 2007-02-19, 2007-02-2...

▶ R

For this query the R code is similar to the Python one (syntactic idiosyncrasies aside).

%%R 
df_r = payment %>% 
      mutate(payment_date = as.Date(payment_date)) %>% 
      group_by(customer_id) %>% 
      summarise(total=sum(amount), dates=list(payment_date)) %>% 
      arrange(desc(total))

print(dim(df_r))
print(head(df_r, 5))
[1] 599   3
# A tibble: 5 x 3
  customer_id total dates      
        <int> <dbl> <list>     
1         148  212. <date [45]>
2         526  209. <date [42]>
3         178  195. <date [39]>
4         137  192. <date [38]>
5         144  190. <date [40]>

At first glance it looks like the results differ from the SQL/Python ones, but this is purely cosmetic. The R data structure (actually a tibble) rounds the floating point values and abbreviates the lists when displaying the results. We could convert the tibble to an R dataframe, but the output would be unwieldy. Instead, we can unpack the first row to show that the data has been parsed correctly.

%%R
print(df_r$total[1]) 
print(df_r$dates[1])
[1] 211.55
[[1]]
 [1] "2007-02-15" "2007-02-15" "2007-02-19" "2007-02-19" "2007-02-19"
 [6] "2007-03-01" "2007-03-02" "2007-03-17" "2007-03-17" "2007-03-18"
[11] "2007-03-18" "2007-03-18" "2007-03-20" "2007-03-20" "2007-03-20"
[16] "2007-03-21" "2007-03-21" "2007-03-21" "2007-03-21" "2007-03-22"
[21] "2007-03-22" "2007-03-22" "2007-03-22" "2007-04-05" "2007-04-06"
[26] "2007-04-08" "2007-04-08" "2007-04-08" "2007-04-09" "2007-04-10"
[31] "2007-04-11" "2007-04-12" "2007-04-27" "2007-04-27" "2007-04-27"
[36] "2007-04-28" "2007-04-28" "2007-04-28" "2007-04-29" "2007-04-29"
[41] "2007-04-29" "2007-04-29" "2007-04-30" "2007-04-29" "2007-04-30"

Which days of March 2007 recorded no sales?

This may seem like the sort of query we’ve already looked at; however, there is a catch. The database only records transactions that have occurred (it seems obvious when said like that!), which means that dates with no sales simply aren’t in the database. Therefore we need to generate a series of dates (with a daily frequency) and join our payment data onto it to find the days with no sales.

In PostgreSQL we can use generate_series to do that. Note that we select from it as if it were a table (which we name gs). By joining on it we can “expand” our payment data to include the dates for which no transaction took place. These rows would of course contain NULL values for the transaction data. By identifying where those NULL rows are, we can answer the original question.

▶ SQL

q = '''
SELECT 
    gs::date, p.*
FROM 
    generate_series('2007-03-01', '2007-03-31', INTERVAL '1 Day') gs
LEFT JOIN 
    payment p 
ON 
    p.payment_date::date = gs::date
WHERE 
    payment_date is NULL
'''
df_sql = pd.read_sql_query(q, con).fillna('')['gs']
print(df_sql.shape)
df_sql
(21,)
0     2007-03-03
1     2007-03-04
2     2007-03-05
3     2007-03-06
4     2007-03-07
5     2007-03-08
6     2007-03-09
7     2007-03-10
8     2007-03-11
9     2007-03-12
10    2007-03-13
11    2007-03-14
12    2007-03-15
13    2007-03-24
14    2007-03-25
15    2007-03-26
16    2007-03-27
17    2007-03-28
18    2007-03-29
19    2007-03-30
20    2007-03-31
Name: gs, dtype: object

▶ Python

We could do something similar in Python; however, let’s try a more Pythonic approach and use set operations to extract the dates we want. We simply construct the set of all dates in March 2007 and take its difference with the set of dates that are already in the database. The moral of the story is that while Python (and R) code can often be written to look SQLish, that is not always the most concise, or even the most suitable, way to get to the answer.

gs = pd.date_range(start='20070301', end='20070331', freq='D')
df = pd.DataFrame(index = set(gs.date) - set(payment.payment_date.dt.date.values)).sort_index()
print(df.shape)
df
(21, 0)
2007-03-03
2007-03-04
2007-03-05
2007-03-06
2007-03-07
2007-03-08
2007-03-09
2007-03-10
2007-03-11
2007-03-12
2007-03-13
2007-03-14
2007-03-15
2007-03-24
2007-03-25
2007-03-26
2007-03-27
2007-03-28
2007-03-29
2007-03-30
2007-03-31

▶ R

Set operations also work in R. The output is a bit strange: we get a list of single-element lists. We could flatten the nested list, but that would only obfuscate the code, so we’ll leave it as is.

%%R
gs = seq(lubridate::ymd('2007-03-01'), lubridate::ymd('2007-03-31'), by = '1 day')
lapply(sort(setdiff(gs, lubridate::date(payment$payment_date))), as.Date, origin='1970-01-01')
[[1]]
[1] "2007-03-03"

[[2]]
[1] "2007-03-04"

[[3]]
[1] "2007-03-05"

[[4]]
[1] "2007-03-06"

[[5]]
[1] "2007-03-07"

[[6]]
[1] "2007-03-08"

[[7]]
[1] "2007-03-09"

[[8]]
[1] "2007-03-10"

[[9]]
[1] "2007-03-11"

[[10]]
[1] "2007-03-12"

[[11]]
[1] "2007-03-13"

[[12]]
[1] "2007-03-14"

[[13]]
[1] "2007-03-15"

[[14]]
[1] "2007-03-24"

[[15]]
[1] "2007-03-25"

[[16]]
[1] "2007-03-26"

[[17]]
[1] "2007-03-27"

[[18]]
[1] "2007-03-28"

[[19]]
[1] "2007-03-29"

[[20]]
[1] "2007-03-30"

[[21]]
[1] "2007-03-31"

How many transactions were there for each day of March 2007?

This is a simple question; we just need to keep in mind that some days had no transactions, and we should still output those with a count of zero. The basic idea is similar to the previous example: we create a time series to “pad out” the missing days in the payment_date feature.

▶ SQL

q = '''
SELECT 
    gs::date date, 
    COUNT(p.*) number_of_transactions
FROM 
    generate_series('2007-03-01', '2007-03-31', INTERVAL '1 Day') gs
LEFT JOIN 
    payment p 
ON 
    p.payment_date::date = gs::date 
GROUP BY 
    gs::date 
'''
df_sql = pd.read_sql_query(q, con)
print(df_sql.shape)
df_sql
(31, 2)
date number_of_transactions
0 2007-03-01 676
1 2007-03-02 595
2 2007-03-03 0
3 2007-03-04 0
4 2007-03-05 0
5 2007-03-06 0
6 2007-03-07 0
7 2007-03-08 0
8 2007-03-09 0
9 2007-03-10 0
10 2007-03-11 0
11 2007-03-12 0
12 2007-03-13 0
13 2007-03-14 0
14 2007-03-15 0
15 2007-03-16 72
16 2007-03-17 584
17 2007-03-18 624
18 2007-03-19 631
19 2007-03-20 611
20 2007-03-21 673
21 2007-03-22 621
22 2007-03-23 557
23 2007-03-24 0
24 2007-03-25 0
25 2007-03-26 0
26 2007-03-27 0
27 2007-03-28 0
28 2007-03-29 0
29 2007-03-30 0
30 2007-03-31 0

▶ Python

Similarly, in Python we can create the series of dates for March 2007 using the pandas date_range function.

gs = pd.date_range(start='20070301', end='20070331', freq='D')
df = (payment
       .assign(payment_date=pd.to_datetime(payment.payment_date).dt.date)
       .groupby('payment_date')
       .agg({'amount':'count'})
       .reindex(gs, fill_value=0)
       .reset_index()
       .rename(columns={'index':'date','amount':'number_of_transactions'})
)

print(df.shape)
df
(31, 2)
date number_of_transactions
0 2007-03-01 676
1 2007-03-02 595
2 2007-03-03 0
3 2007-03-04 0
4 2007-03-05 0
5 2007-03-06 0
6 2007-03-07 0
7 2007-03-08 0
8 2007-03-09 0
9 2007-03-10 0
10 2007-03-11 0
11 2007-03-12 0
12 2007-03-13 0
13 2007-03-14 0
14 2007-03-15 0
15 2007-03-16 72
16 2007-03-17 584
17 2007-03-18 624
18 2007-03-19 631
19 2007-03-20 611
20 2007-03-21 673
21 2007-03-22 621
22 2007-03-23 557
23 2007-03-24 0
24 2007-03-25 0
25 2007-03-26 0
26 2007-03-27 0
27 2007-03-28 0
28 2007-03-29 0
29 2007-03-30 0
30 2007-03-31 0

▶ R

Once again, the pattern for R follows the same principles.

%%R 
gs = tibble(payment_date=seq(lubridate::ymd('2007-03-01'), 
                             lubridate::ymd('2007-03-31'), 
                             by = '1 day'))

df_r = payment %>% 
      mutate(payment_date=lubridate::date(payment_date)) %>% 
      group_by(payment_date) %>% 
      summarise(number_of_transactions=n()) %>% 
      right_join(gs, by='payment_date') %>%
      replace_na(list(number_of_transactions=0))

as.data.frame(df_r)
   payment_date number_of_transactions
1    2007-03-01                    676
2    2007-03-02                    595
3    2007-03-03                      0
4    2007-03-04                      0
5    2007-03-05                      0
6    2007-03-06                      0
7    2007-03-07                      0
8    2007-03-08                      0
9    2007-03-09                      0
10   2007-03-10                      0
11   2007-03-11                      0
12   2007-03-12                      0
13   2007-03-13                      0
14   2007-03-14                      0
15   2007-03-15                      0
16   2007-03-16                     72
17   2007-03-17                    584
18   2007-03-18                    624
19   2007-03-19                    631
20   2007-03-20                    611
21   2007-03-21                    673
22   2007-03-22                    621
23   2007-03-23                    557
24   2007-03-24                      0
25   2007-03-25                      0
26   2007-03-26                      0
27   2007-03-27                      0
28   2007-03-28                      0
29   2007-03-29                      0
30   2007-03-30                      0
31   2007-03-31                      0

Days with no transactions in March 2007 - Alternate version

Let’s revisit the problem we dealt with previously using set operations in Python and R, and illustrate how we could answer the question with a more SQLish pipeline. We can reuse the code from the previous example; we just need to add a filtering step.

▶ SQL

q = '''
SELECT 
    gs::date date, COUNT(p.*) number_of_transactions
FROM 
    generate_series('2007-03-01', '2007-03-31', INTERVAL '1 Day') gs
LEFT JOIN 
    payment p 
ON 
    p.payment_date::date = gs::date 
GROUP BY 
    gs::date 
HAVING 
    COUNT(p.*) = 0
'''
df_sql = pd.read_sql_query(q, con)
print(df_sql.shape)
df_sql
(21, 2)
date number_of_transactions
0 2007-03-03 0
1 2007-03-04 0
2 2007-03-05 0
3 2007-03-06 0
4 2007-03-07 0
5 2007-03-08 0
6 2007-03-09 0
7 2007-03-10 0
8 2007-03-11 0
9 2007-03-12 0
10 2007-03-13 0
11 2007-03-14 0
12 2007-03-15 0
13 2007-03-24 0
14 2007-03-25 0
15 2007-03-26 0
16 2007-03-27 0
17 2007-03-28 0
18 2007-03-29 0
19 2007-03-30 0
20 2007-03-31 0

▶ Python

gs = pd.date_range(start='20070301', end='20070331', freq='D')
df = (payment
       .assign(payment_date=pd.to_datetime(payment.payment_date).dt.date)
       .groupby('payment_date')
       .agg({'amount':'count'})
       .rename(columns={'amount':'number_of_transactions'})
       .reindex(gs, fill_value=0)
       .query('number_of_transactions == 0')  
)

print(df.shape)
df
(21, 1)
number_of_transactions
2007-03-03 0
2007-03-04 0
2007-03-05 0
2007-03-06 0
2007-03-07 0
2007-03-08 0
2007-03-09 0
2007-03-10 0
2007-03-11 0
2007-03-12 0
2007-03-13 0
2007-03-14 0
2007-03-15 0
2007-03-24 0
2007-03-25 0
2007-03-26 0
2007-03-27 0
2007-03-28 0
2007-03-29 0
2007-03-30 0
2007-03-31 0

▶ R

%%R
gs = tibble(payment_date=seq(lubridate::ymd('2007-03-01'), 
                             lubridate::ymd('2007-03-31'), 
                             by = '1 day'))

df_r = payment %>% 
      mutate(payment_date=lubridate::date(payment_date)) %>% 
      group_by(payment_date) %>% 
      summarise(number_of_transactions=n()) %>% 
      right_join(gs, by="payment_date") %>%
      replace_na(list(number_of_transactions=0)) %>%
      subset(number_of_transactions==0)

print(dim(df_r))
as.data.frame(df_r)
[1] 21  2
   payment_date number_of_transactions
1    2007-03-03                      0
2    2007-03-04                      0
3    2007-03-05                      0
4    2007-03-06                      0
5    2007-03-07                      0
6    2007-03-08                      0
7    2007-03-09                      0
8    2007-03-10                      0
9    2007-03-11                      0
10   2007-03-12                      0
11   2007-03-13                      0
12   2007-03-14                      0
13   2007-03-15                      0
14   2007-03-24                      0
15   2007-03-25                      0
16   2007-03-26                      0
17   2007-03-27                      0
18   2007-03-28                      0
19   2007-03-29                      0
20   2007-03-30                      0
21   2007-03-31                      0

Which movies have never been rented?

So far we’ve only worked with a single table out of the 15 contained in the database. To match movies to rental transactions we will need three new tables. The film table contains information about the movies, the inventory table links each movie to an inventory id, and the rental table uses that inventory id to connect movies to actual rentals.

Let’s have a look at the structures of these tables.

for t in ['inventory', 'payment', 'rental', 'film']:
    q = f'''
SELECT * FROM {t}
LIMIT 1
'''
    df = pd.read_sql_query(q, con)
    print(t)
    display(df)
inventory
inventory_id film_id store_id last_update
0 1 1 1 2006-02-15 10:09:17
payment
payment_id customer_id staff_id rental_id amount payment_date
0 17503 341 2 1520 7.99 2007-02-15 22:25:46.996577
rental
rental_id rental_date inventory_id customer_id return_date staff_id last_update
0 2 2005-05-24 22:54:33 1525 459 2005-05-28 19:40:33 1 2006-02-16 02:30:53
film
film_id title description release_year language_id rental_duration rental_rate length replacement_cost rating last_update special_features fulltext
0 133 Chamber Italian A Fateful Reflection of a Moose And a Husband ... 2006 1 7 4.99 117 14.99 NC-17 2013-05-26 14:50:58.951 [Trailers] 'chamber':1 'fate':4 'husband':11 'italian':2 ...

▶ SQL

Let’s join the tables, count how many times each movie has been rented and select those for which the count is zero.

q = '''
SELECT 
    t.film_id, t.title, t.rentals 
FROM 
    (SELECT 
        f.film_id, f.title, 
        COUNT(distinct r.rental_id) as rentals
    FROM film f
        LEFT JOIN inventory i ON i.film_id = f.film_id
        LEFT JOIN rental r ON r.inventory_id = i.inventory_id
    GROUP BY 1,2
    HAVING 
        COUNT(distinct r.rental_id) = 0
        ) t
'''
df_sql = pd.read_sql_query(q, con)
print(df_sql.shape)
df_sql
(42, 3)
film_id title rentals
0 14 Alice Fantasia 0
1 33 Apollo Teen 0
2 36 Argonauts Town 0
3 38 Ark Ridgemont 0
4 41 Arsenic Independence 0
5 87 Boondock Ballroom 0
6 108 Butch Panther 0
7 128 Catch Amistad 0
8 144 Chinatown Gladiator 0
9 148 Chocolate Duck 0
10 171 Commandments Express 0
11 192 Crossing Divorce 0
12 195 Crowds Telemark 0
13 198 Crystal Breaking 0
14 217 Dazed Punk 0
15 221 Deliverance Mulholland 0
16 318 Firehouse Vietnam 0
17 325 Floats Garden 0
18 332 Frankenstein Stranger 0
19 359 Gladiator Westward 0
20 386 Gump Date 0
21 404 Hate Handicap 0
22 419 Hocus Frida 0
23 495 Kentuckian Giant 0
24 497 Kill Brotherhood 0
25 607 Muppet Mile 0
26 642 Order Betrayed 0
27 669 Pearl Destiny 0
28 671 Perdition Fargo 0
29 701 Psycho Shrunk 0
30 712 Raiders Antitrust 0
31 713 Rainbow Shock 0
32 742 Roof Champion 0
33 801 Sister Freddy 0
34 802 Sky Miracle 0
35 860 Suicides Silence 0
36 874 Tadpole Park 0
37 909 Treasure Command 0
38 943 Villain Desperate 0
39 950 Volume House 0
40 954 Wake Jaws 0
41 955 Walls Artist 0

▶ Python

The go-to approach for join operations in pandas is the merge method. The type of join can be specified via an optional parameter.

film, inventory, rental = tbl['film'], tbl['inventory'], tbl['rental']

df = (film
        .merge(inventory, on='film_id', how='left')
        .merge(rental, on='inventory_id', how='left')
        .groupby(['film_id','title'])[['rental_id']]
        .count()
        .rename(columns={'rental_id':'rentals'})
        .query('rentals == 0')
        .reset_index()
     )
print(df.shape)
df
(42, 3)
film_id title rentals
0 14 Alice Fantasia 0
1 33 Apollo Teen 0
2 36 Argonauts Town 0
3 38 Ark Ridgemont 0
4 41 Arsenic Independence 0
5 87 Boondock Ballroom 0
6 108 Butch Panther 0
7 128 Catch Amistad 0
8 144 Chinatown Gladiator 0
9 148 Chocolate Duck 0
10 171 Commandments Express 0
11 192 Crossing Divorce 0
12 195 Crowds Telemark 0
13 198 Crystal Breaking 0
14 217 Dazed Punk 0
15 221 Deliverance Mulholland 0
16 318 Firehouse Vietnam 0
17 325 Floats Garden 0
18 332 Frankenstein Stranger 0
19 359 Gladiator Westward 0
20 386 Gump Date 0
21 404 Hate Handicap 0
22 419 Hocus Frida 0
23 495 Kentuckian Giant 0
24 497 Kill Brotherhood 0
25 607 Muppet Mile 0
26 642 Order Betrayed 0
27 669 Pearl Destiny 0
28 671 Perdition Fargo 0
29 701 Psycho Shrunk 0
30 712 Raiders Antitrust 0
31 713 Rainbow Shock 0
32 742 Roof Champion 0
33 801 Sister Freddy 0
34 802 Sky Miracle 0
35 860 Suicides Silence 0
36 874 Tadpole Park 0
37 909 Treasure Command 0
38 943 Villain Desperate 0
39 950 Volume House 0
40 954 Wake Jaws 0
41 955 Walls Artist 0

▶ R

We start by “transferring” the new dataframes to R. The rest of the pipeline is very much like the one used in Python.

%%R -i rental -i inventory -i film -o df_r
df_r = film %>% 
        left_join(inventory, by="film_id") %>% 
        left_join(rental, by="inventory_id") %>% 
        group_by(film_id, title) %>% 
        summarise(rentals=n_distinct(rental_id, na.rm=TRUE)) %>% 
        filter(rentals==0) %>%
        as.data.frame()

df_r
   film_id                  title rentals
1       14         Alice Fantasia       0
2       33            Apollo Teen       0
3       36         Argonauts Town       0
4       38          Ark Ridgemont       0
5       41   Arsenic Independence       0
6       87      Boondock Ballroom       0
7      108          Butch Panther       0
8      128          Catch Amistad       0
9      144    Chinatown Gladiator       0
10     148         Chocolate Duck       0
11     171   Commandments Express       0
12     192       Crossing Divorce       0
13     195        Crowds Telemark       0
14     198       Crystal Breaking       0
15     217             Dazed Punk       0
16     221 Deliverance Mulholland       0
17     318      Firehouse Vietnam       0
18     325          Floats Garden       0
19     332  Frankenstein Stranger       0
20     359     Gladiator Westward       0
21     386              Gump Date       0
22     404          Hate Handicap       0
23     419            Hocus Frida       0
24     495       Kentuckian Giant       0
25     497       Kill Brotherhood       0
26     607            Muppet Mile       0
27     642         Order Betrayed       0
28     669          Pearl Destiny       0
29     671        Perdition Fargo       0
30     701          Psycho Shrunk       0
31     712      Raiders Antitrust       0
32     713          Rainbow Shock       0
33     742          Roof Champion       0
34     801          Sister Freddy       0
35     802            Sky Miracle       0
36     860       Suicides Silence       0
37     874           Tadpole Park       0
38     909       Treasure Command       0
39     943      Villain Desperate       0
40     950           Volume House       0
41     954              Wake Jaws       0
42     955           Walls Artist       0
(df == df_sql).all().all(), (df == df_r.reset_index(drop=True)).all().all()
(True, True)

Find each customer’s first order.

To solve the problem in SQL we can use a correlated subquery, whereby the inner query is executed for every row of the outer query.

In our case, the inner query must be re-evaluated for each customer, because min(rental_id) changes from one customer to the next.

▶ SQL

q = '''
SELECT 
    r.customer_id, 
    min(r.rental_id) first_order_id, 
    (    
        SELECT r2.rental_date::date FROM rental r2 
        WHERE r2.rental_id = min(r.rental_id)
    ) first_order_date
FROM 
    rental r
GROUP BY 1
ORDER BY 1
'''

df_sql = pd.read_sql_query(q, con)
print(df_sql.shape)
display(df_sql.head())
(599, 3)
customer_id first_order_id first_order_date
0 1 76 2005-05-25
1 2 320 2005-05-27
2 3 435 2005-05-27
3 4 1297 2005-06-15
4 5 731 2005-05-29

▶ Python

In this case the Python syntax is a bit simpler. We group by customer_id and pick the smallest rental_id and rental_date.

df = (rental
        .assign(rental_date=rental.rental_date.dt.date)
        .groupby('customer_id')[['rental_id', 'rental_date']]
        .min()
        .reset_index()
        .rename(columns={'rental_id':'first_order_id', 'rental_date':'first_order_date'})
      )

print(df.shape)
display(df.head())
(599, 3)
customer_id first_order_id first_order_date
0 1 76 2005-05-25
1 2 320 2005-05-27
2 3 435 2005-05-27
3 4 1297 2005-06-15
4 5 731 2005-05-29
(df == df_sql).all().all()
True

▶ R

The R version of the code is similar to the Python one.

%%R 

df_r = rental %>% 
        mutate(rental_date=lubridate::date(rental_date)) %>% 
        group_by(customer_id) %>% 
        summarise_at(vars(rental_id, rental_date), min) %>%
        as.data.frame()

print(dim(df_r))
print(head(df_r, 5))
[1] 599   3
  customer_id rental_id rental_date
1           1        76  2005-05-25
2           2       320  2005-05-27
3           3       435  2005-05-27
4           4      1297  2005-06-15
5           5       731  2005-05-29

Handling multiple conditions.

Let’s look at a more complex query. We’d like to return all the rental transactions conducted by the staff member with id 1, for which the rental id is larger than 15000, and for which the payment occurred between 4am and 5am or 11pm and midnight in March 2007.

▶ SQL

Although more complex than our previous examples, this query can be built from pieces we’ve used before. To keep things somewhat simple, we’ll take “between 4am and 5am” to mean between 4:00 and 4:59, which allows us to only concern ourselves with the hour number when filtering our data. The same goes for the second time period.

q = '''
WITH base_table AS (
    SELECT gs::date AS date,  p.*
    FROM 
        generate_series('2007-03-01', '2007-03-31', INTERVAL '1 day') as gs
    LEFT JOIN payment p ON p.payment_date::date=gs::date AND p.staff_id = 1
    ORDER BY 1 NULLS FIRST
) 
SELECT 
    bt.date, bt.payment_id, bt.customer_id, bt.staff_id, bt.rental_id, bt.amount, 
    EXTRACT(hour FROM bt.payment_date)::int AS hour
FROM 
    base_table bt
WHERE 
    bt.rental_id > 15000 
    AND         
    EXTRACT(hour FROM bt.payment_date) IN (4,23)
ORDER BY bt.payment_date, bt.rental_id
'''
df_sql = pd.read_sql_query(q, con)
print(df_sql.shape)
df_sql
(36, 7)
date payment_id customer_id staff_id rental_id amount hour
0 2007-03-22 23134 46 1 15438 2.99 23
1 2007-03-22 23958 136 1 15439 4.99 23
2 2007-03-22 23095 42 1 15442 2.99 23
3 2007-03-22 22615 598 1 15443 7.99 23
4 2007-03-22 22960 28 1 15445 4.99 23
5 2007-03-22 20574 379 1 15446 4.99 23
6 2007-03-22 21651 490 1 15448 2.99 23
7 2007-03-22 20027 322 1 15450 0.99 23
8 2007-03-22 21866 514 1 15451 2.99 23
9 2007-03-22 20370 359 1 15453 1.99 23
10 2007-03-22 24882 238 1 15455 0.99 23
11 2007-03-22 25113 262 1 15456 0.99 23
12 2007-03-22 19906 306 1 15457 2.99 23
13 2007-03-22 22876 20 1 15460 2.99 23
14 2007-03-22 23646 103 1 15461 5.99 23
15 2007-03-22 20675 389 1 15462 5.99 23
16 2007-03-22 23869 127 1 15463 5.99 23
17 2007-03-22 23283 62 1 15464 6.99 23
18 2007-03-22 21921 520 1 15465 0.99 23
19 2007-03-22 20970 418 1 15466 4.99 23
20 2007-03-22 23647 103 1 15467 3.99 23
21 2007-03-22 20768 399 1 15468 4.99 23
22 2007-03-22 22610 597 1 15469 4.99 23
23 2007-03-23 24692 215 1 15583 2.99 4
24 2007-03-23 21731 500 1 15584 2.99 4
25 2007-03-23 22149 545 1 15585 0.99 4
26 2007-03-23 21723 499 1 15587 7.99 4
27 2007-03-23 21764 503 1 15588 3.99 4
28 2007-03-23 22901 22 1 15589 6.99 4
29 2007-03-23 23284 62 1 15591 0.99 4
30 2007-03-23 24163 153 1 15593 1.99 4
31 2007-03-23 25135 264 1 15595 4.99 4
32 2007-03-23 24478 186 1 15596 0.99 4
33 2007-03-23 23323 66 1 15598 8.99 4
34 2007-03-23 23648 103 1 15599 5.99 4
35 2007-03-23 23729 113 1 15600 1.99 4

▶ Python

Just for some variety, we use the between method of pandas Series to select the March 2007 data. We pass pandas Timestamp objects marking the first and last day of the month. The rest of the pipeline should be self-explanatory. The last line is added for convenience: we order the columns of the pandas dataframe to match the ordering of the columns in our SQL result.

df = (payment
  .assign(date=payment.payment_date.dt.date)
  .loc[payment.payment_date.between(
              pd.Timestamp(year=2007, month=3, day=1), 
              pd.Timestamp(year=2007, month=3, day=31))
      ]
  .assign(hour=payment.payment_date.dt.hour)
  .query('(hour in (4, 23)) & (rental_id > 15000) & (staff_id == 1)')
  .sort_values('payment_date')
  .reset_index()
  .loc[:, df_sql.columns]
) 

print(df.shape)
df
(36, 7)
date payment_id customer_id staff_id rental_id amount hour
0 2007-03-22 23134 46 1 15438 2.99 23
1 2007-03-22 23958 136 1 15439 4.99 23
2 2007-03-22 23095 42 1 15442 2.99 23
3 2007-03-22 22615 598 1 15443 7.99 23
4 2007-03-22 22960 28 1 15445 4.99 23
5 2007-03-22 20574 379 1 15446 4.99 23
6 2007-03-22 21651 490 1 15448 2.99 23
7 2007-03-22 20027 322 1 15450 0.99 23
8 2007-03-22 21866 514 1 15451 2.99 23
9 2007-03-22 20370 359 1 15453 1.99 23
10 2007-03-22 24882 238 1 15455 0.99 23
11 2007-03-22 25113 262 1 15456 0.99 23
12 2007-03-22 19906 306 1 15457 2.99 23
13 2007-03-22 22876 20 1 15460 2.99 23
14 2007-03-22 23646 103 1 15461 5.99 23
15 2007-03-22 20675 389 1 15462 5.99 23
16 2007-03-22 23869 127 1 15463 5.99 23
17 2007-03-22 23283 62 1 15464 6.99 23
18 2007-03-22 21921 520 1 15465 0.99 23
19 2007-03-22 20970 418 1 15466 4.99 23
20 2007-03-22 23647 103 1 15467 3.99 23
21 2007-03-22 20768 399 1 15468 4.99 23
22 2007-03-22 22610 597 1 15469 4.99 23
23 2007-03-23 24692 215 1 15583 2.99 4
24 2007-03-23 21731 500 1 15584 2.99 4
25 2007-03-23 22149 545 1 15585 0.99 4
26 2007-03-23 21723 499 1 15587 7.99 4
27 2007-03-23 21764 503 1 15588 3.99 4
28 2007-03-23 22901 22 1 15589 6.99 4
29 2007-03-23 23284 62 1 15591 0.99 4
30 2007-03-23 24163 153 1 15593 1.99 4
31 2007-03-23 25135 264 1 15595 4.99 4
32 2007-03-23 24478 186 1 15596 0.99 4
33 2007-03-23 23323 66 1 15598 8.99 4
34 2007-03-23 23648 103 1 15599 5.99 4
35 2007-03-23 23729 113 1 15600 1.99 4
(df == df_sql).all().all()
True

▶ R

Once again, a similar pipeline can be built in R.

%%R -o df_r

df_r = payment %>% 
  mutate(date=lubridate::date(payment_date)) %>% 
  filter(date >= as.Date('2007-03-01') &  date <= as.Date('2007-03-31')) %>% 
  mutate(hour=lubridate::hour(payment_date)) %>% 
  filter(hour %in% c(4, 23) & rental_id > 15000 & staff_id == 1) %>% 
  arrange(payment_date) %>% 
  select(date, payment_id, customer_id, staff_id, rental_id, amount, hour) %>% 
  as.data.frame()

print(dim(df_r))
df_r
[1] 36  7
         date payment_id customer_id staff_id rental_id amount hour
1  2007-03-22      23134          46        1     15438   2.99   23
2  2007-03-22      23958         136        1     15439   4.99   23
3  2007-03-22      23095          42        1     15442   2.99   23
4  2007-03-22      22615         598        1     15443   7.99   23
5  2007-03-22      22960          28        1     15445   4.99   23
6  2007-03-22      20574         379        1     15446   4.99   23
7  2007-03-22      21651         490        1     15448   2.99   23
8  2007-03-22      20027         322        1     15450   0.99   23
9  2007-03-22      21866         514        1     15451   2.99   23
10 2007-03-22      20370         359        1     15453   1.99   23
11 2007-03-22      24882         238        1     15455   0.99   23
12 2007-03-22      25113         262        1     15456   0.99   23
13 2007-03-22      19906         306        1     15457   2.99   23
14 2007-03-22      22876          20        1     15460   2.99   23
15 2007-03-22      23646         103        1     15461   5.99   23
16 2007-03-22      20675         389        1     15462   5.99   23
17 2007-03-22      23869         127        1     15463   5.99   23
18 2007-03-22      23283          62        1     15464   6.99   23
19 2007-03-22      21921         520        1     15465   0.99   23
20 2007-03-22      20970         418        1     15466   4.99   23
21 2007-03-22      23647         103        1     15467   3.99   23
22 2007-03-22      20768         399        1     15468   4.99   23
23 2007-03-22      22610         597        1     15469   4.99   23
24 2007-03-23      24692         215        1     15583   2.99    4
25 2007-03-23      21731         500        1     15584   2.99    4
26 2007-03-23      22149         545        1     15585   0.99    4
27 2007-03-23      21723         499        1     15587   7.99    4
28 2007-03-23      21764         503        1     15588   3.99    4
29 2007-03-23      22901          22        1     15589   6.99    4
30 2007-03-23      23284          62        1     15591   0.99    4
31 2007-03-23      24163         153        1     15593   1.99    4
32 2007-03-23      25135         264        1     15595   4.99    4
33 2007-03-23      24478         186        1     15596   0.99    4
34 2007-03-23      23323          66        1     15598   8.99    4
35 2007-03-23      23648         103        1     15599   5.99    4
36 2007-03-23      23729         113        1     15600   1.99    4

Customer lifetime value

Which customers made their first order on a weekend, paid more than $5, and have a customer lifetime value (total amount spent) exceeding $100?

▶ SQL

For this query we need to extract the day of the week of each transaction. This can be done with EXTRACT using the dow field.

Note that in PostgreSQL Sunday is coded as 0, and Saturday as 6.

q = '''
SELECT t.* FROM (
    SELECT 
        p.*, 
        EXTRACT(dow FROM p.payment_date)::int  dow,
        (
            SELECT SUM(p3.amount) 
            FROM payment p3
            WHERE p3.customer_id = p.customer_id   
        ) as CLV
    FROM 
        payment p
    WHERE 
        p.payment_id = (
            SELECT MIN(p2.payment_id)
            FROM payment p2
            WHERE p.customer_id = p2.customer_id
        ) 
        AND 
            EXTRACT(dow FROM p.payment_date) IN (0, 6)
        AND 
            p.amount > 5     

    GROUP BY 1
)t WHERE t.CLV > 100
ORDER BY t.CLV DESC
'''
df_sql = pd.read_sql_query(q, con)
print(df_sql.shape)
df_sql
(17, 8)
payment_id customer_id staff_id rental_id amount payment_date dow clv
0 19029 137 1 2469 6.99 2007-02-18 18:52:49.996577 0 191.62
1 18572 21 2 2235 7.99 2007-02-18 02:37:16.996577 0 146.68
2 17526 346 1 1994 5.99 2007-02-17 09:35:32.996577 6 145.70
3 19502 265 2 2027 7.99 2007-02-17 11:35:22.996577 6 132.72
4 17509 342 2 2190 5.99 2007-02-17 23:58:17.996577 6 130.68
5 17866 436 1 2291 9.99 2007-02-18 06:05:12.996577 0 126.73
6 18099 497 2 2180 8.99 2007-02-17 23:16:09.996577 6 121.73
7 18995 128 2 2519 7.99 2007-02-18 22:47:47.996577 0 118.70
8 19496 263 2 2126 8.99 2007-02-17 19:23:02.996577 6 116.73
9 18636 32 2 1887 6.99 2007-02-17 02:21:44.996577 6 112.74
10 19345 225 2 2226 7.99 2007-02-18 02:08:22.996577 0 111.76
11 18395 579 2 2425 5.99 2007-02-18 16:06:11.996577 0 111.73
12 18554 16 2 1934 6.99 2007-02-17 05:33:23.996577 6 109.75
13 18666 40 2 2470 7.99 2007-02-18 18:56:57.996577 0 105.74
14 18446 593 2 2055 5.99 2007-02-17 13:55:29.996577 6 101.76
15 18367 572 2 1889 10.99 2007-02-17 02:33:38.996577 6 100.76
16 19441 251 1 2238 6.99 2007-02-18 02:50:32.996577 0 100.75

▶ Python

Let’s break the pipeline into two parts. First, we compute the subset of customers who placed their first order on a weekend.

Note that unlike PostgreSQL, pandas codes Saturday as 5 and Sunday as 6.
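As a quick sanity check of that encoding:

# pandas: Monday=0, ..., Saturday=5, Sunday=6
pd.Timestamp('2007-02-17').weekday()   # a Saturday -> 5
pd.Timestamp('2007-02-18').weekday()   # a Sunday   -> 6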

subset_fo = (payment
               .assign(dow=lambda x: x.payment_date.dt.weekday) # extract the day of the week as an integer
               .groupby('customer_id')[['payment_date', 'payment_id', 'amount', 'rental_id', 'staff_id', 'dow']]
               .first()                                         # pick the first row for each group
               .query('(amount > 5) & (dow in [5, 6])')
             )
subset_fo.head()
payment_date payment_id amount rental_id staff_id dow
customer_id
16 2007-02-17 05:33:23.996577 18554 6.99 1934 2 5
17 2007-02-17 22:46:24.996577 18558 5.99 2175 2 5
21 2007-02-18 02:37:16.996577 18572 7.99 2235 2 6
32 2007-02-17 02:21:44.996577 18636 6.99 1887 2 5
40 2007-02-18 18:56:57.996577 18666 7.99 2470 2 6

We then join the payment table on this subset to get the information we want.

df = (payment
       .loc[payment.customer_id.isin(subset_fo.index)]
       .groupby('customer_id')[['amount']].sum()
       .rename(columns={'amount':'clv'})
       .query('clv >= 100')
       .join(subset_fo, how='left')
       .sort_values('clv', ascending=False)
       .reset_index()
       .loc[:, df_sql.columns]
    )

print(df.shape)
df
(17, 8)
payment_id customer_id staff_id rental_id amount payment_date dow clv
0 19029 137 1 2469 6.99 2007-02-18 18:52:49.996577 6 191.62
1 18572 21 2 2235 7.99 2007-02-18 02:37:16.996577 6 146.68
2 17526 346 1 1994 5.99 2007-02-17 09:35:32.996577 5 145.70
3 19502 265 2 2027 7.99 2007-02-17 11:35:22.996577 5 132.72
4 17509 342 2 2190 5.99 2007-02-17 23:58:17.996577 5 130.68
5 17866 436 1 2291 9.99 2007-02-18 06:05:12.996577 6 126.73
6 18099 497 2 2180 8.99 2007-02-17 23:16:09.996577 5 121.73
7 18995 128 2 2519 7.99 2007-02-18 22:47:47.996577 6 118.70
8 19496 263 2 2126 8.99 2007-02-17 19:23:02.996577 5 116.73
9 18636 32 2 1887 6.99 2007-02-17 02:21:44.996577 5 112.74
10 19345 225 2 2226 7.99 2007-02-18 02:08:22.996577 6 111.76
11 18395 579 2 2425 5.99 2007-02-18 16:06:11.996577 6 111.73
12 18554 16 2 1934 6.99 2007-02-17 05:33:23.996577 5 109.75
13 18666 40 2 2470 7.99 2007-02-18 18:56:57.996577 6 105.74
14 18446 593 2 2055 5.99 2007-02-17 13:55:29.996577 5 101.76
15 18367 572 2 1889 10.99 2007-02-17 02:33:38.996577 5 100.76
16 19441 251 1 2238 6.99 2007-02-18 02:50:32.996577 6 100.75
Since the day-of-week encodings differ between PostgreSQL and pandas, we exclude the dow column from the comparison.

(df.drop(columns='dow') == df_sql.drop(columns='dow')).all().all()
True

▶ R

Let’s do the same with R. We’ll drop the payment_date column at the end so that all the remaining columns fit horizontally.

Of course, R (lubridate) uses yet another encoding for the days of the week: Saturday is represented as 7 and Sunday as 1.

%%R

subset_fo = payment %>% 
              group_by(customer_id) %>% 
              mutate(dow=lubridate::wday(payment_date)) %>% 
              filter(row_number()==1 & dow %in% c(1, 7) & amount > 5) 

df_r = payment %>% 
        right_join(subset_fo, by="customer_id") %>% 
        group_by(customer_id) %>% 
        summarise(clv=sum(amount.x)) %>% 
        filter(clv > 100) %>% 
        left_join(subset_fo, by='customer_id') %>% 
        select(payment_id, customer_id, staff_id, rental_id, amount, payment_date, dow, clv) %>% 
        arrange(desc(clv)) %>%
        select(-payment_date)

print(dim(df_r))
as.data.frame(df_r)
[1] 17  7
   payment_id customer_id staff_id rental_id amount dow    clv
1       19029         137        1      2469   6.99   1 191.62
2       18572          21        2      2235   7.99   1 146.68
3       17526         346        1      1994   5.99   7 145.70
4       19502         265        2      2027   7.99   7 132.72
5       17509         342        2      2190   5.99   7 130.68
6       17866         436        1      2291   9.99   1 126.73
7       18099         497        2      2180   8.99   7 121.73
8       18995         128        2      2519   7.99   1 118.70
9       19496         263        2      2126   8.99   7 116.73
10      18636          32        2      1887   6.99   7 112.74
11      19345         225        2      2226   7.99   1 111.76
12      18395         579        2      2425   5.99   1 111.73
13      18554          16        2      1934   6.99   7 109.75
14      18666          40        2      2470   7.99   1 105.74
15      18446         593        2      2055   5.99   7 101.76
16      18367         572        2      1889  10.99   7 100.76
17      19441         251        1      2238   6.99   1 100.75

How many movies have a replacement cost above or below the average replacement cost?

One way to answer this is to compute how many movies have a replacement cost higher than the average, how many have a replacement cost lower than or equal to the average, and take the union of the two.

▶ SQL

q = '''
SELECT t.grouping, COUNT(*) FROM (
    SELECT f.*, 'above' as grouping
    FROM film f
    WHERE f.replacement_cost > (SELECT AVG(f2.replacement_cost) FROM film f2) 

    UNION 

    SELECT f.*, 'below_eq' as grouping
    FROM film f
    WHERE f.replacement_cost <= (SELECT AVG(f2.replacement_cost) FROM film f2) 
    )t 
GROUP BY 1
'''
df_sql = pd.read_sql_query(q, con)
print(df_sql.shape)
df_sql.head()
(2, 2)
grouping count
0 below_eq 464
1 above 536

▶ Python

The Python syntax is definitely simpler here because we can create a Boolean mask telling us whether the replacement cost is above average, and then use value_counts to tally up the True and False counts.

(film
  .assign(above=film.replacement_cost > film.replacement_cost.mean())
  .loc[:, 'above']
  .value_counts()
  .rename(index={True: 'above', False: 'below_eq'})
)
above       536
below_eq    464
Name: above, dtype: int64

▶ R

The R code is similar to the Python one.

%%R
film %>% 
  mutate(above=replacement_cost > mean(replacement_cost)) %>% 
  count(above) %>% 
  mutate(above=if_else(above, 'above', 'below_eq'))
# A tibble: 2 x 2
  above        n
  <chr>    <int>
1 below_eq   464
2 above      536

Which movie generated the most revenue for each viewer rating category?

▶ SQL

Grouping by rating is straightforward; however, we want the top-performing movie within each group. We can use a window function (the OVER clause) to do this.

q = '''
WITH some_table AS (
    SELECT 
        f.film_id, 
        f.title, 
        f.rating, 
        SUM(p.amount),
        ROW_NUMBER() OVER(PARTITION BY f.rating ORDER BY SUM(p.amount) DESC)

    FROM film f
    JOIN inventory i ON f.film_id = i.film_id
    JOIN rental r ON r.inventory_id = i.inventory_id
    JOIN payment p ON p.rental_id = r.rental_id
    GROUP BY 1, 2, 3
    ORDER BY 3) 
    
SELECT st.* FROM some_table st 
WHERE st.row_number = 1
ORDER BY 4 DESC
'''
df_sql = pd.read_sql_query(q, con)
df_sql
film_id title rating sum row_number
0 879 Telegraph Voyage PG 215.75 1
1 1000 Zorro Ark NC-17 199.72 1
2 460 Innocent Usual PG-13 191.74 1
3 764 Saturday Lambs G 190.74 1
4 938 Velvet Terminator R 152.77 1

▶ Python

The pattern below is a common one. We group by some features to compute an aggregate measure, and then ungroup (reset_index) to group the data again at a different level of granularity.

df = (film
       .merge(inventory, on='film_id', how='left')    
       .merge(rental, on='inventory_id', how='left')
       .merge(payment, on='rental_id', how='left')
       .groupby(['rating','film_id'])[['amount']]
       .sum()
       .sort_values('amount', ascending=False)
       .rename(columns={'amount': 'sum'})
       .reset_index()
       .groupby('rating')
       .first()
       .merge(film, on='film_id', how='left')
       .sort_values('sum', ascending=False)
       .reset_index()
       .loc[:, ['film_id','title','rating','sum']]
      )

df
film_id title rating sum
0 879 Telegraph Voyage PG 215.75
1 1000 Zorro Ark NC-17 199.72
2 460 Innocent Usual PG-13 191.74
3 764 Saturday Lambs G 190.74
4 938 Velvet Terminator R 152.77
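For completeness, here is a sketch of a more literal translation of the SQL window function: a groupby combined with rank emulates ROW_NUMBER() OVER (PARTITION BY rating ORDER BY SUM(amount) DESC). The revenue variable name is ours.

revenue = (film
            .merge(inventory, on='film_id', how='left')
            .merge(rental, on='inventory_id', how='left')
            .merge(payment, on='rental_id', how='left')
            .groupby(['rating', 'film_id', 'title'])['amount']
            .sum()
            .astype(float)             # make sure the totals are plain floats
            .reset_index(name='sum'))

# rank the movies within each rating partition by revenue
revenue['row_number'] = (revenue
                          .groupby('rating')['sum']
                          .rank(method='first', ascending=False)
                          .astype(int))

revenue.query('row_number == 1').sort_values('sum', ascending=False)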

▶ R

This is essentially the same approach as in Python.

%%R
df_r = film %>% 
  left_join(inventory, by='film_id') %>% 
  left_join(rental, by='inventory_id') %>% 
  left_join(payment, by='rental_id') %>% 
  group_by(rating, film_id) %>% 
  summarise(sum=sum(amount, na.rm=TRUE)) %>% 
  ungroup() %>% 
  arrange(desc(sum)) %>% 
  group_by(rating) %>% 
  filter(row_number()==1) %>% 
  left_join(film, by='film_id') %>% 
  select(film_id, title, rating.x, sum) 

as.data.frame(df_r)
  film_id             title rating.x    sum
1     879  Telegraph Voyage       PG 215.75
2    1000         Zorro Ark    NC-17 199.72
3     460    Innocent Usual    PG-13 191.74
4     764    Saturday Lambs        G 190.74
5     938 Velvet Terminator        R 152.77

Hopefully, the examples above show that a significant part of the syntax used to wrangle structured data (data arranged in tables) in both Python (pandas) and R (tidyverse) has been influenced by SQL.

Many of the pipelines we built consist of the same steps as the corresponding SQL query. This is especially true of basic queries, where we do simple things like grouping by columns or building a simple aggregate measure. Similarly, syntactic differences aside, joining operations work in a comparable way across the three languages.

As we consider more complex operations, however, a direct port of the SQL query to the other languages may not be optimal. We saw a small example of this when we computed which days in March had no rentals. The good news is that these languages aren’t mutually exclusive; after all, we did run all our SQL via Python (pandas). Rather, these languages complement each other and can be combined to build pipelines that perform complex manipulations of the data efficiently. Very often, SQL is used first to extract a subset of the data from a large database, which is then manipulated and transformed further in Python or R.
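To make that concrete, here is a minimal sketch of this division of labour, reusing the connection from this post (variable names are ours): the database does the heavy filtering, and pandas takes over for the last mile.

# let the database keep only the rows we care about...
q = 'SELECT customer_id, amount FROM payment WHERE amount > 5;'
big_spenders = pd.read_sql_query(q, con)

# ...then refine the result in pandas
(big_spenders
   .groupby('customer_id')['amount']
   .agg(['count', 'sum'])
   .sort_values('sum', ascending=False)
   .head())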

To steal a line from Alexandre Dumas’ The Three Musketeers, this is truly a case of All for one and one for all!

