Introduction to regular expressions


Jamie Zawinski:

Some people, when confronted with a problem, think "I know, I'll use regular expressions".
Now they have two problems.

1. Prelude:

Regular expressions are powerful...


http://xkcd.com/208/


...but at first, they can be puzzling.


http://xkcd.com/1171/
from math import *
import numpy as np
import pandas as pd
from pathlib import Path

%matplotlib inline
import matplotlib.pyplot as plt

from IPython.display import display, HTML
display(HTML("<style>.container { width:90% !important; }</style>"))

2. Introduction

A regular expression (regex) is a sequence of literal characters and metacharacters that defines a search pattern.

Most programming languages have some implementation of regular expressions; however, their syntax may vary.

A familiar, simpler relative is the asterisk used as a wildcard by most shells to denote any number of characters.

For instance, *.txt denotes all files with a .txt extension.

The most elementary search pattern is the one that consists of the very characters you're looking for.

The find method of strings does just that.

s = "To be or not to be"
print(s.find('not'), s.find('question'))
9 -1

For a more general and powerful approach to pattern searching, Python has the re module.

import re

We’ll use the opening paragraphs from A Tale of Two Cities as an example.

dickens = '''
It was the best of times, it was the worst of times, 
it was the age of wisdom, it was the age of foolishness, it was the epoch of belief,
it was the epoch of incredulity, it was the season of Light, it was the season of Darkness,
it was the spring of hope, it was the winter of despair, we had everything before us, we had
nothing before us, we were all going direct to Heaven, we were all going direct the other way -
in short, the period was so far like the present period, that some of its noisiest authorities
insisted on its being received, for good or for evil, in the superlative degree of comparison only.

There were a king with a large jaw and a queen with a plain face, on the
throne of England; there were a king with a large jaw and a queen with
a fair face, on the throne of France. In both countries it was clearer
than crystal to the lords of the State preserves of loaves and fishes,
that things in general were settled for ever.

It was the year of Our Lord one thousand seven hundred and seventy-five.
Spiritual revelations were conceded to England at that favoured period,
as at this. Mrs. Southcott had recently attained her five-and-twentieth
blessed birthday, of whom a prophetic private in the Life Guards had
heralded the sublime appearance by announcing that arrangements were
made for the swallowing up of London and Westminster. Even the Cock-lane
ghost had been laid only a round dozen of years, after rapping out its
messages, as the spirits of this very year last past (supernaturally
deficient in originality) rapped out theirs. Mere messages in the
earthly order of events had lately come to the English Crown and People,
from a congress of British subjects in America: which, strange
to relate, have proved more important to the human race than any
communications yet received through any of the chickens of the Cock-lane
brood.
'''

A. Literals

The search function of the re module returns a match object (displayed as _sre.SRE_Match below; recent Python versions call it re.Match), which has some useful attributes and methods.

result = re.search('times', dickens)

print(type(result))
print(*[item for item in dir(result) if not item.startswith('_')], sep='\n')
<class '_sre.SRE_Match'>
end
endpos
expand
group
groupdict
groups
lastgroup
lastindex
pos
re
regs
span
start
string

The result of the search tells us whether the pattern is found in the string and, if so, where.

print("result.span(): ", result.span())

print("result.group():", result.group())

s = result.span()

print(dickens[s[0]:s[1]])
result.span():  (20, 25)
result.group(): times
times

There are other useful functions for finding a pattern besides the search function.

Remark:

If a search pattern is to be reused, it is good practice to compile it first, which also saves a little work on each call. We'll do that henceforth.
pattern = re.compile(r'times')

# only matches the beginning of the string.
result = pattern.match(dickens)
print('match:  ', result)

# search for first match
result = pattern.search(dickens)
print('search: ', result.group())

# returns a list of all matches.
result = pattern.findall(dickens)
print('findall:', result)
match:   None
search:  times
findall: ['times', 'times']
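
Another function in the same family is finditer, which yields one match object per occurrence, so the position of every match is available and not just the first. A minimal sketch on a short made-up sentence (the variable names are illustrative and not part of the original notebook):

```python
import re

# Short sample sentence, chosen for illustration only.
text = "the best of times, the worst of times"
pattern = re.compile(r'times')

# finditer yields a match object for each non-overlapping occurrence.
spans = [m.span() for m in pattern.finditer(text)]
print(spans)

# Each span can be used to slice the original string.
print([text[start:end] for start, end in spans])
```

This is handy when the location of each match matters, which findall alone does not report.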

Note:

What we've used above is an example of a search pattern involving literals. The pattern was the very string we were looking for.

Notice that we didn't have to worry about punctuation. Only the substring that matches the pattern is returned by the search or findall function, if it is found.

B. Character classes

Sometimes we’d like to consider variants of a string. For instance, "It" and "it".

This means that we need to look for either an uppercase “i” or a lowercase one, followed by the letter “t”.

Character classes are specified using square brackets.

pattern = re.compile(r'[Ii]t')

result = pattern.findall(dickens)

print(result)
['It', 'it', 'it', 'it', 'it', 'it', 'it', 'it', 'it', 'it', 'it', 'it', 'it', 'it', 'it', 'it', 'it', 'it', 'it', 'It', 'it', 'it', 'it', 'it', 'it']

This is a simple example of a character set.

[Ii] means match any one of the characters between the square brackets.

Using character sets allows us to greatly simplify our syntax.

Ranges such as A-Z let us write common character sets compactly.

Character class   Matches
[A-Z]             any single uppercase letter of the Latin alphabet
[a-z]             any single lowercase letter of the Latin alphabet
[A-Za-z]          any single letter of the Latin alphabet, lowercase or uppercase
[0-9]             any single digit between 0 and 9
[^0-9]            any single character that is not a digit

Beware of the shortcut [A-z]: it also matches the characters [, \, ], ^, _ and `, which sit between Z and a in ASCII.

Here's an example:

re.findall(r'[0-9]', "Today is the 23rd of May")
['2', '3']
re.findall(r'[A-Za-z]', "Today is the 23rd of May")
['T',
 'o',
 'd',
 'a',
 'y',
 'i',
 's',
 't',
 'h',
 'e',
 'r',
 'd',
 'o',
 'f',
 'M',
 'a',
 'y']
re.findall(r'[^A-Za-z ]', "Today is the 23rd of May")
['2', '3']
re.findall(r'[^0-9]', "Today is the 23rd of May")
['T',
 'o',
 'd',
 'a',
 'y',
 ' ',
 'i',
 's',
 ' ',
 't',
 'h',
 'e',
 ' ',
 'r',
 'd',
 ' ',
 'o',
 'f',
 ' ',
 'M',
 'a',
 'y']
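
A word of caution about ranges: a class such as [A-z] spans every character between A and z in ASCII order, which includes a few punctuation characters as well as letters. The sketch below compares it with [A-Za-z] on a made-up string (the sample and variable names are assumptions for illustration):

```python
import re

# Hypothetical sample containing characters that sit between 'Z' and 'a' in ASCII.
sample = "snake_case [x] ^ y"

loose = re.findall(r'[A-z]', sample)      # range runs from 'A' (65) to 'z' (122)
strict = re.findall(r'[A-Za-z]', sample)  # letters only

# Characters matched by the loose class but not by the strict one.
extra = [ch for ch in loose if ch not in strict]
print(extra)
```

On text made only of letters, digits and spaces the two classes behave identically, which is why the difference is easy to miss.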

Back to our long string.

If we were searching for a substring made of any single uppercase letter followed by any single lowercase letter, we could write:

pattern = re.compile(r'[A-Z][a-z]')
result = pattern.findall(dickens)
print(result)
['It', 'Li', 'Da', 'He', 'Th', 'En', 'Fr', 'In', 'St', 'It', 'Ou', 'Lo', 'Sp', 'En', 'Mr', 'So', 'Li', 'Gu', 'Lo', 'We', 'Ev', 'Co', 'Me', 'En', 'Cr', 'Pe', 'Br', 'Am', 'Co']

Note:

  • Notice that except for a couple of elements in the list, we didn't get entire words.

  • The regular expression only specifies a pattern with one uppercase letter, followed by one lowercase letter.

  • That's usually not what you're after. You'd probably want to extract whole words which start with an uppercase letter, followed by any number of lowercase letters (capitalised words).

Maybe all we need is to add more [a-z] sets to our patterns? Let’s try that…

pattern = re.compile(r'[A-Z][a-z][a-z]')
result = pattern.findall(dickens)
print(result)
['Lig', 'Dar', 'Hea', 'The', 'Eng', 'Fra', 'Sta', 'Our', 'Lor', 'Spi', 'Eng', 'Mrs', 'Sou', 'Lif', 'Gua', 'Lon', 'Wes', 'Eve', 'Coc', 'Mer', 'Eng', 'Cro', 'Peo', 'Bri', 'Ame', 'Coc']

That’s not really what we want.

Sure, we’ve now got one more lowercase letter after the initial uppercase one, but we still don’t have whole words, unless our string contains capitalised, three-letter words. To top it off, we’ve now lost "It"!

In other words, unless you want to extract only capitalised words with a specific number of letters, this is not the way to go.

Actually, even in that case, this is not a smart way to do it.

If we wanted to find all the capitalised words that are, say, 10 letters long, we don’t really want to have to type a pattern such as:

[A-Z][a-z][a-z][a-z][a-z][a-z][a-z][a-z][a-z][a-z]

We need a way to specify that we want one or more lowercase letters.

For this we need more than literals or character sets; we need metacharacters.

C. Metacharacters

Literals are characters which are simply part of the pattern we are looking for.

Metacharacters, on the other hand, act like modifiers. They change how the literals or character classes are handled.

By default each literal is matched only once. By using the + symbol, any character or character class appearing just before the metacharacter will be matched one or more times.

There are several modifiers that we can use.

Modifier   Number of occurrences
+          one or more
*          zero or more
?          zero or one
{m}        exactly m
{m,n}      between m and n (written without a space after the comma)

For instance, if we wanted to extract a list of the words that are capitalised in a string, in the past we may have written something like this:

print([word for word in dickens.strip().split() if word[0].isupper() and word[1].islower()])
['It', 'Light,', 'Darkness,', 'Heaven,', 'There', 'England;', 'France.', 'In', 'State', 'It', 'Our', 'Lord', 'Spiritual', 'England', 'Mrs.', 'Southcott', 'Life', 'Guards', 'London', 'Westminster.', 'Even', 'Cock-lane', 'Mere', 'English', 'Crown', 'People,', 'British', 'America:', 'Cock-lane']

Notice, however, that some of the words have punctuation symbols attached to them.

No big deal, we know how to deal with this.

import string 
print([word.strip(string.punctuation) for word in dickens.split() if word[0].isupper() and word[1].islower()])
['It', 'Light', 'Darkness', 'Heaven', 'There', 'England', 'France', 'In', 'State', 'It', 'Our', 'Lord', 'Spiritual', 'England', 'Mrs', 'Southcott', 'Life', 'Guards', 'London', 'Westminster', 'Even', 'Cock-lane', 'Mere', 'English', 'Crown', 'People', 'British', 'America', 'Cock-lane']

That worked fine; however, the syntax is a bit unwieldy.

Another version could be:

import string 
print([word.strip(string.punctuation) for word in dickens.strip().split() 
 if word.istitle()])
['It', 'Light', 'Darkness', 'Heaven', 'There', 'England', 'France', 'In', 'State', 'It', 'Our', 'Lord', 'Spiritual', 'England', 'Mrs', 'Southcott', 'Life', 'Guards', 'London', 'Westminster', 'Even', 'Mere', 'English', 'Crown', 'People', 'British', 'America']

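Notice, however, that we lost a word: "Cock-lane" is gone, because istitle() requires every part of the word to be capitalised, and "lane" is not.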

Using a regular expression, we can specify character sets which match our pattern.

pattern = re.compile(r'[A-Z][a-z]+')
result = pattern.findall(dickens)
print(result)
['It', 'Light', 'Darkness', 'Heaven', 'There', 'England', 'France', 'In', 'State', 'It', 'Our', 'Lord', 'Spiritual', 'England', 'Mrs', 'Southcott', 'Life', 'Guards', 'London', 'Westminster', 'Even', 'Cock', 'Mere', 'English', 'Crown', 'People', 'British', 'America', 'Cock']

Notice that with this pattern, hyphenated words are split and only the first part is returned. Let’s handle this case.

pattern = re.compile(r'[A-Z][a-z]+-?[a-z]*')
result = pattern.findall(dickens)
print(result)
['It', 'Light', 'Darkness', 'Heaven', 'There', 'England', 'France', 'In', 'State', 'It', 'Our', 'Lord', 'Spiritual', 'England', 'Mrs', 'Southcott', 'Life', 'Guards', 'London', 'Westminster', 'Even', 'Cock-lane', 'Mere', 'English', 'Crown', 'People', 'British', 'America', 'Cock-lane']
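
As a recap of this section, the quantifiers from the table can be compared side by side on a small made-up string. This is only a sketch: the sample text and the variable names are not from the original notebook. The last lines also show how the ten-letter pattern typed out earlier shrinks to [A-Z][a-z]{9}.

```python
import re

# Made-up sample string for illustration.
sample = "a ab abb abbb abbbb"

patterns = {
    'ab+': 'one or more b',
    'ab*': 'zero or more b',
    'ab?': 'zero or one b',
    'ab{2}': 'exactly two b',
    'ab{2,3}': 'between two and three b',
}

results = {}
for pattern, meaning in patterns.items():
    results[pattern] = re.findall(pattern, sample)
    print(f"{pattern:<8} {meaning:<24} {results[pattern]}")

# The ten-letter capitalised word pattern, written with a quantifier.
ten_letters = re.findall(r'[A-Z][a-z]{9}', "Parliament sat late")
print(ten_letters)
```

Quantifiers are greedy by default, so ab{2,3} takes three b's whenever three are available.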

D. Other built-in character classes and metacharacters

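Class   Matches
.       any character except \n
\d      any digit
\D      any character that is not a digit
\w      any word character: letters, digits and the underscore (same as [0-9a-zA-Z_] when the re.ASCII flag is set)
\W      any character that is not a word character
\b      a word boundary (matches a position, not a character)
\s      any whitespace character (including \n and \t)
\S      any character that is not whitespace
^       start of the string (or of each line with re.MULTILINE)
$       end of the string (or of each line with re.MULTILINE)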
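
A few of these classes in action, on a made-up string (the sample and the variable names are illustrative assumptions):

```python
import re

# Made-up sample string for illustration.
sample = "Room 42, Floor 7: open 9am-5pm"

digits = re.findall(r'\d+', sample)           # runs of digits
words = re.findall(r'\b[A-Za-z]+\b', sample)  # whole alphabetic words only
chunks = re.findall(r'\S+', sample)           # runs of non-whitespace characters

print(digits)
print(words)
print(chunks)
```

Note how \b keeps "am" and "pm" out of the second list: there is no word boundary between a digit and a letter, since both are word characters.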

First word of each line.

With the re.MULTILINE flag, ^ and $ match at the start and end of every line, rather than only at the start and end of the whole string.

result = re.findall(r'^\w+', dickens, re.MULTILINE)
print(result)
['It', 'it', 'it', 'it', 'nothing', 'in', 'insisted', 'There', 'throne', 'a', 'than', 'that', 'It', 'Spiritual', 'as', 'blessed', 'heralded', 'made', 'ghost', 'messages', 'deficient', 'earthly', 'from', 'to', 'communications', 'brood']
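
To see what the flag changes, here is a small sketch comparing the same pattern with and without re.MULTILINE on a made-up three-line string:

```python
import re

# Three short lines, made up for illustration.
text = "first line\nsecond line\nthird line"

# Without the flag, ^ only matches at the very start of the string.
without_flag = re.findall(r'^\w+', text)

# With re.MULTILINE, ^ also matches right after each newline.
with_flag = re.findall(r'^\w+', text, re.MULTILINE)

print(without_flag)
print(with_flag)
```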

Last word of each line.

This is a bit more tricky. First note the following:

result = re.findall(r'\w+$', dickens, re.MULTILINE)
print(result)
['had', 'authorities', 'the', 'with', 'clearer', 'twentieth', 'had', 'were', 'lane', 'its', 'supernaturally', 'the', 'strange', 'any', 'lane']

The problem here is that our pattern only matches a word at the end of a line if no other character, such as a punctuation symbol, comes after it.

Let’s try to include the possibility of a punctuation symbol.

result = re.findall(r'\w+.?$', dickens, re.MULTILINE)
print(result)
['belief,', 'Darkness,', 'had', 'authorities', 'only.', 'the', 'with', 'clearer', 'fishes,', 'ever.', 'five.', 'period,', 'twentieth', 'had', 'were', 'lane', 'its', 'supernaturally', 'the', 'People,', 'strange', 'any', 'lane', 'brood.']

That’s better but we don’t actually want the punctuation symbols to appear in the result.

E. Capture Groups

We can use a capture group to specify which part of the pattern should be returned as a group, using parentheses.

Let’s first see how this works on a simple example.

No groups

We get a list back.

re.findall(r'\w+\s\d+\w{0,2}', "Let's meet on November 9 at 5pm, or November 12 at 11am or 4pm.")
['November 9', 'at 5pm', 'November 12', 'at 11am', 'or 4pm']

2 capture groups

We get a list of tuples with 2 elements.

re.findall(r'(\w+)\s(\d+)\w{0,2}', "Let's meet on November 9 at 5pm, or November 12 at 11am or 4pm.")
[('November', '9'), ('at', '5'), ('November', '12'), ('at', '11'), ('or', '4')]

1 non-capturing group, starting with (?:, and 1 capture group.

Only the capture group is returned, so we get a list of strings.

re.findall(r'(?:\w+)\s(\d+)\w{0,2}', "Let's meet on November 9 at 5pm, or November 12 at 11am or 4pm.")
['9', '5', '12', '11', '4']

Note:

This example is a bit lame. The same result could be achieved more simply, but it illustrates how non-capture groups work.

re.findall(r'\d+', "Let's meet on November 9 at 5pm, or November 12 at 11am or 4pm.")
['9', '5', '12', '11', '4']
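
Groups are also available on the match object returned by search, either one at a time or all at once. A minimal sketch on a shortened version of the sentence above (the variable names are illustrative):

```python
import re

sentence = "Let's meet on November 9 at 5pm"
match = re.search(r'(\w+)\s(\d+)', sentence)

whole = match.group(0)  # the entire match
month = match.group(1)  # first capture group
day = match.group(2)    # second capture group
both = match.groups()   # all capture groups, as a tuple

print(whole, month, day, both, sep=' | ')
```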

Back to our problem of finding the last words of each line.

result = re.findall(r'(\w+).?$', dickens, re.MULTILINE)
print(result)
['belief', 'Darkness', 'had', 'authorities', 'only', 'the', 'with', 'clearer', 'fishes', 'ever', 'five', 'period', 'twentieth', 'had', 'were', 'lane', 'its', 'supernaturally', 'the', 'People', 'strange', 'any', 'lane', 'brood']

Almost there. We just need to take care of hyphenated words and trailing whitespace.

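# You can usually write several different regular expressions to perform a task;
# however, it's best to make the regular expression as discriminating as possible.
# Raw strings (r'...') keep Python from interpreting the backslashes itself.

# Method 1
# Careful: inside square brackets, +-? is read as the range of characters
# from '+' to '?', not as quantifiers, so this class is broader than it looks.
result = re.findall(r'([\w+-?]+\w+).?[\s-]*$', dickens, re.MULTILINE)
print(result)


# Method 2
result = re.findall(r'([\w+-?]+\w+).?\W*$', dickens, re.MULTILINE)
print(result)


# Method 3
result = re.findall(r'([\w-]+)\W*$', dickens, re.MULTILINE)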
print(result)
['times', 'belief', 'Darkness', 'had', 'way', 'authorities', 'only', 'the', 'with', 'clearer', 'fishes', 'ever', 'seventy-five', 'period', 'five-and-twentieth', 'had', 'were', 'Cock-lane', 'its', 'supernaturally', 'the', 'People', 'strange', 'any', 'Cock-lane', 'brood']
['times', 'belief', 'Darkness', 'had', 'way', 'authorities', 'only', 'the', 'with', 'clearer', 'fishes', 'ever', 'seventy-five', 'period', 'five-and-twentieth', 'had', 'were', 'Cock-lane', 'its', 'supernaturally', 'the', 'People', 'strange', 'any', 'Cock-lane', 'brood']
['times', 'belief', 'Darkness', 'had', 'way', 'authorities', 'only', 'the', 'with', 'clearer', 'fishes', 'ever', 'seventy-five', 'period', 'five-and-twentieth', 'had', 'were', 'Cock-lane', 'its', 'supernaturally', 'the', 'People', 'strange', 'any', 'Cock-lane', 'brood']

3. Data Extraction

When we're working with data we haven't generated ourselves, we usually have little control over its formatting.

What's worse, the formatting can be inconsistent. This is where regular expressions can be extremely useful.

Consider the data below. We'd like to extract the credit card information for each entry.

Notice that the credit card numbers are not formatted in a consistent way.

messy_data = '''
Ms. Dixie T Patenaude 18 rue Descartes STRASBOURG Alsace 67100 FR France Dixie.Patenaude@teleworm.us Shound Cheecaey3s 03.66.62.81.38 Grondin 4/15/1958 MasterCard 5379 7969 2881 8421 958 12/2017 nan 1Z 114 58A 80 2148 893 8 Blue Safety specialist Atlas Realty 2000 Subaru Outback AnalystWatch.fr O- 191.0 86.8 5' 11" 156 dd0548bb-a8b5-438d-b181-c76ad282a9a1 48.577584 7.842637
Mr. Silvano G Romani 34 Faunce Crescent WHITE TOP NSW 2675 AU Australia Silvano-Romani@einrot.com Pock1993 AeV7ziek (02) 6166 5988 Sagese 2/25/1993 MasterCard 5253-7637-4959-3303 404 06-2018 nan 1Z 814 E43 42 9322 015 2 Green Coin vending and amusement machine servicer repairer Miller & Rhoads 1998 Honda S-MX StarJock.com.au B+ 128.3 58.3 6' 2" 189 7e310daa-46f5-407e-8dda-d975715ac4d5 -33.429793 145.234214
Mr. Felix C Fried 37 Jubilee Drive CAXTON nan CB3 5WG GB United Kingdom FelixFried@rhyta.com Derser Aisequ0haz 078 1470 0903 Eisenhauer 1/19/1933 Visa 4716046346218902 738 02 2018 SP 39 75 51 D 1Z V88 635 94 7608 112 4 Blue School psychologist Wetson's 2001 Audi Allroad MoverRelocation.co.uk O+ 188.5 85.7 5' 7" 169 95515377-74a9-4c1e-9117-44b9753dad8c 51.922175 -0.353221
'''
credit_card_number = re.compile(r'(\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4})')
credit_card_number.findall(messy_data)
['5379 7969 2881 8421', '5253-7637-4959-3303', '4716046346218902']
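
Once extracted, the numbers can be brought to a single format by removing the separators. A possible sketch using re.sub, which replaces every match of a pattern; the list below simply repeats the three numbers found above, and the variable names are illustrative:

```python
import re

# The three formats found in the data above.
numbers = ['5379 7969 2881 8421', '5253-7637-4959-3303', '4716046346218902']

# Remove every hyphen or whitespace character.
normalised = [re.sub(r'[-\s]', '', number) for number in numbers]
print(normalised)
```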

Let’s write a regular expression called height to extract the height of each person in the data.

height = re.compile(r'(\d)\'\s+(\d{,2})"')

height.findall(messy_data)
[('5', '11'), ('6', '2'), ('5', '7')]

For more complex patterns we can use named capture groups.

A. Named Capture Groups

Let's write a regular expression that will extract not just the credit card number, but also the card type, CCV number, and expiry date.

Note:

When you are extracting several groups that correspond to a particular type of information, it's often useful to associate a name with each group. That's what named groups allow you to do.

A named group has the form (?P<name>regex).

credit_card_details = re.compile(r'(?P<CardType>Visa|MasterCard).*'
                                 r'(?P<CardNumber>\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4})[-\s]?'
                                 r'(?P<CCV>\d{3})\s+'
                                 r'(?P<Expiry>\d{2}[- /]\d{4})')
for line in messy_data.strip().splitlines():
    print(credit_card_details.search(line).groupdict())
{'CardType': 'MasterCard', 'CardNumber': '5379 7969 2881 8421', 'CCV': '958', 'Expiry': '12/2017'}
{'CardType': 'MasterCard', 'CardNumber': '5253-7637-4959-3303', 'CCV': '404', 'Expiry': '06-2018'}
{'CardType': 'Visa', 'CardNumber': '4716046346218902', 'CCV': '738', 'Expiry': '02 2018'}

We can get a nicer output by transforming the dictionary into a dataframe.

pd.DataFrame([credit_card_details.search(line).groupdict() for line in  messy_data.strip().splitlines()])
   CCV           CardNumber    CardType   Expiry
0  958  5379 7969 2881 8421  MasterCard  12/2017
1  404  5253-7637-4959-3303  MasterCard  06-2018
2  738     4716046346218902        Visa  02 2018
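
One caveat: search returns None when a line does not match, so calling groupdict() directly on its result raises an AttributeError for such lines. A minimal sketch of a more defensive loop, using a simplified, hypothetical pattern and made-up records rather than the data above:

```python
import re

# Simplified, hypothetical pattern and records, for illustration only.
card = re.compile(r'(?P<CardType>Visa|MasterCard)\s+(?P<CardNumber>\d{16})')
records = [
    "Felix Visa 4716046346218902",
    "no card details on this line",
]

rows = []
for record in records:
    match = card.search(record)
    if match is None:
        # search returns None when nothing matches; skip instead of failing.
        continue
    # Named groups can be read individually by name.
    rows.append((match.group('CardType'), match.group('CardNumber')))

print(rows)
```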

Let’s now write a regular expression called email that will extract the email address from our data, and return the login and the internet service provider (the part after the @) for each entry, as two named capture groups.

email = re.compile(r'(?P<Login>\S+)@(?P<ISP>\S+)')

pd.DataFrame([email.search(line).groupdict() for line in  messy_data.strip().splitlines()])
           ISP            Login
0  teleworm.us  Dixie.Patenaude
1   einrot.com   Silvano-Romani
2    rhyta.com       FelixFried

B. Pandas & Regular expressions

For more complex data sets we can of course combine the strengths of regular expressions and Pandas.

As an example, let’s revisit the birthday formatting problem of a previous post.

path = Path("data/people.csv")
data = pd.read_csv(path, encoding='utf-8')
data.head()
   GivenName      Surname  Gender     StreetAddress            City         Country    Birthday  BloodType
0  Stepanida  Sukhorukova  female    62 Ockham Road  EASTER WHYNTIE  United Kingdom   8/25/1968         A+
1       Hiệu        Lương    male    4 Iolaire Road       NEW CROSS  United Kingdom   1/31/1962         A+
2      Petra      Neudorf  female  56 Victoria Road          LISTON  United Kingdom   1/10/1964         B+
3        Eho        Amano  female      83 Stroud Rd          OGMORE  United Kingdom   4/12/1933         O-
4       Noah       Niland    male     61 Wrexham Rd          FACEBY  United Kingdom  11/20/1946         A+

We'd like the dates to be formatted as year-month-day.

In a previous notebook we worked out a solution using the datetime module.

Let's do this using regular expressions.

Remark:

In a previous post we did more than simply change the format of the dates. We created special datetime objects which can be used to perform computations on dates. Here we're merely focussing on the formatting aspect to illustrate how regular expressions can be combined with our usual Pandas workflow.

# Compile the regex first for increased speed.
birthday = re.compile(r'(?P<month>\d{1,2})/(?P<day>\d{1,2})/(?P<year>\d{2,4})')

def transform_birthday(row):
    date = birthday.search(row['Birthday']).groupdict()
    return "-".join([date['year'], date['month'], date['day']])
    
data['Birthday'] = data.apply(transform_birthday, axis=1)
data.head()
   GivenName      Surname  Gender     StreetAddress            City         Country    Birthday  BloodType
0  Stepanida  Sukhorukova  female    62 Ockham Road  EASTER WHYNTIE  United Kingdom   1968-8-25         A+
1       Hiệu        Lương    male    4 Iolaire Road       NEW CROSS  United Kingdom   1962-1-31         A+
2      Petra      Neudorf  female  56 Victoria Road          LISTON  United Kingdom   1964-1-10         B+
3        Eho        Amano  female      83 Stroud Rd          OGMORE  United Kingdom   1933-4-12         O-
4       Noah       Niland    male     61 Wrexham Rd          FACEBY  United Kingdom  1946-11-20         A+
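
The dates above are not zero-padded (1968-8-25 rather than 1968-08-25). If padding matters, one option is to pass a function to the pattern's sub method, which calls it with each match object. The sketch below assumes four-digit years; date_pattern and to_iso are illustrative names, not part of the original notebook:

```python
import re

# Assumption: years always have four digits.
date_pattern = re.compile(r'(?P<month>\d{1,2})/(?P<day>\d{1,2})/(?P<year>\d{4})')

def to_iso(match):
    """Build a zero-padded year-month-day string from a match object."""
    return "{}-{:0>2}-{:0>2}".format(
        match.group('year'), match.group('month'), match.group('day')
    )

# sub replaces each match with the string returned by to_iso.
iso_dates = [date_pattern.sub(to_iso, date) for date in ["8/25/1968", "11/20/1946"]]
print(iso_dates)
```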

By using Pandas’ advanced regex syntax we can achieve the same thing in one line of code.

Let’s reload the original data.

data = pd.read_csv(path, encoding='utf-8')
data.head()
   GivenName      Surname  Gender     StreetAddress            City         Country    Birthday  BloodType
0  Stepanida  Sukhorukova  female    62 Ockham Road  EASTER WHYNTIE  United Kingdom   8/25/1968         A+
1       Hiệu        Lương    male    4 Iolaire Road       NEW CROSS  United Kingdom   1/31/1962         A+
2      Petra      Neudorf  female  56 Victoria Road          LISTON  United Kingdom   1/10/1964         B+
3        Eho        Amano  female      83 Stroud Rd          OGMORE  United Kingdom   4/12/1933         O-
4       Noah       Niland    male     61 Wrexham Rd          FACEBY  United Kingdom  11/20/1946         A+

We can extract the day, month and year of birth for each person in one line of code by passing the compiled regex to the str.extract method of our Pandas Series corresponding to the Birthday column.

data.Birthday.str.extract(birthday, expand=True).head()
  month day  year
0     8  25  1968
1     1  31  1962
2     1  10  1964
3     4  12  1933
4    11  20  1946

Note that the names of the columns have been automatically extracted from the named groups of the regex birthday.

Using this, we can use the apply method to process the date in the format that we want.

For more clarity, we wrap our method chaining expression in parentheses, which allows us to write each method call on a new line.

data.Birthday = (data
                     .Birthday
                     .str
                     .extract(birthday, expand=True)
                     .apply(lambda date:"-".join([date['year'], 
                                                  date['month'], 
                                                  date['day']]), 
                            axis=1)
                )
data.head()
   GivenName      Surname  Gender     StreetAddress            City         Country    Birthday  BloodType
0  Stepanida  Sukhorukova  female    62 Ockham Road  EASTER WHYNTIE  United Kingdom   1968-8-25         A+
1       Hiệu        Lương    male    4 Iolaire Road       NEW CROSS  United Kingdom   1962-1-31         A+
2      Petra      Neudorf  female  56 Victoria Road          LISTON  United Kingdom   1964-1-10         B+
3        Eho        Amano  female      83 Stroud Rd          OGMORE  United Kingdom   1933-4-12         O-
4       Noah       Niland    male     61 Wrexham Rd          FACEBY  United Kingdom  1946-11-20         A+

That being said, the best way to achieve our goal in pandas is to let it parse the dates automatically!

data = pd.read_csv(path, encoding='utf-8', parse_dates=['Birthday'])
data.head()
   GivenName      Surname  Gender     StreetAddress            City         Country    Birthday  BloodType
0  Stepanida  Sukhorukova  female    62 Ockham Road  EASTER WHYNTIE  United Kingdom  1968-08-25         A+
1       Hiệu        Lương    male    4 Iolaire Road       NEW CROSS  United Kingdom  1962-01-31         A+
2      Petra      Neudorf  female  56 Victoria Road          LISTON  United Kingdom  1964-01-10         B+
3        Eho        Amano  female      83 Stroud Rd          OGMORE  United Kingdom  1933-04-12         O-
4       Noah       Niland    male     61 Wrexham Rd          FACEBY  United Kingdom  1946-11-20         A+
data.dtypes
GivenName                object
Surname                  object
Gender                   object
StreetAddress            object
City                     object
Country                  object
Birthday         datetime64[ns]
BloodType                object
dtype: object

The End…

