Getting Started With Regular Expressions in Python

  • What is a regular expression? A small "language": a sequence of characters that defines a search pattern. The pattern can then be passed to a search or replace function.
  • Regular expressions are not unique to Python and have been around a long time. Most editors support them, and they're definitely worth learning. The UNIX search tool grep takes its name from the ed editor command g/re/p ("globally search for a regular expression and print"); it looks for matches in a set of files.

This is NOT a Complete Guide

We'll whet your appetite with some examples, but this is not a complete guide to regular expressions. Whole websites and books are devoted to them, and it's worth spending more time on the topic.

Exact Match -- Simple, But Possible Without Regular Expressions Too

The simplest case is an exact match:

In [43]:
# "import re" makes the regular expression module available
import re

# Find a needle in a haystack
haystack = "There is a needle in here somewhere"
needle = "needle"
match = re.search(needle, haystack)
if match:
    print("I found a %s at index %d." % (match.group(0), match.start()))        
else:
    print("Not found!")   
I found a needle at index 11.

NOTE: You can also do an exact-match search without a regular expression. str.find returns the index of the first occurrence of a substring, or -1 if it is not found. Below we also slice the matched text back out of the original string, which is a bit ugly (and, since this is an exact match, unnecessary), but we leave it in as an illustration.

In [42]:
# Get the index of the needle (-1 if it is not there)
index = haystack.find(needle)

if index > -1:    
    needle_again = haystack[index:index+len(needle)]
    print("At index %d, I found a %s." % (index, needle_again))
    
At index 11, I found a needle.
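
One caveat about exact matches is worth a small sketch: when the needle itself contains characters that are special in a regular expression, re.search no longer treats it literally. The strings below are hypothetical sample data, not from the example above; re.escape is the standard-library helper that makes a string literal again.

```python
import re

# Hypothetical sample data, chosen so the difference is visible
haystack = "Price: 3x50 or 3.50"
needle = "3.50"

# Plain string tools treat the needle literally
found = needle in haystack          # membership test, no index needed
index = haystack.find(needle)       # index of first occurrence, or -1

# As a regex, the "." in the needle means "any character",
# so the unescaped pattern matches "3x50" first
loose = re.search(needle, haystack)

# re.escape backslash-escapes special characters so the needle is literal again
strict = re.search(re.escape(needle), haystack)

print(found, index)
print(loose.group(0), loose.start())
print(strict.group(0), strict.start())
```

If all you need is a literal search, the string methods are usually the simpler choice; re.escape is mainly useful when a literal piece has to be embedded in a larger pattern.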

Better Uses for Regular Expressions, Matching a Pattern

In the following example we use two powerful features of regular expressions:

  • Shorthand character classes - ways to match a type or category of characters, not an exact match.
  • Quantifiers - ways to say you want to match exactly N times, "zero or more" times, "one or more" times, etc.

For example, \d matches any single digit, while \d+ matches a run of one or more digits. Below we match a single digit, then the first group of digits, then every group of digits:

In [82]:
def find_single_digit():
    pattern = r'\d'
    string = 'One is the loneliest number, not  77, not 52'
    match = re.search(pattern, string)
    if match:
        print("Found a digit: ", match.group(0))
    
# Find the first GROUP of digits
def find_single_group_of_digits():
    pattern = r'\d+'
    string = 'One is the loneliest number, not  77, not 52'
    match = re.search(pattern, string)
    if match:
        print("Found a group:", match.group(0))

# Find all the groups of digits using findall
def find_all_groups_of_digits():
    pattern = r'\d+'
    string = 'One is the loneliest number, not  77, not 52'
    match_list = re.findall(pattern, string)
    if match_list:
        print("Found a list: ", match_list)

find_single_digit()
find_single_group_of_digits()
find_all_groups_of_digits()
Found a digit:  7
Found a group: 77
Found a list:  ['77', '52']
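
To see the quantifiers side by side, here is a minimal sketch that tests a few of them against whole strings. It uses re.fullmatch (which succeeds only when the pattern matches the entire string) so that each quantifier's meaning is not blurred by partial matches; the sample strings and the names samples, patterns and results are made up for illustration.

```python
import re

# Hypothetical test strings: zero, one, two and three digits
samples = ["", "7", "77", "777"]

patterns = {
    r'\d*': "zero or more digits",
    r'\d+': "one or more digits",
    r'\d?': "zero or one digit",
    r'\d{2}': "exactly two digits",
    r'\d{2,3}': "two to three digits",
}

results = {}
for pattern, meaning in patterns.items():
    # Keep only the samples that the pattern matches in full
    matched = [s for s in samples if re.fullmatch(pattern, s)]
    results[pattern] = matched
    print("%-8s %-20s %s" % (pattern, meaning, matched))
```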

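The inverse of a shorthand class is usually the same letter in upper case: \s is a whitespace character and \S is any non-whitespace character; likewise \d and \D (digit / non-digit) and \w and \W (word character / non-word character).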

In [86]:
# Match a non-digit character + oo + non-digit character
# Excludes '9oo8'
pattern = r'\Doo\D'
string = 'A good book is food for the mind, but 9oo8 makes no sense, does it?'
print(re.findall(pattern, string))
['good', 'book', 'food']
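
The same lower-case / upper-case pairing can be sketched for the whitespace and word-character classes. The sample string here is hypothetical; \w matches letters, digits and the underscore, and \W matches everything else.

```python
import re

# Hypothetical sample string
string = "Room 101, 2nd floor"

non_space_runs = re.findall(r'\S+', string)  # runs of non-whitespace
space_runs = re.findall(r'\s+', string)      # runs of whitespace
word_runs = re.findall(r'\w+', string)       # runs of word characters
non_word_runs = re.findall(r'\W+', string)   # runs of everything else

print(non_space_runs)
print(space_runs)
print(word_runs)
print(non_word_runs)
```

Note that the comma stays attached to '101,' in the \S+ result, because a comma is not whitespace, but it is dropped from the \w+ result, because it is not a word character either.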

Matching a 5-Digit Zip Code

In [91]:
import re

# Find zip codes, take 1

# r'\d{5}'  = r (raw string) \d (a digit) {5} (exactly 5 of them)
pattern = r'\d{5}'  

string_with_zips = '''My zip code is 28262. I live in Charlotte.   
I used to live in 02895.  I never lived in Hernando, Mississippi, 
38632-8945'''

print(re.findall(pattern, string_with_zips))
['28262', '02895', '38632']
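
One limitation of this first attempt is that \d{5} will also match the first five digits of a longer number. A possible refinement, sketched here as an assumption rather than as part of the original example, is to add \b (word boundary) anchors on both sides. The sample text and variable names are hypothetical.

```python
import re

# Hypothetical sample: a 7-digit order number plus a real zip code
text = "Order 1234567 shipped to 28262"

# Without boundaries, the first five digits of the order number match too
loose_matches = re.findall(r'\d{5}', text)

# \b requires a word boundary, so the five digits must stand alone
strict_matches = re.findall(r'\b\d{5}\b', text)

print(loose_matches)
print(strict_matches)
```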

An Improved Zip Code Matcher

The example below shows that regular expressions can get pretty complex! It's a good idea to always add plenty of comments, since regular expressions can be hard to understand if you're not the author. (Or even if you ARE the author and you return to them later.)

In [102]:
# Match exactly five digits, MAYBE followed by a hyphen and exactly four digits
#     \d{5}  (Exactly 5 digits)
#     (?:)   (non-capturing group)
#     -\d{4} (A hyphen followed by exactly 4 digits)
#     ?      (Final ? makes the non-capturing group containing hyphen and four digits optional)
better_pattern  = r'\d{5}(?:-\d{4})?'  

print(re.findall(better_pattern, string_with_zips))
['28262', '02895', '38632-8945']
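
Another way to keep the comments next to the pattern is the re.VERBOSE flag, which lets whitespace and # comments appear inside the pattern itself. The sketch below rewrites the same zip code pattern that way; the names verbose_pattern and sample_text are made up for illustration. It also shows, as a side note, why the group is non-capturing: with a capturing group, re.findall returns only the group's contents.

```python
import re

# Same pattern as above, written with re.VERBOSE so the comments live inside it
verbose_pattern = re.compile(r'''
    \d{5}        # exactly five digits
    (?:          # start of a non-capturing group
        -\d{4}   # a hyphen followed by exactly four digits
    )?           # end of group; the ? makes the whole group optional
''', re.VERBOSE)

# Hypothetical sample text
sample_text = "28262, 02895 and 38632-8945"

verbose_matches = verbose_pattern.findall(sample_text)
print(verbose_matches)

# With a capturing group instead, findall returns only what the group matched
capturing_matches = re.findall(r'\d{5}(-\d{4})?', sample_text)
print(capturing_matches)
```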