Regular Expressions¶

In Python a regular expression search is typically written as: match = re.search(pat, str)

The re.search() function takes a regular expression pattern and a string and searches for that pattern within the string. If the search succeeds, search() returns a match object; otherwise it returns None.

The code match = re.search(pat, str) stores the search result in a variable named "match".
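
Because search() returns None when nothing matches, it is common to test the result before calling group() on it. A minimal sketch of that check follows; the helper name find_word and the sample strings are made up for illustration and are not part of the re module.

```python
import re

def find_word(text):
    """Hypothetical helper: return the first 'word:xxx' match in text, or None."""
    match = re.search(r'word:\w\w\w', text)
    # A match object is truthy, while None is falsy.
    if match:
        return match.group()
    return None

print(find_word('an example word:cat!!'))  # word:cat
print(find_word('no match here'))          # None
```

Guarding with an if statement avoids the AttributeError that is raised when group() is called on None.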

Passing Raw Strings to re.search()¶

The 'r' at the start of the pattern string designates a Python "raw" string, which passes backslashes through unchanged. This is very handy for regular expressions, so it is a good habit to always write pattern strings with the 'r' prefix.

Escape characters in Python use the backslash (\). The string value '\n' represents a single newline character, not a backslash followed by a lowercase n. To get a single literal backslash you need to type the escape sequence \\, so '\\n' is the string that represents a backslash followed by a lowercase n. However, by putting an r before the first quote of the string value, you can mark the string as a raw string, in which backslashes are not treated as escape characters.

Since regular expressions frequently use backslashes in them, it is convenient to pass raw strings to the re.search() function instead of typing extra backslashes. Typing r'\d\d\d-\d\d\d-\d\d\d\d' is much easier than typing '\\d\\d\\d-\\d\\d\\d-\\d\\d\\d\\d'.
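
The equivalence described above can be checked directly. This is a small sketch; the variable names escaped and raw and the sample sentence are chosen only for this example.

```python
import re

# The same pattern written with doubled backslashes and as a raw string.
escaped = '\\d\\d\\d-\\d\\d\\d-\\d\\d\\d\\d'
raw = r'\d\d\d-\d\d\d-\d\d\d\d'

print(escaped == raw)          # True: both contain the same characters
print(len('\n'), len(r'\n'))   # 1 2: a newline versus backslash + n

found = re.search(raw, 'Call 415-555-4242 now').group()
print(found)                   # 415-555-4242
```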

Basic Patterns¶

The power of regular expressions is that they can specify patterns, not just fixed characters. Here are the most basic patterns which match single chars:

a, X, 9, < -- ordinary characters just match themselves exactly. The meta-characters which do not match themselves because they have special meanings are: . ^ $ * + ? { [ ] \ | ( ) (details below)
. (a period) -- matches any single character except newline '\n'
+ -- match 1 or more occurrences of the pattern to its left, e.g. 'i+' = one or more i's
* -- match 0 or more occurrences of the pattern to its left
? -- match 0 or 1 occurrences of the pattern to its left
\w -- (lowercase w) matches a "word" character: a letter or digit or underbar [a-zA-Z0-9_]. Note that although "word" is the mnemonic for this, it only matches a single word char, not a whole word. \W (upper case W) matches any non-word character.
\b -- boundary between word and non-word
\s -- (lowercase s) matches a single whitespace character -- space, newline, return, tab, form feed [ \n\r\t\f]. \S (upper case S) matches any non-whitespace character.
\t, \n, \r -- tab, newline, return
\d -- decimal digit [0-9] (some older regex utilities do not support \d, but they all support \w and \s)
^ = start, $ = end -- match the start or end of the string
\ -- inhibit the "specialness" of a character. So, for example, use \. to match a period or \\ to match a backslash. If you are unsure if a character has special meaning, such as '@', you can put a backslash in front of it, \@, to make sure it is treated just as a character.
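
The basic patterns listed above can be tried one at a time. The sketch below uses a small helper, first(), which is a hypothetical convenience function written for this example; the sample strings are also illustrative.

```python
import re

def first(pattern, text):
    """Hypothetical helper: return the first match as a string, or None."""
    match = re.search(pattern, text)
    return match.group() if match else None

# . matches any single character except a newline
print(first(r'..g', 'piiig'))            # iig
# \d matches a digit, \w matches a word character
print(first(r'\d\d\d', 'p123g'))         # 123
print(first(r'\w\w\w', '@@abcd!!'))      # abc
# \s matches a single whitespace character
print(first(r'\d\s\d', 'x1 2x'))         # 1 2
# ^ anchors the match at the start of the string
print(first(r'^b\w+', 'foobar'))         # None
print(first(r'b\w+', 'foobar'))          # bar
# \. matches a literal period instead of "any character"
print(first(r'\w+\.\w+', 'see xyz.com')) # xyz.com
# \b requires a word boundary, so the 'cat' inside 'concat' is skipped
print(re.search(r'\bcat\b', 'concat cat').start())  # 7
```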

>>> import re
>>> match = re.search(r'\d\d\d-\d\d\d-\d\d\d\d','My number is 415-555-4242.')
>>> match.group()
'415-555-4242'

Adding parentheses will create groups in the regex: (\d\d\d)-(\d\d\d-\d\d\d\d). Then you can use the group() match object method to grab the matching text from just one group.

>>> match = re.search(r'(\d\d\d)-(\d\d\d-\d\d\d\d)','My number is 415-555-4242.')
>>> match.group()
'415-555-4242'
>>> match.group(1)
'415'
>>> match.group(2)
'555-4242'

Matching multiple alternatives with the pipe: the pattern r'First|Last' will match either First or Last.

>>> match = re.search(r'First|Last','My First name is Jon')
>>> match.group()
'First'
>>> match = re.search(r'First|Last','My Last name is Dam')
>>> match.group()
'Last'

By using the pipe character and grouping parentheses, you can specify several alternative patterns you would like your regex to match.

>>> mo = re.search(r'Bat(man|mobile|copter|bat)','Batmobile lost a wheel')
>>> mo.group()
'Batmobile'

To match a pattern optionally, that is, to find a match whether or not that bit of text is there, use ?

>>> match = re.search(r'Bat(wo)?man','The Adventures of Batman')
>>> match.group()
'Batman'
>>> match = re.search(r'Bat(wo)?man','The Adventures of Batwoman')
>>> match.group()
'Batwoman'

>>> match = re.search(r'jeev?a', 'jeeva')
>>> match.group()
'jeeva'
>>> match = re.search(r'jeev?a', 'jeea')
>>> match.group()
'jeea'

Matching Zero or More with the Star¶

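The * (called the star or asterisk) means “match zero or more”: the character or group that precedes the star can occur any number of times in the text, including not at all.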

>>> match = re.search(r'Bat(wo)*man','The Adventures of Batman')
>>> match.group()
'Batman'
>>> match = re.search(r'Bat(wo)*man','The Adventures of Batwowowowoman')
>>> match.group()
'Batwowowowoman'
>>> match = re.search(r'jeev*a', 'jeevvvvvvvvvvvvvvvvvvvvva')
>>> match.group()
'jeevvvvvvvvvvvvvvvvvvvvva'
>>> match = re.search(r'\d\s*\d\s*\d', 'xx123xx')  # digit + 0 or more whitespace + digit + 0 or more whitespace + digit
>>> match.group()
'123'

Matching One or More with the Plus¶

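While * means “match zero or more,” the + (or plus) means “match one or more.” Unlike the star, the group preceding a plus must appear at least once. In the last example below, 'Batman' contains no 'wo', so search() returns None and calling group() on it raises an AttributeError.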

>>> match = re.search(r'Bat(wo)+man','The Adventures of Batwoman')
>>> match.group()
'Batwoman'
>>> match = re.search(r'Bat(wo)+man','The Adventures of Batwowowowoman')
>>> match.group()
'Batwowowowoman'
>>> match = re.search(r'Bat(wo)+man','The Adventures of Batman')
>>> match.group()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'

Matching Specific Repetitions with Curly Brackets¶

If you have a group that you want to repeat a specific number of times, follow the group in your regex with a number in curly brackets. For example, the regex (Ha){3} will match the string 'HaHaHa', but it will not match 'HaHa', since the latter has only two repeats of the (Ha) group.

Instead of one number, you can specify a range by writing a minimum, a comma, and a maximum in between the curly brackets. For example, the regex (Ha){3,5} will match 'HaHaHa', 'HaHaHaHa', and 'HaHaHaHaHa'.

You can also leave out the first or second number in the curly brackets to leave the minimum or maximum unbounded. For example, (Ha){3,} will match three or more instances of the (Ha) group, while (Ha){,5} will match zero to five instances. Curly brackets can help make your regular expressions shorter. In each pair below, the two regular expressions match identical patterns:

(Ha){3}
(Ha)(Ha)(Ha)

(Ha){3,5}
((Ha)(Ha)(Ha))|((Ha)(Ha)(Ha)(Ha))|((Ha)(Ha)(Ha)(Ha)(Ha))

(Ha){,2}
()|(Ha)|((Ha)(Ha))
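
The curly-bracket forms described above can be tried with re.search(). This is a short sketch; the variable names are chosen only for this example.

```python
import re

# Exactly three repeats are required.
exactly_three = re.search(r'(Ha){3}', 'HaHaHa')
too_few = re.search(r'(Ha){3}', 'HaHa')
# Between three and five repeats.
three_to_five = re.search(r'(Ha){3,5}', 'HaHaHaHa')
# Three or more repeats (no upper bound).
at_least_three = re.search(r'(Ha){3,}', 'HaHaHaHaHaHa')
# Zero to two repeats (no lower bound).
at_most_two = re.search(r'(Ha){,2}', 'HaHaHa')

print(exactly_three.group())    # HaHaHa
print(too_few)                  # None
print(three_to_five.group())    # HaHaHaHa
print(at_least_three.group())   # HaHaHaHaHaHa
print(at_most_two.group())      # HaHa
```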

Greedy and Nongreedy Matching¶

Python’s regular expressions are greedy by default, which means that in ambiguous situations they will match the longest string possible. The non-greedy version of the curly brackets, which matches the shortest string possible, has the closing curly bracket followed by a question mark.

>>> match = re.search(r'(Ha){3,5}','HaHaHaHaHa')
>>> match.group()
'HaHaHaHaHa'
>>> match = re.search(r'(Ha){3,5}?','HaHaHaHaHa')
>>> match.group()
'HaHaHa'
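
The same trailing question mark also makes * and + non-greedy. A brief sketch follows, using a made-up HTML-like string to show the difference.

```python
import re

text = '<b>foo</b> and <i>so on</i>'

# Greedy: .* runs as far as it can, so the match ends at the last '>'.
greedy = re.search(r'<.*>', text).group()
# Non-greedy: .*? stops at the first '>' that lets the pattern succeed.
lazy = re.search(r'<.*?>', text).group()

print(greedy)  # <b>foo</b> and <i>so on</i>
print(lazy)    # <b>
```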

The findall() Method¶

In addition to search(), the re module also has a findall() function (compiled Regex objects offer the same thing as a method). While search() returns a match object for the first matched text in the searched string, findall() returns a list with the string of every match in the searched string, as long as the pattern contains no groups.

>>> match = re.search(r'\d\d\d-\d\d\d-\d\d\d\d','Cell: 415-555-9999 Work: 212-555-0000')
>>> match.group()
'415-555-9999'
>>> re.findall(r'\d\d\d-\d\d\d-\d\d\d\d','Cell: 415-555-9999 Work: 212-555-0000')
['415-555-9999', '212-555-0000']

Sample email matching:¶

email = 'john-doe@xyz.com'
re.search(r'\w+@\w',email) ==> 'doe@x'
re.search(r'\w+@\w+',email) ==> 'doe@xyz'
re.search(r'[\w-]+@\w+',email) ==> 'john-doe@xyz'
re.search(r'[\w-]+@[\w.]+',email) ==> 'john-doe@xyz.com'
>>> match = re.search(r'([\w-]+)@([\w.]+)',email)
>>> match.group()
'john-doe@xyz.com'
>>> match.group(1)
'john-doe'
>>> match.group(2)
'xyz.com'

findall With Files¶

For files, you may be in the habit of writing a loop to iterate over the lines of the file, and you could then call findall() on each line. Instead, let findall() do the iteration for you -- much better! Just feed the whole file text into findall() and let it return a list of all the matches in a single step (recall that f.read() returns the whole text of a file in a single string):

# Open the file; the with block closes it automatically when done
with open('test.txt', 'r') as f:
    # Feed the file text into findall(); it returns a list of all the found strings
    strings = re.findall(r'some pattern', f.read())
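
For a version that can be run as-is, the sketch below first writes a small throwaway file and then feeds its whole text to findall(). The file contents and the phone-number pattern are assumptions made for the example.

```python
import os
import re
import tempfile

# Create a small temporary file so the example is self-contained.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('Cell: 415-555-9999\nWork: 212-555-0000\n')
    path = tmp.name

# Read the whole file as one string and let findall() scan all of it.
with open(path, 'r') as f:
    numbers = re.findall(r'\d\d\d-\d\d\d-\d\d\d\d', f.read())

os.remove(path)  # clean up the temporary file
print(numbers)   # ['415-555-9999', '212-555-0000']
```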

Substitution (optional)¶

The re.sub(pat, replacement, str) function searches for all the instances of pattern in the given string, and replaces them. The replacement string can include '\1', '\2' which refer to the text from group(1), group(2), and so on from the original matching text.

Here's an example which searches for all the email addresses, and changes them to keep the user (\1) but have yo-yo-dyne.com as the host.

str = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'
## re.sub(pat, replacement, str) -- returns new string with all replacements,
## \1 is group(1), \2 group(2) in the replacement
print(re.sub(r'([\w\.-]+)@([\w\.-]+)', r'\1@yo-yo-dyne.com', str))
## purple alice@yo-yo-dyne.com, blah monkey bob@yo-yo-dyne.com blah dishwasher
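
The replacement string can also use both groups at once. As a small sketch, the example below rewrites each address as "user at host"; the variable names text and spelled_out are chosen only for this illustration.

```python
import re

text = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'

# \1 is the user part (group 1) and \2 is the host part (group 2).
spelled_out = re.sub(r'([\w.-]+)@([\w.-]+)', r'\1 at \2', text)

print(spelled_out)
# purple alice at google.com, blah monkey bob at abc.com blah dishwasher
```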

To collapse runs of whitespace into a single space:

>>> a = '  This is a     test       line        '
>>> print(re.sub(r'\s+',' ',a))
 This is a test line

Using a space itself instead of \s:

>>> re.sub(' +',' ', a)
' This is a test line '