Smith College Computer Science 250a, Fall 2007

Pattern Matching Lab

October 10, 2007

Boot the machine in the Linux Operating System.


Log in using your 250a class account. You have now logged into the local computer you are on, but have access to files on the Computer Science machine called beowulf.

run the Python Interpreter
To start the Python interpreter, just type in the command
python and hit Enter
You should obtain something like this:

Python 2.2.2 (#1, Feb 24 2003, 19:13:11) 
[GCC 3.2.2 20030222 (Red Hat Linux 3.2.2-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> 
where >>> is the Python interpreter's prompt. The prompt is prompting you for a Python command, or statement. Try this:
>>> print "Hello, World!"
and also this:
>>> print 5+7
and this:
>>> print 16+32, "is the sum of 16 and 32"
Next, check out how easy string concatenation is in python with the plus (+) operator (+ is not or in python, as it is in our 250 textbook):
>>> x = "computer"
>>> y = "science"
>>> z = x + " " + science
>>> print z

Exit from the Python Intepreter

Exit from the Python interpreter by typing Ctrl-D.

Python's re module:
Regular expressions (or REs) are essentially a tiny, highly specialized programming language embedded inside Python and made available through the re module. Using this little language, you specify the rules for the set of possible strings that you want to match; this set might contain English sentences, or e-mail addresses, or TeX commands, or anything you like. You can then ask questions such as ``Does this string match the pattern?'', or ``Is there a match for the pattern anywhere in this string?''. You can also use REs to modify a string or to split it apart in various ways. Most letters and characters will simply match themselves. For example, the regular expression

test
will match the string "test" exactly.
Meta-Characters: characters that are part of the re alphabet, but don't match themselves. Instead, they have special meaning to the re interpreter. The * is very much like the * we have seen in our theoretical regular expressions. In practice the re module allows other meta-characters. They are:
. ^ $ * + ? { [ ] \ | ( )
Meta-character Highlights: How to Use re
Make sure that you understand these examples, discussing them with each other. Go to How to Use Regular Expressions in Python:
Here is a nice tutorial on using regular expressions in python: http://www.python.org/doc/howto/
Another good source for Regular Expression Syntax is from the Python Library Reference on-lins at http://docs.python.org/lib/re-syntax.html Some highlights from it are given below:


Groups
Groups are marked by the "(", ")" metacharacters. "(" and ")" have much the same meaning as they do in mathematical expressions; they group together the expressions contained inside them. For example, you can repeat the contents of a group with a repeating qualifier, such as *, +, ?, or {m,n}. For example, (ab)* will match zero or more repetitions of "ab".
>>> m = re.search('(ab)*', 'ab')
>>> m.group()
'ab'
>>> m.group(0)
'ab'
Subgroups are numbered from left to right, from 1 upward. Groups can be nested; to determine the number, just count the opening parenthesis characters, going from left to right.
>>> m = re.search('(a(b)c)d', 'abcd')
>>> m.group(0)
'abcd'
>>> m.group(1)
'abc'
>>> m.group(2)
'b'
Backreferences
Backreferences in a pattern allow you to specify that the contents of an earlier capturing group must also be found at the current location in the string. For example, \1 will succeed if the exact contents of group 1 can be found at the current position, and fails otherwise. Remember that Python's string literals also use a backslash followed by numbers to allow including arbitrary characters in a string, so be sure to use a raw string when incorporating backreferences in a RE. For example, the following RE detects doubled words in a string.
>>> p = r'(\b\w+)\s+\1'
>>> m = re.search(p, 'Paris in the the spring')
>>> m.group()
'the the'
where \b means Word boundary (non alphanumeric) and \w means word (alphanumeric character, i.e. equivalent to the class [a-zA-Z0-9_].) Backreferences like this aren't often useful for just searching through a string -- there are few text formats which repeat data in this way -- but you'll soon find out that they're very useful when performing string substitutions. Why wouldn't this pattern work on 'Paris in thethe spring' ?

Lookahead To read about lookahead, go to the web site http://www.regular-expressions.info/lookaround.html and scroll down to the section called Positive and Negative Lookahead Negative lookahead is indispensable if you want to match something not followed by something else.

Examples to Consider

The link below gives examples. Before reading them over, use emacs to open a file
emacs lab250.py
and in emacs type in the following.:
import re     # import the reg expr module functions

def main():   # define function called main
    a = "I like Red Hat Linux for development."
    m = re.search(r"(.{7})Linux", a)  # Any 7 characters then string 'Linux'
    b = m.group(1)   # The first group in parentheses
    print "<" + b + ">"
    print m.group(0)  # The whole match
Save and exit emacs then invoke python to call the function:
jfrankli-PB1:~ jfrankli$ python
Python 2.3.5 (#1, Mar 20 2005, 20:38:20) 
[GCC 3.3 20030304 (Apple Computer, Inc. build 1809)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import lab250
>>> lab250.main()
<ed Hat >
ed Hat Linux
>>> 
You can also import and run re functions in the python interpreter:

jfrankli-PB1:~ jfrankli$ python
Python 2.3.5 (#1, Mar 20 2005, 20:38:20) 
[GCC 3.3 20030304 (Apple Computer, Inc. build 1809)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import re     # import the reg expr module functions
>>> a = "I like Red Hat Linux for development."
>>> m = re.search(r"(.{7})Linux", a)
>>> b = m.group(1)
>>> print "<" + b + ">"
<ed Hat >
>>> print m.group(0)
'ed Hat Linux'
Now consider the examples from the free version of RedHat Linux6Unleashed http://freebooks.by.ru/view/RedHatLinux6Unleashed/rhl6u350.htm. Start half way down with the discussion on Regular Expressions. We'll go through this together. Now go through this example again and see if you remember what it does.
>>> a = r"Valentines is 2/14/2000. Don\'t forget!"
>>> import re
>>> m = re.search(r"\D\d{1,2}/\d{1,2}/\d{2,4}\D", a)
>>> m.group(0)
How about this one?
>>> s = "My, how are you"
>>> m = re.search("^how are you$", s)
>>> 
Why didn't this match? What can you do to make it match, including the question mark?

Here's one that makes me love python:
import re
def simul():
    a = "11/21/1999"
    m = re.search(r"^([01]?\d)[/-]([0123]?\d)[/-](\d{2,4})", a)
    month,day,year = m.groups()
    print year, month, day
Make sure you understand this before moving on.

Now try this one. First get the test file either at the beowulf prompt:
beowulf$ getcopy test.tst
or download test.tst here. Then in emacs in your file lab250.py, add this function definition:
def brack():
    infile = open("test.tst", "r")
    x = infile.readline()   # read one line from file
    while x != "":
        x = x[:-1]          # slice off newline char at end of string
        m = re.search(r"\[\s*(.*)\s*\]", x)  # treat bracket as just bracket char
        if m:
            print m.group(1) # just print inbside group
        x = infile.readline()
    infile.close()
Figure out why the lines that matched did by examining test.tst.

Now, we will go through some of these examples together:

http://www.regular-expressions.info/index.html is the main page, with menu on the left.

http://www.regular-expressions.info/tutorial.html

The following is one of the examples.
def email(addr):
    b = r"^[A-Z0-9._%-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$"
    m = re.search(b, addr, re.IGNORECASE)   # allow upper or lower case
    print m.group(0)
\b means word boundary (anything that is not alphanumeric or underscore qualifies).
[A-Z0-9._%-]+ means one or more of the character class inside the square brackets.
@ just means there has to be an at sign next.
[A-Z0-9._%-]+ appears again, followed by \. which is how we get a period and avoid having it be a metacharacter. Lastly the [A-Z]{2,4} requires between 2 and 4 characters from A-Z and the \b makes it another word boundary.


This site below has examples of matching html tags, trimming white space, matching IP addresses, etc.: http://www.regular-expressions.info/examples.html. We examine one of its examples below.
>>> a = r'<html\b[^>]*>(.*?)</html>'
>>> m = re.search(a,'<html><H1>Test Page</H1><p>This is just a test</p></html>')
>>> m
<_sre.SRE_Match object at 0x96860>
>>> m.group()
'<html><H1>Test Page</H1><p>This is just a test</p></html>'
>>> m.group(0)
'<html><H1>Test Page</H1><p>This is just a test</p></html>'
>>> m.group(1)
'<H1>Test Page</H1><p>This is just a test</p>'
>>> m.group(2)
Traceback (most recent call last):
  File "", line 1, in ?
IndexError: no such group
Using backreference:
>>> a = r'<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>'
>>> a
'<([A-Z][A-Z0-9]*)\\b[^>]*>(.*?)</\\1>'
>>> b = '<html><H1>Test Page</H1><p>This is just a test</p></html>'
>>> b
'<html><H1>Test Page</H1><p>This is just a test</p></html>'
>>> m = re.search(a,b,re.IGNORECASE)
>>> m
<_sre.SRE_Match object at 0x741d0>
>>> m.groups()
('html', '<H1>Test Page</H1><p>This is just a test</p>')
>>> 

Other Links

For a very succinct description of regular expressions in python, go to
Syntax of Regular Expressions in Python:
http://docs.python.org/lib/re-syntax.html
There is another tutorial at
http://gnosis.cx/TPiP/chap3.txt with many examples.

Don't forget to logout.