Wednesday, June 17, 2009

What to do when lazy matching and lookaheads won't work

You want to use regexes to search for a thing, then anything, then another thing.

Here's some example data to search through.

5678 Appletree Ave, MI
6701 Buttertown Road, MA
1337 Lolcat Lane, NM
5691 Appletree Way, MI
7832 Appletree Terrace, CT
4935 Appletree Motorway, MI

What you want to find here is any street named "Appletree" that is in Michigan.

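The first thing most people try is .* or .+, but those are greedy: they grab as much text as they can, which isn't what you want. If your tool also lets . match line breaks, the regex "Appletree.*MI" runs on to the last MI in the data, and the Appletree addresses it covers are: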
5678 Appletree Ave, MI
5691 Appletree Way, MI
7832 Appletree Terrace, CT
4935 Appletree Motorway, MI

You didn't want the Connecticut address in there. The easiest fix is to make the .* lazy by writing ".*?" instead of ".*". A lazy match stops at the first MI it can reach instead of the last, which is usually what you want.
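
To see the difference between greedy and lazy matching, here is a minimal sketch in Python. It assumes the sample addresses are searched as one block of text and that . is allowed to match line breaks (re.DOTALL in Python); your own tool may behave differently. The names DATA, greedy and lazy are just for illustration.

```python
import re

# The sample addresses from the post, held as one block of text.
DATA = """5678 Appletree Ave, MI
6701 Buttertown Road, MA
1337 Lolcat Lane, NM
5691 Appletree Way, MI
7832 Appletree Terrace, CT
4935 Appletree Motorway, MI"""

# Assumption: "." may match line breaks, which re.DOTALL switches on.
# Greedy: ".*" grabs everything, then backs up to the LAST "MI".
greedy = re.search(r"Appletree.*MI", DATA, re.DOTALL).group()

# Lazy: ".*?" grows one character at a time and stops at the FIRST "MI".
lazy = re.search(r"Appletree.*?MI", DATA, re.DOTALL).group()

print(greedy)  # runs from the first Appletree to the end of the data
print(lazy)    # just the first Michigan address
```

The greedy match swallows every line after the first Appletree, including the Connecticut one, while the lazy match ends as soon as it finds an MI.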

However, some regex engines don't support lazy matching. In that case you can use a negated character class instead.

The first regex you might try is: Appletree[^M]+MI

In English, that's: match Appletree, then one or more characters that aren't a capital M, then MI. This regex will miss Appletree Motorway, because the M in Motorway ends the [^M]+ run and isn't followed by an I, but it's certainly better than not being able to search at all.

At this point you could modify your regex to exclude a character that never appears in the in-between bit, like the comma. The new regex would be: Appletree[^,]+,\sMI and it would find exactly what you wanted.
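
Both negated character class patterns can be tried against the sample data with a short Python sketch. Python's re module does support lazy matching, so it is used here only as a convenient way to run the patterns; the variable names are illustrative.

```python
import re

# The sample addresses from the post, held as one block of text.
DATA = """5678 Appletree Ave, MI
6701 Buttertown Road, MA
1337 Lolcat Lane, NM
5691 Appletree Way, MI
7832 Appletree Terrace, CT
4935 Appletree Motorway, MI"""

# First attempt: anything except a capital M, then MI.
# The M in "Motorway" stops the run too early, so that address is missed.
not_m = re.findall(r"Appletree[^M]+MI", DATA)

# Second attempt: anything except a comma, then ", MI".
# The comma never appears inside a street name, so nothing is cut short.
not_comma = re.findall(r"Appletree[^,]+,\sMI", DATA)

print(not_m)      # two Michigan addresses, Motorway missing
print(not_comma)  # all three Michigan addresses, no Connecticut
```

Neither pattern needs a lazy quantifier, which is the point: the character class itself keeps the match from running past the part you care about.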
