Regular Expression Searches


Support for regular expressions is common in text editing software and several programming languages. When the option to use them is enabled in a search dialog, certain character sequences in the Find what field are treated as wildcards, or codes that represent classes of characters rather than specific characters. The codes are formed with the help of metacharacters belonging to the sequence \ ^ $ . | [ ] ( ) { } ? + *. Another useful term is regex, the short name for regular expression. Unfortunately, not all regex search engines support the same codes. A good introduction to the Perl 5 flavor recognized by Walls is Jan Goyvaerts' tutorial.

The examples below are not an adequate introduction but they can at least serve as a memory refresher. They also demonstrate the use of codes that another search engine might handle differently. You can try them out on a Walls data file or project branch by cutting and pasting them into a search dialog.

Suppose we want to find all vertical shots in our survey data assuming that the vertical angle measurement is the fifth item on a line:

Find what: ^\s*(\S+\s+){4}[+-]?90(\.0*)?(\s|$)

We've constructed this pattern by using appropriate wildcards for the character sequences we want it to match -- sequences that appear contiguously left-to-right on a line. To match the fifth line item we must actually begin matching characters at the line's start:

•	The circumflex (^) matches the start-of-line condition.

•	\s* matches zero or more consecutive whitespace characters (tabs and spaces). (\t would have matched a single tab character.)

•	(\S+\s+){4} matches four repetitions of one or more non-whitespace characters (\S+) followed by one or more whitespace characters (\s+).

•	[+-]? matches one instance or zero instances of one of the characters in brackets, in this case either + or -.

•	90(\.0*)? matches one of 90, 90., 90.0, 90.00, etc. The decimal point is escaped with a backslash since a period by itself is a metacharacter that matches any character.

•	(\s|$) matches either a whitespace character (\s) or the end-of-line condition ($). The vertical bar (|) separates alternatives. (For example, (Bill|William) would match either Bill or William.)
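The breakdown above can be checked outside of Walls with a short script. This is only a sketch: it uses Python's re module rather than the PCRE library, on the assumption that the two behave alike for the codes in this pattern, and the data lines are invented for illustration.

```python
import re

# The pattern from the text: the fifth whitespace-separated item is 90,
# optionally signed and optionally followed by a decimal point and zeros.
VERTICAL = re.compile(r"^\s*(\S+\s+){4}[+-]?90(\.0*)?(\s|$)")

# Hypothetical data lines: from, to, distance, azimuth, inclination.
lines = [
    "A1 A2 12.5 270 -90",           # vertical, signed
    "A2 A3 3.20 15.5 90.00 ; pit",  # vertical, trailing zeros and a comment
    "  A3\tA4  7.1  90  +90.",      # vertical, mixed tabs and spaces
    "A4 A5 8.0 90 45",              # 90 is the azimuth, not the inclination
    "A5 A6 8.0 45 90.5",            # 90.5 is not vertical
    "A6 A7 8.0 45 900",             # 900 merely starts with 90
]

vertical_shots = [ln for ln in lines if VERTICAL.search(ln)]

for ln in vertical_shots:
    print(repr(ln))
```

Only the first three lines are selected. The last three fail because the anchored prefix forces 90 to be the fifth item and the trailing (\s|$) forbids any further digits.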

The pattern is complex because we're trying to capture all possible formats for a vertical shot while eliminating non-qualifying matches. Although not used in this example, a few additional codes besides \s and \S are available for matching generic classes of characters. \d and \D match digits and non-digits, respectively. \w and \W match word and non-word characters, where a word is any mixture of digits, letters, and the underscore character (_). Some codes match conditions that exist before or after an examined character. \b asserts the presence of a word boundary while \B asserts the lack of one. As with the line start (^) and line end ($) conditions, no character is matched, or "consumed", with \b or \B. In Walls dialogs the option Match whole words is equivalent to prefixing and suffixing the target string with \b.
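A small sketch of these class and boundary codes follows. It uses Python's re module as a stand-in for the PCRE engine (an assumption; the codes shown are common to both), and the sample text is made up.

```python
import re

text = "shot A12 to A12B, see cave_map"

# \b asserts a word boundary; it consumes no character.
whole = re.findall(r"\bA12\b", text)   # A12 standing alone as a word
# \B asserts that there is NO boundary after A12, so another
# word character must follow.
inside = re.findall(r"\bA12\B", text)  # the A12 at the start of A12B

digits = re.findall(r"\d+", text)      # runs of digits
words = re.findall(r"\w+", text)       # runs of letters, digits, underscore

print(whole, inside, digits, words)
```

Both searches return the three characters A12, which shows that neither \b nor \B adds anything to the matched text. The first is the kind of pattern the Match whole words option is described as producing.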

The element [+-] in this example deserves further explanation. A bracketed sequence of characters, called a character class, matches any one character belonging to the set. Within such a sequence most characters are interpreted literally, not as metacharacters. Exceptions are ^, -, and \, depending on their placement. For example, the character class [a-zA-Z] will match any alphabetic character because the minus signs are placed so as to define character ranges. In the character class [+-] neither the plus sign nor the minus sign has a special meaning. Codes that match conditions, like \b, are not recognized as such in a character class, but codes representing characters, like \s, have the same meaning inside brackets as they do outside. When a circumflex (^) immediately follows the opening bracket, a negated character class is formed, one that matches any character as long as it's not one of the remaining characters in brackets.
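The character class rules can be illustrated with a few one-line searches. This is a sketch using Python's re module, assumed here to treat these classes the same way PCRE does. The sample strings are invented.

```python
import re

sample = "x+y-z ^ 42"

# Neither + nor - is special here: the minus sign comes last, so it
# cannot define a range.
signs = re.findall(r"[+-]", sample)

# Minus signs between characters define ranges.
letters = re.findall(r"[a-zA-Z]", sample)

# A leading ^ negates the class: anything except +, -, or whitespace.
others = re.findall(r"[^+\-\s]+", sample)

# \s keeps its meaning inside brackets.
separators = re.findall(r"[\s,]+", "A1, A2\tA3")

# \b is not a word-boundary assertion inside brackets (in Python it
# stands for the backspace character), so nothing matches here.
boundary_in_class = re.search(r"[\b]", "A1 A2")

print(signs, letters, others, separators, boundary_in_class)
```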

The Walls search function is implemented so that it looks for matching text that resides entirely within individual lines. The expression ^$, however, won't match an empty line since empty lines are completely ignored. The next example selects the two station names of a vector data line, presumably the first two items, while skipping over line comments and directives. The replacement string swaps name positions and illustrates how a replacement string can be constructed from portions of the matched text:

Find what: ^(\s*)([^;#]\S*)(\s+)(\S+)

Replace with: \1\4\3\2

The subexpression [^;#] is a negated character class that matches anything but a semicolon or pound sign. On a line that begins with a station name it can't match a whitespace character either, since any leading whitespace will have been consumed by the preceding \s*. (On an indented comment line, however, backtracking allows \s* to give up one whitespace character to the class, so such a line can still match; writing the class as [^;#\s] prevents this.) The sets of parentheses in the whole expression are needed to form groups that can be referenced in the replacement string. Up to nine parenthesized groups, possibly nested and counted left to right by their opening parentheses, can be referenced using codes \1 through \9. (The entire matched text, the selected string, is referenced by \0.) Similarly, a group can be backreferenced in the regular expression itself.
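The name swap can be tried with the sketch below. It uses Python's re module rather than Walls itself, and the data lines and the helper name swap_names are hypothetical. Python accepts \1 through \9 in a replacement string just as described above, but it spells the whole-match reference \g<0> rather than \0.

```python
import re

# Groups: 1 = leading whitespace, 2 = first name,
#         3 = separator, 4 = second name.
SWAP = re.compile(r"^(\s*)([^;#]\S*)(\s+)(\S+)")

def swap_names(line):
    """Exchange the first two items of a line (hypothetical helper)."""
    return SWAP.sub(r"\1\4\3\2", line)

swapped = swap_names("A1 A2 12.5 270 -3")
indented = swap_names("  B7\tB8 4.0 10 0")
comment = swap_names("; comment line")   # no match, returned unchanged
directive = swap_names("#units feet")    # no match, returned unchanged

# Whole-match reference in a Python replacement string.
bracketed = re.sub(r"A\d+", r"[\g<0>]", "A1 A2")

# A backreference inside the pattern itself: lines whose first two
# items are identical.
same_name = re.search(r"^\s*(\S+)\s+\1(\s|$)", "A1 A1 0 0 0")

print(swapped, indented, comment, directive, bracketed, sep="\n")
```

Because the leading whitespace and the separator are captured as groups 1 and 3, the replacement preserves the original spacing and only the names change places.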

The final example is the sort of operation you might conceivably find useful when working with survey data:

Find what: (^|[\s,:;])A(\d)

Replace with: \1AB\2

This would change A-prefixed names (A1, A321, etc.) to AB-prefixed names (AB1, AB321, etc.) provided their positions in the file are plausible for a station name. The expression (^|[\s,:;]) ensures that immediately preceding the letter A is either the start-of-line condition or a character in the bracketed set (whitespace, comma, colon, or semicolon). The (\d) ensures that a digit immediately follows A, with the parentheses allowing it to be referenced in the replacement string as \2.
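The prefix change can be sketched the same way. Python's re module again stands in for the Walls search dialog, and the helper name rename_prefix and the sample lines are hypothetical.

```python
import re

# Group 1 = start of line or one separator character,
# group 2 = the digit that follows A.
RENAME = re.compile(r"(^|[\s,:;])A(\d)")

def rename_prefix(line):
    """Change A-prefixed names to AB-prefixed names (hypothetical helper)."""
    return RENAME.sub(r"\1AB\2", line)

plain = rename_prefix("A1 A321 12.0 90 0")
listed = rename_prefix("A1,A2")
embedded = rename_prefix("BA1 A2")   # the A inside BA1 is left alone
no_digit = rename_prefix("AX1 3.0")  # A is not followed by a digit

print(plain, listed, embedded, no_digit, sep="\n")
```

Restoring \1 and \2 in the replacement matters: the separator and the digit are part of the matched text, so leaving them out would delete them.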

Walls takes advantage of the excellent code library, PCRE (Perl Compatible Regular Expressions), written by Philip Hazel. The documentation for PCRE contains the most complete description of the regex flavor supported by Walls.