RegEx

From Xojo Documentation

Class (inherits from Object)

Performs search and replace operations using Perl-compatible regular expressions. The RegEx class uses version 8.33 of the PCRE library.

Properties
Options SearchPattern
ReplacementPattern SearchStartPosition


Methods
Replace Search

Notes

This section describes the syntax of regular expressions.

Pattern Description
. Matches any character except newline.
[a-z0-9] Matches any single character of set.
[^a-z0-9] Matches any single character not in set.
\d Matches a digit. Same as [0-9].
\D Matches a non-digit. Same as [^0-9].
\w Matches an alphanumeric (word) character - [a-zA-Z0-9_].
\W Matches a non-word character [^a-zA-Z0-9_].
\s Matches a whitespace character (space, tab, newline, etc.).
\S Matches a non-whitespace character.
\n Matches a newline (line feed).
\r Matches a return.
\t Matches a tab.
\f Matches a formfeed.
\b Matches a backspace (inside [] only).
\0 Matches a null character.
\000 Also matches a null character because of the following:
\nnn Matches an ASCII character of that octal value.
\xnn Matches an ASCII character of that hexadecimal value.
\cX Matches an ASCII control character.
\metachar Matches the meta-character (e.g., \, .).
(abc) Used to create subexpressions. Remembers the match for later backreferences. Referenced by replacement patterns that use \1, \2, etc.
\1, \2,… Matches whatever the first (second, and so on) set of parentheses matched.
x? Matches 0 or 1 x's, where x is any of the above.
x* Matches 0 or more x's.
x+ Matches 1 or more x's.
x{m,n} Matches at least m x's, but no more than n.
abc Matches all of a, b, and c in order.
a|b|c Matches one of a, b, or c.
\b Matches a word boundary (outside [] only).
\B Matches a non-word boundary.
^ Anchors match to the beginning of a line or string.
$ Anchors match to the end of a line or string.


Replacement Patterns

The following expressions apply only to the replacement pattern:

Pattern Description
$` Replaced with the entire target string before match.
$& The entire matched area; this is identical to \0 and $0.
$' Replaced with the entire target string following the matched text.
$0-$50, \0-\50 Replaced with the text matched by the corresponding subexpression. Evaluates to nothing if the subexpression corresponding to the number doesn't exist.
\xnn Replaced with the character represented by nn in hexadecimal.
\nnn Replaced with the character represented by nnn in octal.
\cX Replaced with the character that is the control version of X, e.g., \cP is DLE, data link escape.

Double-byte Systems

If you are working with a double-byte system such as Japanese, RegEx cannot operate on the characters directly. You should first convert all double-byte text to UTF-8 using the built-in Text Converter functions. See the TextConverter class for an example of how to use them.

All text that will be processed by RegEx should be converted. This includes SearchPattern, ReplacementPattern, and TargetString. The result of the Search or Search and Replace will be a UTF-8 string, so you will need to convert it back to its original form using the Text Converter functions. Both Search and Search and Replace operations work on all platforms, provided that this conversion takes place.

Regular Expression Examples

The basic idea of regular expressions is that they enable you to find and replace text that matches a set of conditions you specify. They extend normal Search and Replace with pattern searching.


Wildcards

Some special characters are used to match a class of characters:

Wildcard Matches
. Any single character except a line break, including a space.

If you use "." as the search pattern, you will select the first character in the target string and, if you repeat the search, you will find each successive character, except for Return characters.
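A minimal sketch of that repeated search, assuming a made-up target string "abc" and output sent to System.DebugLog (both are illustrative choices). Calling Search with no arguments continues from where the previous match ended:

```xojo
// Illustrative only: the target string "abc" is an assumption.
Var rg As New RegEx
rg.SearchPattern = "."

// The first call supplies the target string.
Var match As RegExMatch = rg.Search("abc")

While match <> Nil
  // Subexpression 0 is the entire matched text (one character here).
  System.DebugLog(match.SubExpressionString(0))
  
  // No arguments: keep searching the same target after the last match.
  match = rg.Search()
Wend
```

The loop ends when Search returns Nil, which signals that no further match was found.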

The following wildcards match by position in a line:

Wildcard Matches Example
^ Beginning of a line (unless used in a character class; see below) ^Phone: Finds lines that begin with "Phone".
$ End of a line (unless used in a character class) $: Finds the end of the current line (a position, not a character).

Character Classes

A character class allows you to specify a set or range of characters. You can choose to match either the characters in the class or the characters not in it. The set of characters is enclosed in brackets. To match any character that is not in the class, place a caret (^) immediately after the opening bracket. Here are some examples:

Character Class Matches
[aeiou] Any one of the characters a, e, i, o, u.
[^aeiou] Any character except a, e, i, o, u.
[a-e] Any character in the range a-e, inclusive.
[a-zA-Z0-9] Any alphanumeric character. Note: Case-sensitivity is controlled by the CaseSensitive property of the RegExOptions class.
[[] Finds a [.
[]] Finds a ]. To find a closing bracket, place it immediately after the opening bracket.
[a-e^] Finds a character in the range a-e or the caret character. To find the caret character, place it anywhere except as the first character after the opening bracket.
[a-c-] Finds a character in the range a-c or the - sign. To match a -, place it at the beginning or end of the set.

Non-printing Characters

You can use the following notation to find non-printing characters:

Special Character Matches
\r Line break (return)
\n Newline (line feed)
\t Tab
\f Formfeed (page break)
\xNN Hex code NN.

Other Special Characters

The following patterns match whole classes of characters:

Special Character Matches
\s Any whitespace character (space, tab, return, linefeed, form feed)
\S Any non-whitespace character.
\w Any "word" character (a-z, A-Z, 0-9, and _)
\W Any "non-word" character (All characters not included by \w).
\d Any digit [0-9].
\D Any non-digit character.


Repetition Characters

Repetition characters are modifiers that allow you to repeat a specified pattern.

Repetition Character Matches Examples
* Zero or more characters. d* finds no characters, or one or more consecutive "d"s.

.* finds an entire line of text, up to but not including the return character.

+ One or more characters. d+ finds one or more consecutive "d"s.

[0-9]+ finds a string of one or more consecutive digits, such as "90404", "1938", the "32" in "Win32", etc.

? Zero or one characters. d? finds no characters or one "d".

Please note that, since * and ? can match zero instances of the pattern, they always succeed but may not select any text. You can use them to specify an optional character, as in the examples in the following section.
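The repetition examples above can be tried with a short sketch like the following. The target strings and the use of System.DebugLog are assumptions chosen for illustration:

```xojo
Var rg As New RegEx
Var match As RegExMatch

// One or more consecutive digits: finds the "32" in "Win32".
rg.SearchPattern = "[0-9]+"
match = rg.Search("Win32")
If match <> Nil Then
  System.DebugLog("Digits: " + match.SubExpressionString(0))
End If

// Zero or more "d"s: the search succeeds even though the
// (hypothetical) target "abc" contains no "d", so no text is selected.
rg.SearchPattern = "d*"
match = rg.Search("abc")
If match <> Nil And match.SubExpressionString(0) = "" Then
  System.DebugLog("Matched, but selected no text")
End If
```

Checking for an empty match this way is one option when a pattern built only from * or ? quantifiers needs to be told apart from a match that actually selected text.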

"Greediness"

The "?" is also used as a "greediness" modifier for a subpattern in a regular expression. By default, greediness is controlled by the Greedy property of the RegExOptions class, but you can override it with "?": placing a "?" directly after a * or + reverses the greediness setting for that quantifier. That is, if Greedy is True, a ? after a * or + causes it to match the minimum number of times possible; if Greedy is False, it causes it to match the maximum number of times possible. For example, consider the following:

Target String Greedy Regular Expression Result
aaaa True (a+?)(a+) $1=a, $2=aaa
aaaa False (a+?)(a+) $1=aaa, $2=a
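The two rows of the table can be reproduced with a sketch along these lines. The pattern is written with no whitespace between the two groups, and logging through System.DebugLog is an illustrative choice:

```xojo
Var rg As New RegEx
rg.SearchPattern = "(a+?)(a+)"

// Run the same search with Greedy set to True, then False.
Var greedySettings() As Boolean = Array(True, False)
For Each setting As Boolean In greedySettings
  rg.Options.Greedy = setting
  
  Var match As RegExMatch = rg.Search("aaaa")
  If match <> Nil Then
    // Subexpressions 1 and 2 correspond to $1 and $2 in the table.
    System.DebugLog("Greedy=" + setting.ToString + _
      ": $1=" + match.SubExpressionString(1) + _
      ", $2=" + match.SubExpressionString(2))
  End If
Next
```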

Extension Mechanism

We also support the regular expression extension mechanism used in Perl. For instance:

(?#text) Comment
(?:pattern) For grouping without creating backreferences
(?=pattern) A zero-width positive look-ahead assertion. For example, \w+(?=\t) matches a word followed by a tab, without including the tab in $&.
(?!pattern) A zero-width negative look-ahead assertion. For example, foo(?!bar) matches any occurrence of "foo" that isn't followed by "bar".
(?<=pattern) A zero-width positive look-behind assertion. For example, (?<=\t)\w+ matches a word that follows a tab, without including the tab in $&. Works only for fixed-width look-behind.
(?<!pattern) A zero-width negative look-behind assertion. For example (?<!bar)foo matches any occurrence of "foo" that does not follow "bar". Works only for fixed-width look-behind.
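As a rough illustration of the look-ahead case, the sketch below searches a made-up, tab-separated target string (an assumption for this example) and shows that the tab is required for the match but is not part of the matched text:

```xojo
// Hypothetical target: two words separated by a tab, Chr(9).
Var target As String = "name" + Chr(9) + "value"

Var rg As New RegEx
rg.SearchPattern = "\w+(?=\t)"

Var match As RegExMatch = rg.Search(target)
If match <> Nil Then
  // The matched text is the word before the tab; the tab itself
  // is asserted by the look-ahead but not included in the match.
  System.DebugLog(match.SubExpressionString(0))
End If
```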

Subexpressions

You can use parentheses within your search patterns to isolate portions of the matched string. You do this when you need to refer to subsections of the matched string in your replacement string. For example, you would do this if you need to replace only a portion of the matched string or insert other text into the matched string.

Here is an example. If you want to match any date followed by the letters "B.C.", you can use the pattern "\d+\sB\.C\." (one or more digits, followed by a space character, followed by the letters "B.C."). This will match dates such as 33 B.C., 1742 B.C., etc. However, if you want your replacement pattern to leave the year alone but replace the letters with something else, you would use parentheses. The search pattern "(\d+)\s(B\.C\.)" does this.

When you write your replacement pattern, you can refer to the year only with the variable \1 and the letters with \2.

If you write "(\d+)\s(B\.C\.|A\.D\.|BC|AD)", then \2 would contain the matched letters.
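A minimal sketch of this idea follows. The replacement text "BCE" and the target string are assumptions picked for illustration; only the search pattern comes from the text above:

```xojo
Var rg As New RegEx

// Group 1 captures the year, group 2 captures the letters.
rg.SearchPattern = "(\d+)\s(B\.C\.)"

// Keep the year (\1) and substitute hypothetical text for the letters.
rg.ReplacementPattern = "\1 BCE"

Var result As String = rg.Replace("1742 B.C.")
System.DebugLog(result) // the year is kept, the letters are replaced
```

Because the year is referenced through \1, the replacement never needs to know which digits were actually matched.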

Combining Patterns

Much of the power of regular expressions comes from combining these elementary patterns to make up complex searches. Here are some examples:

Pattern Matches
\$?[0-9,]+\.?\d* Matches dollar amounts with an optional dollar sign.
\d+\sB\.C\. One or more digits followed by a space, followed by "B.C."

The Alternation Operator

The alternation operator (|) allows you to match any of a number of patterns using the logical "or" operator. Place it between two existing patterns to match either pattern. You can use more than one alternation operator in a pattern:

Pattern Matches
\she\s|\sshe\s " he " or " she "
cat|dog|possum "cat", "dog", or "possum"
([0-9,]+\sB\.C\.)|([0-9,]+\sA\.D\.) or [0-9,]+\s((B\.C\.)|(A\.D\.)) Years of the form "yearNum B.C." or "yearNum A.D.", e.g., "2,175 B.C." or "215 A.D."

Search and Replace

You use special patterns to represent the matched pattern. Using replacement patterns, you can append or prepend the matched pattern with other text.

Pattern Description
$& Contains the entire matched pattern.

If "\d\d\d\d\sB\.C\." finds "1541 B.C.", then the replacement pattern "the year $&" results in "the year 1541 B.C.", because $& contains the string "1541 B.C.".

\1, \2, etc. Contains the matched subpatterns, defined by use of parentheses in the search string.

The search pattern "(\d+)\s(B\.C\.|A\.D\.|BC|AD)" looks for any number of digits followed by a space character, followed by either "B.C.", "BC", "A.D.", or "AD". The \1 variable contains the match to the "\d+" portion of the expression and \2 contains the match to the "B\.C\.|A\.D\.|BC|AD" portion.
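The $& replacement described in this section could be applied to every match in a string with a sketch like this one. The target string is made up, and setting ReplaceAllMatches on the Options property is an illustrative choice (by default only the first match is replaced):

```xojo
Var rg As New RegEx
rg.SearchPattern = "\d\d\d\d\sB\.C\."

// $& stands for the entire matched text.
rg.ReplacementPattern = "the year $&"

// Replace every match rather than only the first one.
rg.Options.ReplaceAllMatches = True

// Hypothetical target containing two dates.
Var result As String = rg.Replace("1541 B.C. and 1742 B.C.")
System.DebugLog(result)
```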

Credits

Xojo uses a modified version of the PCRE library package, which is open source software written by Philip Hazel and copyrighted by the University of Cambridge, England.

The source to this library is available at: ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre/

Examples

The following PushButton's Action event handler allows you to search the text in TextArea1 using the search pattern entered into TextField1 and display the results of the search in a Label:

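Var rg As New RegEx
Var myMatch As RegExMatch

// Use the pattern typed into TextField1 to search the text in TextArea1
rg.SearchPattern = TextField1.Value
myMatch = rg.Search(TextArea1.Value)

If myMatch <> Nil Then
  // Subexpression 0 is the entire matched text
  Label1.Value = myMatch.SubExpressionString(0)
Else
  Label1.Value = "Text not found!"
End If

Exception err As RegExException
  // Reached if the RegEx calls above raise a RegExException
  MessageBox(err.Message)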

See Also

RegExMatch, RegExOptions classes; RegExException Error.