6.4 Regular Expression Basics
Let's start with simple
expressions using the
Regex and the Match classes.
Match m = Regex.Match("abracadabra", "(a|b|r)+");
This results in an instance of the Match class
that can be tested for success without examining the contents of the
matched string, as follows:
if (m.Success) {
...
}
To use the matched substring, simply convert it to a string:
Console.WriteLine("Match="+m.ToString( ));
The output of this example is the portion of the string that has been
successfully matched, as follows:
Match=abra
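As a minimal sketch of what else a Match exposes, the following
example reads the Value, Index, and Length properties instead of
calling ToString. The MatchBasics class and its Describe helper are
hypothetical names introduced only for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class MatchBasics
{
    // Summarizes a match: its text, start offset, and length.
    public static string Describe(string input, string pattern)
    {
        Match m = Regex.Match(input, pattern);
        if (!m.Success)
            return "no match";
        // m.Value holds the same text that m.ToString() returns
        return "Value=" + m.Value + " Index=" + m.Index + " Length=" + m.Length;
    }

    static void Main()
    {
        Console.WriteLine(Describe("abracadabra", "(a|b|r)+")); // Value=abra Index=0 Length=4
        Console.WriteLine(Describe("xyz", "(a|b|r)+"));         // no match
    }
}
```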
Simple string replacements are also very straightforward. For
example, consider the following statement:
string s = Regex.Replace("abracadabra", "abra", "zzzz");
This returns the string zzzzcadzzzz, in which all
occurrences of the matching pattern are replaced by the replacement
string zzzz.
Now let's look at a more complex expression:
string s = Regex.Replace(" abra ", @"^\s*(.*?)\s*$", "$1");
This returns the string abra, with preceding and
trailing spaces removed. This pattern is generally useful for
removing leading and trailing whitespace from any string. Note the
use of the verbatim string literal construct (@"...") in C#. Within a
verbatim string, the compiler does not process the backslash
character (\) as an escape character. Consequently,
@"..." is very useful when working with regular
expressions that specify escaped metacharacters with
a \, since otherwise each backslash would have to be doubled. Also of
note is the use of $1 as the replacement string. The only special
constructs recognized in a replacement string are substitutions,
such as references to capture groups in the regular expression; all
other characters are treated as literal text.
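To illustrate substitutions a little further, here is a small sketch
that mixes group references with literal text in the replacement
string. The SubstitutionDemo class and both helper methods are
hypothetical names used only for this example.

```csharp
using System;
using System.Text.RegularExpressions;

static class SubstitutionDemo
{
    // Swaps two whitespace-separated words using the $1 and $2 substitutions.
    public static string SwapWords(string input)
    {
        return Regex.Replace(input, @"^(\w+)\s+(\w+)$", "$2 $1");
    }

    // Trims the input and wraps it in brackets; the brackets are literal text.
    public static string TrimAndBracket(string input)
    {
        return Regex.Replace(input, @"^\s*(.*?)\s*$", "[$1]");
    }

    static void Main()
    {
        Console.WriteLine(SwapWords("hello world"));   // world hello
        Console.WriteLine(TrimAndBracket("  abra  ")); // [abra]
    }
}
```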
Now let's try a slightly more complex sample by
doing a walk-through of a grouping structure. For example:
string text = "abracadabra1abracadabra2abracadabra3";
string pat = @"
(        # start the first group
  abra   # match the literal 'abra'
  (      # start the second (inner) group
    cad  # match the literal 'cad'
  )?     # end the second (optional) group
)        # end the first group
+        # match one or more occurrences
";
// create a new regex that ignores comments
Regex r = new Regex(pat, RegexOptions.IgnorePatternWhitespace);
// get the list of group numbers
int[ ] gnums = r.GetGroupNumbers( );
// get first match
Match m = r.Match(text);
while (m.Success) {
  // start at group 1
  for (int i = 1; i < gnums.Length; i++) {
    // get the group for this match
    Group g = m.Groups[gnums[i]];
    Console.WriteLine("Group"+gnums[i]+"=["+g.ToString( )+"]");
    // get caps for this group
    CaptureCollection cc = g.Captures;
    for (int j = 0; j < cc.Count; j++) {
      Capture c = cc[j];
      Console.WriteLine("Capture" + j + "=["+c.ToString( )
        + "] Index=" + c.Index + " Length=" + c.Length);
    }
  }
  // get next match
  m = m.NextMatch( );
}
The output of this example is:
Group1=[abra]
Capture0=[abracad] Index=0 Length=7
Capture1=[abra] Index=7 Length=4
Group2=[cad]
Capture0=[cad] Index=4 Length=3
Group1=[abra]
Capture0=[abracad] Index=12 Length=7
Capture1=[abra] Index=19 Length=4
Group2=[cad]
Capture0=[cad] Index=16 Length=3
Group1=[abra]
Capture0=[abracad] Index=24 Length=7
Capture1=[abra] Index=31 Length=4
Group2=[cad]
Capture0=[cad] Index=28 Length=3
We'll first examine the string
pat, which contains the regular expression. The
first capture group is opened by the first parenthesis, after which
the expression matches the literal abra. Then the second
capture group begins, marked by the second parenthesis, while the
definition of the first capture group is still ongoing. This
means that the second group is nested inside the first: the second
group matches cad, and the first group matches abra together with
whatever the second group matched. Because the ? metacharacter makes
the cad match optional, the first group matches either
abra or abracad.
Next, we end the first group, and ask the expression to match one or
more occurrences by specifying the +
metacharacter.
Before matching, we create an instance of the expression
by calling the Regex constructor, which is also
where you specify your options. In this case, we used the
RegexOptions.IgnorePatternWhitespace option, as
the regular expression itself includes comments and whitespace for
formatting purposes. This option
instructs the regex engine to ignore both the comments and all the
whitespace that is not explicitly escaped.
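The effect of escaping whitespace can be sketched as follows. In this
pattern only the escaped space is significant, so the input must
contain a space between abra and cad. The WhitespaceDemo class, its
Pattern constant, and the Matches helper are hypothetical names
chosen for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class WhitespaceDemo
{
    // Under IgnorePatternWhitespace, only the escaped space is significant.
    public const string Pattern = @"
        abra   # the literal 'abra'
        \      # an escaped space, which must appear in the input
        cad    # the literal 'cad'
    ";

    public static bool Matches(string input)
    {
        return Regex.IsMatch(input, Pattern, RegexOptions.IgnorePatternWhitespace);
    }

    static void Main()
    {
        Console.WriteLine(Matches("abra cad")); // True
        Console.WriteLine(Matches("abracad"));  // False
    }
}
```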
Next, we retrieve the list of group numbers
(gnums) defined in this regular expression.
Although this could have been done explicitly, this sample
demonstrates a programmatic approach. This approach is also useful if
we have specified named groups as a way of quickly indexing through
the set of groups.
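The same programmatic approach works with named groups, as in this
rough sketch, which uses GetGroupNames in place of GetGroupNumbers.
The group names word and suffix, along with the NamedGroupDemo class,
are assumptions made up for the example and do not appear in the
sample above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class NamedGroupDemo
{
    // 'word' and 'suffix' are illustrative group names.
    static readonly Regex R = new Regex(@"(?<word>abra)(?<suffix>cad)?");

    // Lists every group (including group 0) by name for the first match.
    public static string Describe(string text)
    {
        Match m = R.Match(text);
        List<string> parts = new List<string>();
        foreach (string name in R.GetGroupNames())
            parts.Add(name + "=[" + m.Groups[name].Value + "]");
        return string.Join(" ", parts);
    }

    static void Main()
    {
        // prints: 0=[abracad] word=[abra] suffix=[cad]
        Console.WriteLine(Describe("abracadabra"));
    }
}
```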
Then we perform the first match and enter a loop to test for success
of the current match. The next step is to iterate through the list of
groups starting at group 1. The reason we do not use group 0 in this
sample is that group 0 is the fully captured match string, and what
we usually (but not always) want to pick out of a string is a
subgroup. You might use group 0 if you want to collect the fully
matched string as a single string.
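As a brief sketch of the difference, the following compares group 0
with group 1 for the first match of the same expression. The
GroupZeroDemo class and its helper are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

static class GroupZeroDemo
{
    // Returns the value of the requested group for the first match.
    public static string GroupValue(string text, int groupNumber)
    {
        Match m = Regex.Match(text, "(abra(cad)?)+");
        return m.Groups[groupNumber].Value;
    }

    static void Main()
    {
        string text = "abracadabra1abracadabra2abracadabra3";
        Console.WriteLine(GroupValue(text, 0)); // abracadabra (the whole match)
        Console.WriteLine(GroupValue(text, 1)); // abra (last capture of group 1)
    }
}
```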
Within each group, we iterate through the
CaptureCollection. There is usually only one
capture per match, per group, but in this case two captures show for
Group1: Capture0 and
Capture1. And if we ask only for the
ToString of Group1, we receive
abra, although the group also captured the
abracad substring. The group
ToString value is the value of the last
Capture in its
CaptureCollection. This is the expected behavior.
If we want each match to stop after a single occurrence of the
group, we can remove the + from
the expression, telling the regular expression engine to match
the group just once per match.
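To see the effect of removing the +, here is a minimal sketch that
collects every match of the expression without the quantifier. Each
match then covers a single occurrence, either abracad or abra. The
SingleOccurrenceDemo class and its MatchValues helper are
hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class SingleOccurrenceDemo
{
    // Returns the text of every match of the pattern without the + quantifier.
    public static List<string> MatchValues(string text)
    {
        List<string> values = new List<string>();
        foreach (Match m in Regex.Matches(text, "(abra(cad)?)"))
            values.Add(m.Value);
        return values;
    }

    static void Main()
    {
        string text = "abracadabra1abracadabra2abracadabra3";
        // prints: abracad,abra,abracad,abra,abracad,abra
        Console.WriteLine(string.Join(",", MatchValues(text)));
    }
}
```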