6.4 Regular Expression Basics

Let's start with simple expressions using the Regex and the Match classes.

Match m = Regex.Match("abracadabra", "(a|b|r)+");

This results in an instance of the Match class that can be tested for success without examining the contents of the matched string, as follows:

if (m.Success) {
  ...
}

To use the matched substring, simply convert it to a string:

Console.WriteLine("Match="+m.ToString( ));

The output of this example is the portion of the string that has been successfully matched, as follows:

Match=abra
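
As a minimal sketch of the same idea, the following program also reads the Value, Index, and Length properties of the Match object. The class name MatchBasics and the helper Describe are hypothetical names chosen for illustration only:

```csharp
using System;
using System.Text.RegularExpressions;

public static class MatchBasics
{
    // Hypothetical helper: summarizes the first match, or reports that there is none.
    public static string Describe(string input, string pattern)
    {
        Match m = Regex.Match(input, pattern);
        if (!m.Success)
        {
            return "no match";
        }
        // Value is the matched substring; Index and Length locate it in the input.
        return "Match=" + m.Value + " Index=" + m.Index + " Length=" + m.Length;
    }

    public static void Main()
    {
        Console.WriteLine(Describe("abracadabra", "(a|b|r)+"));  // Match=abra Index=0 Length=4
        // A failed match still returns a Match object; its Success property is false.
        Console.WriteLine(Describe("abracadabra", "xyz"));       // no match
    }
}
```

The match stops at index 4 because the character c is not one of the alternatives a, b, or r.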

Simple string replacements are also very straightforward. For example, consider the following statement:

string s = Regex.Replace("abracadabra", "abra", "zzzz");

This returns the string zzzzcadzzzz, in which all occurrences of the matching pattern are replaced by the replacement string zzzz.
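
If only some of the occurrences should be replaced, one option is the instance Replace overload that takes a maximum count. The sketch below assumes that approach; ReplaceDemo and ReplaceFirstN are hypothetical names, not part of the framework:

```csharp
using System;
using System.Text.RegularExpressions;

public static class ReplaceDemo
{
    // Hypothetical helper: replaces at most maxCount occurrences, scanning from the left.
    public static string ReplaceFirstN(string input, string pattern,
                                       string replacement, int maxCount)
    {
        Regex r = new Regex(pattern);
        return r.Replace(input, replacement, maxCount);
    }

    public static void Main()
    {
        // The static method replaces every occurrence.
        Console.WriteLine(Regex.Replace("abracadabra", "abra", "zzzz"));   // zzzzcadzzzz
        // The instance overload with a count stops after the first occurrence.
        Console.WriteLine(ReplaceFirstN("abracadabra", "abra", "zzzz", 1)); // zzzzcadabra
    }
}
```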

Now let's look at a more complex expression:

string s = Regex.Replace("  abra  ", @"^\s*(.*?)\s*$", "$1");

This returns the string abra, with leading and trailing spaces removed. This pattern is generally useful for trimming whitespace from both ends of any string. Note the use of the verbatim string literal construct (@"...") in C#. Within a verbatim string literal, the compiler does not process the backslash character (\) as an escape character. Consequently, @"..." is very useful when working with regular expressions, in which metacharacters are themselves escaped with a \. Also of note is the use of $1 in the replacement string. Substitutions such as $1, which refer to capture groups in the regular expression, are the only special constructs recognized in a replacement string; all other characters are copied literally.
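
The following sketch illustrates both points: a regular string literal needs doubled backslashes to produce the same pattern as the verbatim form, and a replacement string may mix literal text with a substitution. TrimDemo and its Trim helper are hypothetical names used only for this example:

```csharp
using System;
using System.Text.RegularExpressions;

public static class TrimDemo
{
    // Hypothetical helper: applies the trimming pattern with a caller-supplied replacement.
    public static string Trim(string input, string replacement)
    {
        return Regex.Replace(input, @"^\s*(.*?)\s*$", replacement);
    }

    public static void Main()
    {
        // Without the @ prefix, each backslash must be doubled.
        string regular = "^\\s*(.*?)\\s*$";
        string verbatim = @"^\s*(.*?)\s*$";
        Console.WriteLine(regular == verbatim);                 // True

        // $1 alone yields the trimmed text.
        Console.WriteLine("[" + Trim("  abra  ", "$1") + "]");  // [abra]
        // Literal characters around $1 are copied into the result unchanged.
        Console.WriteLine(Trim("  abra  ", "<$1>"));            // <abra>
    }
}
```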

Now let's try a slightly more complex sample by doing a walk-through of a grouping structure. For example:

string text = "abracadabra1abracadabra2abracadabra3";
string pat = @"
    (        # start the first group
      abra   # match the literal 'abra'
      (      # start the second (inner) group
        cad  # match the literal 'cad'
      )?     # end the second (optional) group
    )        # end the first group
    +        # match one or more occurrences
    ";
// create a new regex that ignores pattern whitespace and comments
Regex r = new Regex(pat, RegexOptions.IgnorePatternWhitespace);
// get the list of group numbers
int[ ] gnums = r.GetGroupNumbers( );
// get first match
Match m = r.Match(text);
while (m.Success) {
  // start at group 1
  for (int i = 1; i < gnums.Length; i++) {
    // get the group for this match
    Group g = m.Groups[gnums[i]];
    Console.WriteLine("Group"+gnums[i]+"=["+g.ToString( )+"]");
    // get caps for this group
    CaptureCollection cc = g.Captures;
    for (int j = 0; j < cc.Count; j++) {
      Capture c = cc[j];
      Console.WriteLine("\tCapture" + j + "=["+c.ToString( )
         + "] Index=" + c.Index + " Length=" + c.Length);
    }
  }
  // get next match
  m = m.NextMatch( );
}

The output of this example is:

Group1=[abra]
        Capture0=[abracad] Index=0 Length=7
        Capture1=[abra] Index=7 Length=4
Group2=[cad]
        Capture0=[cad] Index=4 Length=3
Group1=[abra]
        Capture0=[abracad] Index=12 Length=7
        Capture1=[abra] Index=19 Length=4
Group2=[cad]
        Capture0=[cad] Index=16 Length=3
Group1=[abra]
        Capture0=[abracad] Index=24 Length=7
        Capture1=[abra] Index=31 Length=4
Group2=[cad]
        Capture0=[cad] Index=28 Length=3

We'll first examine the string pat, which contains the regular expression. The first capture group begins at the first opening parenthesis, and the expression then matches the literal abra. The second capture group begins at the second opening parenthesis while the definition of the first group is still ongoing, so the second group is nested inside the first. On its own, this would mean that the first group must match abracad and the second group matches cad. Because we made the second group optional with the ? metacharacter, the first group matches either abra or abracad. Finally, we end the first group and ask the expression to match one or more occurrences of it by specifying the + metacharacter.

Before matching, we create an instance of the expression by calling the Regex constructor, which is also where you specify your options. In this case, we used the RegexOptions.IgnorePatternWhitespace option, as the regular expression itself includes comments and whitespace for formatting purposes. This option instructs the regex engine to ignore all whitespace in the pattern that is not explicitly escaped, and to treat everything from a # to the end of the line as a comment.
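
To see the difference between escaped and unescaped whitespace under this option, consider the following sketch. The two patterns and the names PatternWhitespaceDemo and Matches are made up for illustration:

```csharp
using System;
using System.Text.RegularExpressions;

public static class PatternWhitespaceDemo
{
    // Hypothetical helper: tests a pattern with IgnorePatternWhitespace enabled.
    public static bool Matches(string input, string pattern)
    {
        return Regex.IsMatch(input, pattern, RegexOptions.IgnorePatternWhitespace);
    }

    public static void Main()
    {
        // The unescaped space is dropped, so this pattern is equivalent to abracad.
        string unescaped = @"abra cad   # the space is ignored";
        // The backslash-escaped space must appear in the input.
        string escaped = @"abra\ cad   # the escaped space is matched literally";

        Console.WriteLine(Matches("abracad", unescaped));   // True
        Console.WriteLine(Matches("abra cad", unescaped));  // False
        Console.WriteLine(Matches("abra cad", escaped));    // True
        Console.WriteLine(Matches("abracad", escaped));     // False
    }
}
```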

Next, we retrieve the list of group numbers (gnums) defined in this regular expression. Although we could have hardcoded the group numbers, this sample demonstrates a programmatic approach. The approach is also useful when the expression contains named groups, as a quick way of indexing through the whole set of groups.
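
A comparable sketch for named groups follows. It assumes a variant of the sample pattern in which the two groups carry the hypothetical names outer and inner, and it enumerates them with GetGroupNames and GroupNumberFromName:

```csharp
using System;
using System.Text.RegularExpressions;

public static class NamedGroupDemo
{
    // Same structure as the sample pattern; the names "outer" and "inner" are hypothetical.
    public const string Pattern = @"(?<outer>abra(?<inner>cad)?)+";

    public static void Main()
    {
        Regex r = new Regex(Pattern);
        Match m = r.Match("abracadabra1abracadabra2abracadabra3");

        // GetGroupNames returns the name of every group, including group 0.
        foreach (string name in r.GetGroupNames())
        {
            // Map each name back to its number and look the group up by number.
            int number = r.GroupNumberFromName(name);
            Console.WriteLine(name + " (" + number + ")=[" + m.Groups[number].Value + "]");
        }
    }
}
```

Named groups can also be fetched directly by name, for example m.Groups["outer"].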

Then we perform the first match and enter a loop to test for success of the current match. The next step is to iterate through the list of groups starting at group 1. The reason we do not use group 0 in this sample is that group 0 is the fully captured match string, and what we usually (but not always) want to pick out of a string is a subgroup. You might use group 0 if you want to collect the fully matched string as a single string.
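
As a small illustration of the difference between group 0 and the subgroups, the sketch below prints all three groups for the first match only. GroupZeroDemo and GroupValue are hypothetical names:

```csharp
using System;
using System.Text.RegularExpressions;

public static class GroupZeroDemo
{
    // Hypothetical helper: returns the value of one group from the first match.
    public static string GroupValue(string text, string pattern, int groupNumber)
    {
        Match m = Regex.Match(text, pattern);
        return m.Groups[groupNumber].Value;
    }

    public static void Main()
    {
        string text = "abracadabra1abracadabra2abracadabra3";
        string pat = "(abra(cad)?)+";
        Console.WriteLine(GroupValue(text, pat, 0));  // abracadabra (the whole match)
        Console.WriteLine(GroupValue(text, pat, 1));  // abra
        Console.WriteLine(GroupValue(text, pat, 2));  // cad
    }
}
```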

Within each group, we iterate through the CaptureCollection. There is usually only one capture per match, per group, but in this case two captures show for Group1: Capture0 and Capture1. If we ask only for the ToString of Group1, we receive abra, although the group also matched the substring abracad. This is the expected behavior: the group's ToString value is the value of the last Capture in its CaptureCollection. If we want each match to stop after a single abra or abracad, we can remove the + from the expression, telling the regular expression engine to match just one occurrence of the group.
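
The effect of the + on the capture collections can be checked with a short sketch that counts the captures of group 1 for every match, with and without the quantifier. CaptureCountDemo and Group1CaptureCounts are hypothetical names:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CaptureCountDemo
{
    // Hypothetical helper: for each match, records how many captures group 1 holds.
    public static List<int> Group1CaptureCounts(string text, string pattern)
    {
        List<int> counts = new List<int>();
        for (Match m = Regex.Match(text, pattern); m.Success; m = m.NextMatch())
        {
            counts.Add(m.Groups[1].Captures.Count);
        }
        return counts;
    }

    public static void Main()
    {
        string text = "abracadabra1abracadabra2abracadabra3";
        // With +: three matches, each holding two captures in group 1.
        Console.WriteLine(string.Join(",", Group1CaptureCounts(text, "(abra(cad)?)+")));  // 2,2,2
        // Without +: six shorter matches, each holding exactly one capture.
        Console.WriteLine(string.Join(",", Group1CaptureCounts(text, "(abra(cad)?)")));   // 1,1,1,1,1,1
    }
}
```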
