5.7. Regular Expressions

The use of strings and expressions to perform pattern matching dates from the earliest programming languages. In the mid-1960s SNOBOL was designed for the express purpose of text and string manipulation. It influenced the subsequent development of the grep tool in the Unix environment that makes extensive use of regular expressions. Those who have worked with grep or Perl or other scripting languages will recognize the similarity in the .NET implementation of regular expressions.

Pattern matching is based on the simple concept of applying a special pattern string to some text source in order to match an instance or instances of that pattern within the text. The pattern applied against the text is referred to as a regular expression, or regex, for short.

Entire books have been devoted to the topic of regular expressions. This section is intended to provide the essential knowledge required to get you started using regular expressions in the .NET world. The focus is on using the Regex class, and creating regular expressions from the set of characters and symbols available for that purpose.

The Regex Class

You can think of the Regex class as the engine that evaluates regular expressions and applies them to target strings. It provides both static and instance methods that use regexes for text searching, extraction, and replacement. The Regex class and all related classes are found in the System.Text.RegularExpressions namespace.

Syntax:

    Regex( string pattern )
    Regex( string pattern, RegexOptions)

Parameters:
Example:

    Regex r1 = new Regex(" ");   // Regular expression is a blank
    string[] words = r1.Split("red blue orange yellow");
    // Regular expression matches upper- or lowercase "at"
    Regex r2 = new Regex("at", RegexOptions.IgnoreCase);

As the example shows, creating a Regex object is quite simple. The first parameter to its constructor is a regular expression. The optional second parameter is one or more (separated by |) RegexOptions enum values that control how the regex is applied.

Regex Methods

The Regex class contains a number of methods for pattern matching and text manipulation. These include IsMatch, Replace, Split, Match, and Matches. All have instance and static overloads that are similar, but not identical.

Core Recommendation
Let's now examine some of the more important Regex methods. We'll keep the regular expressions simple for now because the emphasis at this stage is on understanding the methods, not regular expressions.

IsMatch()

This method matches the regular expression against an input string and returns a boolean value indicating whether a match is found.

    string searchStr = "He went that a way";
    Regex myRegex = new Regex("at");
    // Instance methods
    bool match = myRegex.IsMatch(searchStr);      // true
    // Begin search at position 12 in the string
    match = myRegex.IsMatch(searchStr, 12);       // false
    // Static methods: both return true
    match = Regex.IsMatch(searchStr, "at");
    match = Regex.IsMatch(searchStr, "AT", RegexOptions.IgnoreCase);

Replace()

This method returns a string that replaces occurrences of a matched pattern with a specified replacement string. This method has several overloads that permit you to specify a start position for the search or control how many replacements are made.

    static Replace (string input, string pattern, string replacement [,RegexOptions])
    Replace(string input, string replacement)
    Replace(string input, string replacement, int count)
    Replace(string input, string replacement, int count, int startat)

The count parameter denotes the maximum number of replacements to make; startat indicates where in the string to begin the matching process. There are also versions of this method, which you may want to explore further, that accept a MatchEvaluator delegate parameter. This delegate is called each time a match is found and can be used to customize the replacement process.
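The MatchEvaluator overload mentioned above can be sketched as follows. This is a minimal illustration, not code from the text: the class EvaluatorDemo and the methods TagWithIndex and TagMatches are hypothetical names invented for the example. The delegate receives each Match and returns the text that should replace it.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EvaluatorDemo
{
    // Hypothetical callback: invoked once per match. Whatever it returns
    // replaces the matched text in the result string.
    private static string TagWithIndex(Match m)
    {
        return m.Value.ToUpperInvariant() + "@" + m.Index;
    }

    // Hypothetical helper that wires the callback into Regex.Replace.
    public static string TagMatches(string input, string pattern)
    {
        MatchEvaluator evaluator = new MatchEvaluator(TagWithIndex);
        return Regex.Replace(input, pattern, evaluator);
    }

    public static void Main()
    {
        // Each "o" is upper-cased and tagged with its position in the input.
        Console.WriteLine(TagMatches("soft rose", "o"));   // sO@1ft rO@6se
    }
}
```

Because the callback sees the full Match object, the replacement can depend on the matched value, its position, or its groups, which a fixed replacement string cannot express.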
Here is a code segment that illustrates the static and instance forms of the Replace method:

    string newStr;
    newStr = Regex.Replace("soft rose", "o", "i");   // sift rise
    // Instance method
    Regex myRegex = new Regex("o");                  // regex = "o"
    // Now specify that only one replacement may occur
    newStr = myRegex.Replace("soft rose", "i", 1);   // sift rose

Split()

This method splits a string at each point a match occurs and places the resulting substrings in an array. It is similar to the String.Split method, except that the match is based on a regular expression rather than a character or character string.

Syntax:

    string[] Split(string input)
    string[] Split(string input, int count)
    string[] Split(string input, int count, int startat)
    static string[] Split(string input, string pattern)

Parameters:
This short example parses a string consisting of a list of artists' last names and places them in an array. A comma followed by zero or more blanks separates the names. The regular expression to match this delimiter string is: ",[ ]*". You will see how to construct this later in the section.

    string impressionists = "Manet,Monet, Degas, Pissarro,Sisley";
    // Regex to match a comma followed by 0 or more spaces
    string patt = @",[ ]*";
    // Static method
    string[] artists = Regex.Split(impressionists, patt);
    // Instance method is used to return a maximum of four substrings
    Regex myRegex = new Regex(patt);
    string[] artists4 = myRegex.Split(impressionists, 4);
    foreach (string master in artists4)
        Console.Write(master);
    // artists4 holds --> "Manet" "Monet" "Degas" "Pissarro,Sisley"

Match() and Matches()

These related methods search an input string for a match to the regular expression. Match() returns a single Match object and Matches() returns the object MatchCollection, a collection of all matches.

Syntax:

    Match Match(string input)
    Match Match(string input, int startat)
    Match Match(string input, int startat, int numchars)
    static Match Match(string input, string pattern, [RegexOptions])

The Matches method has similar overloads but returns a MatchCollection object.

Match and Matches are the most useful Regex methods. The Match object they return is rich in properties that expose the matched string, its length, and its location within the target string. It also includes a Groups property that allows the matched string to be further broken down into matching substrings. Table 5-7 shows selected members of the Match class.
The following code demonstrates the use of these class members. Note that the dot (.) in the regular expression functions as a wildcard character that matches any single character (other than a newline, by default).

    string verse = "In Xanadu did Kubla Khan";
    string patt = ".an...";     // "." matches any character
    Match verseMatch = Regex.Match(verse, patt);
    Console.WriteLine(verseMatch.Value);   // Xanadu
    Console.WriteLine(verseMatch.Index);   // 3

    string newPatt = "K(..)";   // contains group (..)
    Match kMatch = Regex.Match(verse, newPatt);
    while (kMatch.Success) {
        Console.Write(kMatch.Value);       // -->Kub -->Kha
        Console.Write(kMatch.Groups[1]);   // -->ub  -->ha
        kMatch = kMatch.NextMatch();
    }

This example uses NextMatch to iterate through the target string and assign each match to kMatch (if NextMatch is left out, an infinite loop results). The parentheses surrounding the two dots in newPatt break the pattern into groups without affecting the actual pattern matching. In this example, the two characters after K are assigned to group objects that are accessed in the Groups collection.

Sometimes, an application may need to collect all of the matches before processing them, which is the purpose of the MatchCollection class. This class is just a container for holding Match objects and is created using the Regex.Matches method discussed earlier. Its most useful properties are Count, which returns the number of matches, and Item, which returns an individual member of the collection (in C#, Item is exposed as the indexer, for example mc[0]). Here is how the NextMatch loop in the previous example could be rewritten:

    string verse = "In Xanadu did Kubla Khan";
    string newpatt = "K(..)";
    foreach (Match kMatch in Regex.Matches(verse, newpatt))
        Console.Write(kMatch.Value);   // -->Kub -->Kha
    // Could also create explicit collection and work with it.
    MatchCollection mc = Regex.Matches(verse, newpatt);
    Console.WriteLine(mc.Count);       // 2

Creating Regular Expressions

The examples used to illustrate the Regex methods have employed only rudimentary regular expressions.
Now, let's explore how to create regular expressions that are genuinely useful. If you are new to the subject, you will discover that designing Regex patterns tends to be a trial-and-error process, and the endeavor can yield a solution of simple elegance or maddening complexity. Fortunately, almost all of the commonly used patterns can be found on one of the Web sites that maintain a searchable library of Regex patterns (www.regexlib.com is one such site).

A regular expression can be broken down into four different types of metacharacters that have their own role in the matching process:
Table 5-8 summarizes the most frequently used patterns.
A Pattern Matching Example

Let's apply these character patterns to create a regular expression that matches a Social Security Number (SSN):

    bool iMatch = Regex.IsMatch("245-09-8444", @"\d\d\d-\d\d-\d\d\d\d");

This is the most straightforward approach: Each character in the Social Security Number matches a corresponding pattern in the regular expression. It's easy to see, however, that simply repeating symbols can become unwieldy if a long string is to be matched. Repetition characters improve this:

    bool iMatch = Regex.IsMatch("245-09-8444", @"\d{3}-\d{2}-\d{4}");

Another consideration in matching the Social Security Number may be to restrict where it exists in the text. You may want to ensure it is on a line by itself, or at the beginning or end of a line. This requires using position characters at the beginning or end of the matching sequence.

Let's alter the pattern so that it matches only if the Social Security Number exists by itself on the line. To do this, we need two characters: one to ensure the match is at the beginning of the line, and one to ensure that it is also at the end. According to Table 5-9, ^ and $ can be placed around the expression to meet these criteria. The new string is

    @"^\d{3}-\d{2}-\d{4}$"
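The effect of the two anchors can be checked with a short sketch. The class AnchorDemo and its method names are hypothetical, chosen only for this illustration. One point worth noting: with default options, ^ and $ anchor to the start and end of the whole input string, so matching an SSN that sits on its own line inside multi-line text requires RegexOptions.Multiline.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AnchorDemo
{
    private const string Unanchored = @"\d{3}-\d{2}-\d{4}";
    private const string Anchored = @"^\d{3}-\d{2}-\d{4}$";

    // True if an SSN-shaped substring appears anywhere in the text.
    public static bool ContainsSsn(string text)
    {
        return Regex.IsMatch(text, Unanchored);
    }

    // True only if the entire input is an SSN-shaped string.
    public static bool IsOnlySsn(string text)
    {
        return Regex.IsMatch(text, Anchored);
    }

    // True if some line of the text (lines separated by \n) is exactly an SSN.
    public static bool HasSsnLine(string text)
    {
        return Regex.IsMatch(text, Anchored, RegexOptions.Multiline);
    }

    public static void Main()
    {
        Console.WriteLine(ContainsSsn("SSN: 245-09-8444"));            // True
        Console.WriteLine(IsOnlySsn("SSN: 245-09-8444"));              // False
        Console.WriteLine(IsOnlySsn("245-09-8444"));                   // True
        Console.WriteLine(IsOnlySsn("Name: Ann\n245-09-8444\nend"));   // False
        Console.WriteLine(HasSsnLine("Name: Ann\n245-09-8444\nend"));  // True
    }
}
```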
These positional characters do not take up any space in the expression; that is, they indicate where matching may occur but are not involved in the actual matching process.

As a final refinement to the SSN pattern, let's break it into groups so that the three sets of numbers separated by dashes can be easily examined. To create a group, place parentheses around the parts of the expression that you want to examine independently. Here is a simple code example that uses the revised pattern:

    string ssn = "245-09-8444";
    string ssnPatt = @"^(\d{3})-(\d{2})-(\d{4})$";
    Match ssnMatch = Regex.Match(ssn, ssnPatt);
    if (ssnMatch.Success) {
        Console.WriteLine(ssnMatch.Value);          // 245-09-8444
        Console.WriteLine(ssnMatch.Groups.Count);   // 4
        // Count is 4 since Groups[0] is set to entire SSN
        Console.Write(ssnMatch.Groups[1]);          // 245
        Console.Write(ssnMatch.Groups[2]);          // 09
        Console.Write(ssnMatch.Groups[3]);          // 8444
    }

We now have a useful pattern that incorporates position, repetition, and group characters. The approach used to create this pattern (start with an obvious pattern and refine it through multiple stages) is a useful way to create complex regular expressions (see Figure 5-4).

Figure 5-4. Regular expression

Working with Groups

As we saw in the preceding example, the text resulting from a match can be automatically partitioned into substrings or groups by enclosing sections of the regular expression in parentheses. The text that matches the enclosed pattern becomes a member of the Match.Groups[] collection. This collection can be indexed as a zero-based array: the 0 element is the entire match, element 1 is the first group, element 2 the second, and so on.

Groups can be named to make them easier to work with. The name designator is placed adjacent to the opening parenthesis using the syntax ?<name>.
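As a quick illustration of the ?<name> syntax, the SSN pattern could be rewritten with named groups along these lines. The group names area, group, and serial, and the helper NamedGroupDemo.Part, are assumptions made for this sketch and do not come from the text.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NamedGroupDemo
{
    // Same SSN pattern as before, with each parenthesized part given a name.
    private const string SsnPattern =
        @"^(?<area>\d{3})-(?<group>\d{2})-(?<serial>\d{4})$";

    // Hypothetical helper: returns the text captured by the named group,
    // or an empty string when the input is not a well-formed SSN.
    public static string Part(string ssn, string groupName)
    {
        Match m = Regex.Match(ssn, SsnPattern);
        return m.Success ? m.Groups[groupName].Value : string.Empty;
    }

    public static void Main()
    {
        string ssn = "245-09-8444";
        Console.WriteLine(Part(ssn, "area"));     // 245
        Console.WriteLine(Part(ssn, "group"));    // 09
        Console.WriteLine(Part(ssn, "serial"));   // 8444

        // Named groups can still be reached by number.
        Match m = Regex.Match(ssn, SsnPattern);
        Console.WriteLine(m.Groups[1].Value);     // 245
    }
}
```

Referring to a group by name keeps the calling code stable if another parenthesized group is later added to the pattern and the numbering shifts.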
To demonstrate the use of groups, let's suppose we need to parse a string containing the forecasted temperatures for the week (for brevity, only two days are included):

    string txt = "Monday Hi:88 Lo:56 Tuesday Hi:91 Lo:61";

The regex to match this includes two groups: day and temps. The following code creates a collection of matches and then iterates through the collection, printing the content of each group:

    string rgPatt = @"(?<day>[a-zA-Z]+)\s*(?<temps>Hi:\d+\s*Lo:\d+)";
    MatchCollection mc = Regex.Matches(txt, rgPatt);   // Get matches
    foreach (Match m in mc)
    {
        Console.WriteLine("{0} {1}",
            m.Groups["day"], m.Groups["temps"]);
    }
    // Output: Monday Hi:88 Lo:56
    //         Tuesday Hi:91 Lo:61

Core Note
Backreferencing a Group

It is often useful to create a regular expression that includes matching logic based on the results of previous matches within the expression. For example, during a grammatical check, word processors flag any word that is a repeat of the preceding word(s). We can create a regular expression to perform the same operation. The secret is to define a group that matches a word and then use the matched value as part of the pattern. To illustrate, consider the following code:

    string speech = "Four score and and seven years";
    string patt = @"(\b[a-zA-Z]+\b)\s\1";   // Match repeated words
    MatchCollection mc = Regex.Matches(speech, patt);
    foreach (Match m in mc) {
        Console.WriteLine(m.Groups[1]);     // --> and
    }

This code matches only the repeated words. Let's examine the regular expression:
A group can also be referenced by name rather than number. The syntax for this backreference is \k followed by the group name enclosed in <>:
patt = @"(?<word>\b[a-zA-Z]+\b)\s\k<word>";
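The numbered and named forms of the backreference can be compared side by side in a short sketch. The class RepeatedWordDemo and the method FindRepeats are hypothetical names used only here. The third pattern, with a trailing \b, is an addition of this sketch rather than something from the text.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RepeatedWordDemo
{
    public const string ByNumber = @"(\b[a-zA-Z]+\b)\s\1";
    public const string ByName = @"(?<word>\b[a-zA-Z]+\b)\s\k<word>";
    // Variation (not from the text): a trailing \b requires the repeat
    // to be a whole word rather than the start of a longer one.
    public const string WholeWord = @"(\b[a-zA-Z]+\b)\s\1\b";

    // Hypothetical helper: collects the word captured by group 1 of each match.
    public static List<string> FindRepeats(string text, string pattern)
    {
        var found = new List<string>();
        foreach (Match m in Regex.Matches(text, pattern))
        {
            found.Add(m.Groups[1].Value);
        }
        return found;
    }

    public static void Main()
    {
        string speech = "Four score and and seven years";
        Console.WriteLine(string.Join(",", FindRepeats(speech, ByNumber)));   // and
        Console.WriteLine(string.Join(",", FindRepeats(speech, ByName)));     // and

        // "the then" is not a repeated word, but \1 alone matches the
        // first three letters of "then".
        Console.WriteLine(FindRepeats("the then", ByNumber).Count);    // 1
        Console.WriteLine(FindRepeats("the then", WholeWord).Count);   // 0
    }
}
```

The last two lines show one of the nuances of pattern design: without a word boundary after the backreference, a word followed by a longer word that begins with it is reported as a repeat.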
Examples of Using Regular Expressions

This section closes with a quick look at some patterns that can be used to handle common pattern matching challenges. Two things should be clear from these examples: There are virtually unlimited ways to create expressions to solve a single problem, and many pattern matching problems involve nuances that are not immediately obvious.

Using Replace to Reverse Words

    string userName = "Claudel, Camille";
    userName = Regex.Replace( userName, @"(\w+),\s*(\w+)", "$2 $1" );
    Console.WriteLine(userName);   // Camille Claudel

The regular expression assigns the last and first name to groups 1 and 2. The third parameter in the Replace method allows these groups to be referenced by placing $ in front of the group number. In this case, the effect is to replace the entire matched name with the match from group 2 (first name) followed by the match from group 1 (last name).

Parsing Numbers

    string myText = "98, 98.0, +98.0, +98";
    string numPatt = @"\d+";                          // Integer
    numPatt = @"(\d+\.?\d*)|(\.\d+)";                 // Allow decimal
    numPatt = @"([+-]?\d+\.?\d*)|([+-]?\.\d+)";       // Allow + or -

Note the use of the OR (|) symbol in the third line of code to offer alternate patterns. In this case, it permits an optional number before the decimal.

The following code uses the ^ character to anchor the pattern to the beginning of the line. The regular expression contains a group that matches four hex digits at a time. The * character causes the group to be repeated until there is nothing to match. Each time the group is applied, it captures a 4-digit hex number that is placed in the CaptureCollection object.
    string hex = "00AA001CFF0C";
    string hexPatt = @"^(?<hex4>[a-fA-F\d]{4})*";
    Match hexMatch = Regex.Match(hex, hexPatt);
    Console.WriteLine(hexMatch.Value);   // --> 00AA001CFF0C
    CaptureCollection cc = hexMatch.Groups["hex4"].Captures;
    foreach (Capture c in cc)
        Console.Write(c.Value);          // --> 00AA 001C FF0C

Figure 5-5 shows the hierarchical relationship among the Match, GroupCollection, and CaptureCollection classes.

Figure 5-5. Hex numbers captured by regular expression
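The Match, Group, and Capture hierarchy can also be walked in code. The following is a minimal sketch under the same pattern as above. The class CaptureDemo and the method HexWords are hypothetical names. It also shows a detail that is easy to miss: when a group repeats, the group's own Value holds only the last capture, while the Captures collection holds all of them.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CaptureDemo
{
    private const string HexPattern = @"^(?<hex4>[a-fA-F\d]{4})*";

    // Hypothetical helper: returns every 4-digit chunk captured by the
    // repeated hex4 group, in the order the captures were made.
    public static List<string> HexWords(string hex)
    {
        Match m = Regex.Match(hex, HexPattern);
        var words = new List<string>();
        foreach (Capture c in m.Groups["hex4"].Captures)
        {
            words.Add(c.Value);
        }
        return words;
    }

    public static void Main()
    {
        string hex = "00AA001CFF0C";
        Match m = Regex.Match(hex, HexPattern);

        Console.WriteLine(m.Value);                          // 00AA001CFF0C
        Console.WriteLine(m.Groups["hex4"].Value);           // FF0C (last capture only)
        Console.WriteLine(m.Groups["hex4"].Captures.Count);  // 3
        Console.WriteLine(string.Join(" ", HexWords(hex)));  // 00AA 001C FF0C
    }
}
```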