12.11. Properties

Explicit character classes are frequently used to match character ranges, especially alphabetics. For example:

    # Alphabetics-only identifier...
    Readonly my $ALPHA_IDENT => qr/ [A-Z] [A-Za-z]* /xms;

However, a character class like that doesn't actually match all possible alphabetics. It matches only ASCII alphabetics. It won't recognize the common Latin-1 variants, let alone the full gamut of Unicode alphabetics. That result might be okay, if you're sure your data will never be other than parochial, but in today's post-modern, multicultural, outsourced world it's rather déclassé.

Regular expressions in Perl 5.6 and later[*] support the use of the \p{...} escape, which allows you to use full Unicode properties. Properties are Unicode-compliant named character classes and are both more general and more self-documenting than explicit ASCII character classes. The perlunicode manpage explains the mechanism in detail and lists the available properties.
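As a rough illustration of the gap described above, the following sketch runs the ASCII-only pattern against a few words and, alongside it, applies a single property test to the first character. The sample words and the variable names are made up for this example, the pattern is anchored with \A and \z so that the whole word has to match, and plain "my" is used instead of Readonly only to keep the snippet free of non-core modules.

```perl
use strict;
use warnings;
use utf8;                           # this source contains non-ASCII literals

binmode STDOUT, ':encoding(UTF-8)';

# The ASCII-only identifier pattern from the text, anchored so that the
# entire string has to match (the anchors are an addition for this demo)
my $ascii_ident = qr/ \A [A-Z] [A-Za-z]* \z /xms;

# Hypothetical sample words, chosen only for illustration
my @words = ('Perl', 'Über', 'Déclassé');

for my $word (@words) {
    # Does the whole word satisfy the explicit ASCII character classes?
    my $ascii_ok = $word =~ $ascii_ident ? 'yes' : 'no';

    # Does the word start with a character having the Uppercase property?
    my $upper_ok = $word =~ m/ \A \p{Uppercase} /xms ? 'yes' : 'no';

    printf "%-9s ascii=%-3s upper=%s\n", $word, $ascii_ok, $upper_ok;
}
```

Only the pure-ASCII word satisfies the explicit character classes, even though all three words begin with a character that carries the Uppercase property.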
So, if you're ready to concede that ASCII-centrism is a naïve façade that's gradually fading into Götterdämmerung, you might choose to bid it adiós and open your regexes to the full Unicode smörgåsbord, by changing the previous identifier regex to:

    Readonly my $ALPHA_IDENT => qr/ \p{Uppercase} \p{Alphabetic}* /xms;

There are even properties to help create identifiers that follow the normal Perl conventions but are still language-independent. Instead of:

    Readonly my $PERL_IDENT => qr/ [A-Za-z_] \w* /xms;

you can use:

    Readonly my $PERL_IDENT => qr/ \p{ID_Start} \p{ID_Continue}* /xms;

One other particularly useful property is \p{Any}, which provides a more readable alternative to the normal dot (.) metacharacter. For example, instead of:

    m/ [{] . [.] \d{2} [}] /xms;

you could write:

    m/ [{] \p{Any} [.] \d{2} [}] /xms;

and leave the reader in no doubt that the second character to be matched really can be anything at all: an ASCII alphabetic, a Latin-1 superscript, an Extended Latin diacritical, a Devanagari number, an Ogham rune, or even a Bopomofo symbol.
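
The two property-based patterns above can be tried out with a short script. This is only a sketch: the candidate names and sample strings are invented for the demonstration, the \A and \z anchors are added so that each candidate is tested as a whole, and plain "my" replaces Readonly to avoid a non-core dependency. One caveat worth checking against your own requirements: the underscore has the ID_Continue property but not ID_Start, so a name with a leading underscore is rejected by the property version unless you allow it explicitly, for example with [\p{ID_Start}_].

```perl
use strict;
use warnings;
use utf8;                           # this source contains non-ASCII literals

binmode STDOUT, ':encoding(UTF-8)';

# Language-independent identifier pattern from the text, anchored here
# so that the entire candidate has to match
my $PERL_IDENT = qr/ \A \p{ID_Start} \p{ID_Continue}* \z /xms;

# Hypothetical candidate names, chosen only for illustration
my @candidates = ('count', 'größe', 'naïve_total', '2fast', '_private');

for my $name (@candidates) {
    my $verdict = $name =~ $PERL_IDENT ? 'accepted' : 'rejected';
    printf "%-12s %s\n", $name, $verdict;
}

# \p{Any} matches any single character, newline included
my $ANY_ITEM = qr/ [{] \p{Any} [.] \d{2} [}] /xms;

# Second character: an ASCII letter, a newline, a Devanagari digit
my @samples = ('{x.42}', "{\n.07}", "{\x{0967}.15}");

# grep in scalar context returns the number of matching samples
my $matched = grep { $_ =~ $ANY_ITEM } @samples;
printf "%d of %d samples matched\n", $matched, scalar @samples;
```

The first three candidates should be accepted; '2fast' and '_private' should be rejected because a digit and an underscore lack the ID_Start property.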