12.11. Properties
Explicit character classes are frequently used to match character ranges, especially alphabetics. For example:
# Alphabetics-only identifier...
Readonly my $ALPHA_IDENT => qr/ [A-Z] [A-Za-z]* /xms;
However, a character class like that doesn't actually match all possible alphabetics. It matches only ASCII alphabetics. It won't recognize the common Latin-1 variants, let alone the full gamut of Unicode alphabetics. That result might be okay, if you're sure your data will never be other than parochial, but in today's post-modern, multicultural, outsourced world it's rather déclassé for an überhacking r
Regular expressions in Perl 5.6 and later[*] support the use of the \p{...} escape, which allows you to use full Unicode properties. Properties are Unicode-compliant named character classes and are both more general and more self-documenting than explicit ASCII character classes. The perlunicode manpage explains the mechanism in detail and lists the available properties.
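The difference between the two kinds of class is easy to check by hand. The following is a minimal sketch, not code from the text: it uses plain lexicals instead of Readonly so that it runs without any CPAN modules, adds \A and \z anchors so the whole string has to match, and relies on a hypothetical helper, check_ident(), plus made-up sample names written with \x{...} escapes.

```perl
use strict;
use warnings;

# Plain lexicals stand in for the Readonly declarations used in the text.
# The \A and \z anchors are an addition, so that the whole string must match.
my $ASCII_IDENT   = qr/ \A [A-Z] [A-Za-z]* \z /xms;
my $UNICODE_IDENT = qr/ \A \p{Uppercase} \p{Alphabetic}* \z /xms;

# Hypothetical sample data. Non-ASCII letters are written as escapes so the
# script does not depend on the encoding of the source file.
my @samples = (
    [ 'ASCII "Zurich"'           => 'Zurich' ],
    [ 'Latin-1 E-acute + "mile"' => "\x{C9}mile" ],
    [ 'Greek "Athina"'           => "\x{391}\x{3B8}\x{3AE}\x{3BD}\x{3B1}" ],
);

# Hypothetical helper: returns two flags,
# (matched by the ASCII class, matched by the Unicode properties).
sub check_ident {
    my ($name) = @_;
    my $ascii_ok   = $name =~ $ASCII_IDENT   ? 1 : 0;
    my $unicode_ok = $name =~ $UNICODE_IDENT ? 1 : 0;
    return ($ascii_ok, $unicode_ok);
}

for my $sample (@samples) {
    my ($label, $name) = @{$sample};
    my ($ascii_ok, $unicode_ok) = check_ident($name);
    printf "%-26s ASCII class: %-8s  properties: %s\n",
        $label,
        ($ascii_ok   ? 'match' : 'no match'),
        ($unicode_ok ? 'match' : 'no match');
}
```

Only the first sample should satisfy the explicit ASCII class, whereas all three should satisfy the property-based pattern.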
So, if you're ready to concede that ASCII-centrism is a naïve façade that's gradually fading into Götterdämmerung, you might choose to bid it adiós and open your regexes to the full Unicode smörgåsbord, by changing the previous identifier regex to:
Readonly my $ALPHA_IDENT => qr/ \p{Uppercase} \p{Alphabetic}* /xms;
There are even properties to help create identifiers that follow the normal Perl conventions but are still language-independent. Instead of:
Readonly my $PERL_IDENT => qr/ [A-Za-z_] \w* /xms;
you can use:
Readonly my $PERL_IDENT => qr/ \p{ID_Start} \p{ID_Continue}* /xms;
One other particularly useful property is \p{Any}, which provides a more readable alternative to the normal dot (.) metacharacter. For example, instead of:
m/ [{] . [.] \d{2} [}] /xms;
you could write:
m/ [{] \p{Any} [.] \d{2} [}] /xms;
and leave the reader in no doubt that the second character to be matched really can be anything at all: an ASCII alphabetic, a Latin-1 superscript, an Extended Latin diacritical, a Devanagari number, an Ogham rune, or even a Bopomofo symbol.
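The identifier and \p{Any} patterns above can be exercised in the same way. This is a sketch under stated assumptions rather than code from the text: the variable names, the added \A and \z anchors, and the sample strings are all illustrative, and plain lexicals replace Readonly so the script has no CPAN dependencies.

```perl
use strict;
use warnings;

# Illustrative names; the \A and \z anchors are an addition so that
# the whole string has to match.
my $ASCII_PERL_IDENT   = qr/ \A [A-Za-z_] \w* \z /xms;
my $UNICODE_PERL_IDENT = qr/ \A \p{ID_Start} \p{ID_Continue}* \z /xms;
my $BRACED_ANY         = qr/ \A [{] \p{Any} [.] \d{2} [}] \z /xms;

# Hypothetical sample identifiers, with non-ASCII letters as escapes.
my @identifiers = (
    [ 'plain ASCII "count"'      => 'count' ],
    [ 'French "ete" with acutes' => "\x{E9}t\x{E9}" ],
    [ 'leading underscore'       => '_private' ],
);

for my $item (@identifiers) {
    my ($label, $ident) = @{$item};
    printf "%-26s ASCII: %-8s  ID_Start/ID_Continue: %s\n",
        $label,
        ($ident =~ $ASCII_PERL_IDENT   ? 'match' : 'no match'),
        ($ident =~ $UNICODE_PERL_IDENT ? 'match' : 'no match');
}

# \p{Any} accepts any single character between the brace and the dot.
my @braced = (
    [ 'ASCII letter'     => '{a.42}' ],
    [ 'Ogham letter'     => "{\x{1681}.42}" ],
    [ 'Devanagari digit' => "{\x{967}.42}" ],
    [ 'newline'          => "{\n.42}" ],
    [ 'nothing at all'   => '{.42}' ],     # no character for \p{Any} to consume
);

for my $item (@braced) {
    my ($label, $text) = @{$item};
    printf "%-26s %s\n", $label, ($text =~ $BRACED_ANY ? 'match' : 'no match');
}
```

One caveat worth checking against your own data: the underscore belongs to ID_Continue but not to ID_Start, so an identifier such as _private satisfies the ASCII pattern but not the property-based one. If leading underscores matter, the first character class would need to allow them explicitly.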