NAME

perlfaq6 - Regexps ($Revision: 1.4 $)

DESCRIPTION

This section is surprisingly small because the rest of the FAQ is littered with answers involving patterns. For example, decoding a URL and checking whether something is a number are handled with regular expressions, but those answers are found elsewhere in this document (in the section on Data and the section on Networking, to be precise).

How can I hope to use regular expressions without creating illegible and unmaintainable code?

No programming language to date stops you from writing unreadable and unmaintainable code. On the other hand, a skillful programmer can write readable code in any language.

The standard notation for regular expressions dates back to a time when short, concise notations were preferred over lengthy, verbose ones. Such notational brevity, powerful though it is, can all too easily trigger panic in the uninitiated. In Perl, we can reduce the shock value of regex-induced punctuation overload with a few simple constructs.

The /x modifier is the most important of these. It causes whitespace to be ignored in the regex (well, except in a character class), and also allows you to use normal comments there, too. As you can imagine, whitespace and comments help significantly.

Another feature that can enhance legibility is selecting your own delimiters for matching or substitution. This way an unfortunate pattern afflicted with LTS (that's Leaning Toothpick Syndrome) can be written in a variety of ways:

        if (  /^\/usr\/bin\/perl\b/ ) { ... }
        if ( m(^/usr/bin/perl\b)    ) { ... }
        if ( m{^/usr/bin/perl\b}    ) { ... }
        if ( m#^/usr/bin/perl\b#    ) { ... }

For example, contrast this apparent hiccup from your modem:

    s{<(?:[^>'"]*|".*?"|'.*?')+>}{}gs;

with its legible rewrite, derived from the striphtml program described in the FAQ section on Networking:

    s{ <                    # opening angle bracket
        (?:                 # Non-backreffing grouping paren
             [^>'"] *       # 0 or more things that are neither > nor ' nor "
                |           #    or else
             ".*?"          # a section between double quotes (stingy match)
                |           #    or else
             '.*?'          # a section between single quotes (stingy match)
        ) +                 #   all occurring one or more times
       >                    # closing angle bracket
    }{}gsx;                 # replace with nothing, i.e. delete

Ok, so it's still not quite so clear as prose, but at least now you have a chance of going back to it later and having a clue what you were trying to do.

I'm having trouble matching over more than one line. What's wrong?

You may not have more than one line in your target, or you may not have told Perl to treat your target as having more than one line. If you intend to do a multiline match, first be sure that you actually have a multiline string! There are many ways to get multiline data into a string. If you want it to happen automatically while reading input, you'll want to set $/ (probably to '' or undef) to allow you to read more than one line at a time.

You should also read the perlre manpage and decide which of /s and /m (or both) you might want to use: /s allows dot to include newline, and /m allows caret and dollar to match next to a newline, not just at the ends of the string. You just need to make sure you've actually got a multiline string in there.
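
A minimal sketch of the difference, assuming a two-line string held in a hypothetical $text variable:

    my $text = "first line\nsecond line\n";

    # Without /s, dot stops at the newline, so this cannot span lines.
    my $plain = $text =~ /first.*second/  ? "match" : "no match";

    # With /s, dot also matches the newline.
    my $dot_s = $text =~ /first.*second/s ? "match" : "no match";

    # With /m, ^ also matches just after each embedded newline.
    my @starts = $text =~ /^(\w+)/mg;

    print "plain:       $plain\n";     # plain:       no match
    print "with /s:     $dot_s\n";     # with /s:     match
    print "line starts: @starts\n";    # line starts: first second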

How can I pull out lines between two patterns that are themselves on different lines?

Here's one way, using Perl's somewhat exotic .. operator:

    perl -ne 'print if /START/ .. /END/' file1 file2 ...

If you wanted text and not lines, you would use

    perl -0777 -ne 'print "$1\n" while /START(.*?)END/gs' file1 file2 ...

But if you want nested occurrences of START through END, you'll run up against the problem described in the question in this section on matching balanced text.

I put a pattern into $/ but it didn't work. What's wrong?

$/ must be a string, not a pattern. Awk has to be better for something. :-)
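
If your separator really is a pattern, one workaround (a sketch, not the only approach) is to read the data in one piece and split it afterwards. The record format and the contents of $data below are made up for illustration:

    # Hypothetical records separated by a line of dashes of any length,
    # which a fixed string in $/ cannot describe.
    my $data = "alpha\n---\nbeta\n-----\ngamma\n";

    # With real input you might slurp a file instead:
    #     { local $/ = undef; $data = <FH>; }
    my @records = split /^-+\n/m, $data;

    print scalar(@records), " records\n";    # 3 records
    print "record: $_" foreach @records;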

How do I substitute case-insensitively on the LHS, but preserve case on the RHS?

It depends on what you mean by ``preserving case''. The following script makes the substitution have the same case, letter by letter, as the original. If the substitution has more characters than the string being substituted, the case of the last character is used for the rest of the substitution.

    # Original by Nathan Torkington, massaged by Jeffrey Friedl
    #
    sub preserve_case($$)
    {
        my ($old, $new) = @_;
        my ($state) = 0; # 0 = no change; 1 = lc; 2 = uc
        my ($i, $oldlen, $newlen, $c) = (0, length($old), length($new));
        my ($len) = $oldlen < $newlen ? $oldlen : $newlen;

        for ($i = 0; $i < $len; $i++) {
            if ($c = substr($old, $i, 1), $c =~ /[\W\d_]/) {
                $state = 0;
            } elsif (lc $c eq $c) {
                substr($new, $i, 1) = lc(substr($new, $i, 1));
                $state = 1;
            } else {
                substr($new, $i, 1) = uc(substr($new, $i, 1));
                $state = 2;
            }
        }
        # finish up with any remaining new (for when new is longer than old)
        if ($newlen > $oldlen) {
            if ($state == 1) {
                substr($new, $oldlen) = lc(substr($new, $oldlen));
            } elsif ($state == 2) {
                substr($new, $oldlen) = uc(substr($new, $oldlen));
            }
        }
        return $new;
    }

    $a = "this is a TEsT case";
    $a =~ s/(test)/preserve_case($1, "success")/gie;
    print "$a\n";

This prints:

    this is a SUcCESS case

How can I make \w match accented characters?

See the perllocale manpage.

How can I match a locale-smart version of /[a-zA-Z]/?

One alphabetic character would be /[^\W\d_]/, no matter what locale you're in. Non-alphabetics would be /[\W\d_]/ (assuming you don't consider an underscore a letter).
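
As a small illustration (the sample string is invented), the two classes split a string between them:

    my $string = "abc_123 def!";

    my @letters     = $string =~ /([^\W\d_])/g;   # a b c d e f
    my @non_letters = $string =~ /([\W\d_])/g;    # _ 1 2 3, the space, and !

    print "letters:     ", scalar(@letters), "\n";        # letters:     6
    print "non-letters: ", scalar(@non_letters), "\n";    # non-letters: 6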

How can I quote a variable to use in a regex?

The Perl parser will expand $variable and @variable references in patterns unless the pattern delimiter is a single quote. Remember, too, that the right-hand side of a s/// substitution is considered a double-quoted string (see the perlop manpage for more details). Example:

        $string = "to die";
        $lhs = "die";
        $rhs = "sleep no more";
        
        $string =~ s/$lhs/$rhs/;
        # $string is now "to sleep no more"
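
Keep in mind that regex metacharacters inside an interpolated variable remain special. If you want the variable's contents matched literally, \Q (the quotemeta escape, see the perlre manpage) quotes them. A short sketch with an invented string:

    my $string = "to die?";
    my $lhs    = "die?";
    my $rhs    = "sleep no more";

    (my $unquoted = $string) =~ s/$lhs/$rhs/;        # ? acts as a quantifier
    (my $quoted   = $string) =~ s/\Q$lhs\E/$rhs/;    # ? is matched literally

    print "$unquoted\n";    # to sleep no more?
    print "$quoted\n";      # to sleep no more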

What is /o really for?

Under the current implementation, using a variable in a pattern match forces a re-evaluation (and perhaps recompilation) each time through. The /o modifier locks in the regex the first time it's used. This always happens in a constant pattern, and in fact, the pattern was compiled into internal format at the same time your entire program was. Use of /o is irrelevant unless variable interpolation is used in the pattern, and if so, the regex engine will neither know nor care whether the variables change after the pattern is evaluated the very first time.

/o is often used to gain an extra measure of efficiency by not performing subsequent evaluations when you know it won't matter (because you know the variables won't change), or more rarely, when you don't want the regex to notice if they do.

For example, here's a ``paragrep'' program:

    $/ = '';  # paragraph mode
    $pat = shift;
    while (<>) {
        print if /$pat/o;
    }

How do I use a regular expression to strip C style comments from a file?

While this actually can be done, it's much harder than you'd think. For example, this one-liner

    perl -0777 -pe 's{/\*.*?\*/}{}gs' foo.c

will work in many but not all cases. You see, it's too simple-minded for certain kinds of C programs, in particular, those with what appear to be comments in quoted strings. For that, you'd need something like this, created by Jeffrey Friedl:

    $/ = undef;
    $_ = <>;
    s#/\*[^*]*\*+([^/*][^*]*\*+)*/|("(\\.|[^"\\])*"|'(\\.|[^'\\])*'|\n+|.[^/"'\\]*)#$2#g;
    print;

This could, of course, be more legibly written with the /x modifier, adding whitespace and comments.

Can I use Perl regular expressions to match balanced text?

No, regular expressions just aren't powerful enough. Perl regular expressions aren't strictly ``regular'', mathematically speaking, because they feature conveniences like backreferences (\1 and its ilk). But they are not the proper tool for every nail. You still need to use non-regex techniques to parse balanced text, such as the text enclosed between matching parentheses or braces, for example.

An elaborate subroutine (for 7-bit ASCII only) to pull out balanced and possibly nested single chars, like ` and ', { and }, or ( and ) can be found in CPAN/authors/id/TOMC/scripts/pull_quotes.gz .

The C::Scan module from CPAN also contains such subs for internal usage, but they are undocumented.
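
To give the flavor of a non-regex technique, here is a minimal sketch that tracks nesting depth with a counter. The first_balanced name is hypothetical, and the routine makes simplifying assumptions: only ( and ) count, and quoted or escaped parens get no special treatment.

    # Return the text inside the first balanced pair of parentheses,
    # or undef if there is no opening paren or the parens never balance.
    sub first_balanced {
        my ($text) = @_;
        my $start = index($text, '(');
        return undef if $start < 0;

        my $depth = 0;
        for (my $i = $start; $i < length($text); $i++) {
            my $c = substr($text, $i, 1);
            if    ($c eq '(') { $depth++ }
            elsif ($c eq ')') { $depth-- }
            if ($depth == 0) {
                # $i is the closing paren that matches the one at $start
                return substr($text, $start + 1, $i - $start - 1);
            }
        }
        return undef;    # ran out of text with parens still open
    }

    print first_balanced("f(a, g(b, c), d) + h(e)"), "\n";    # a, g(b, c), d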

What does it mean that regexes are greedy? How can I get around it?

Most people mean that greedy regexes match as much as they can. Technically speaking, it's actually the quantifiers (?, *, +, {}) that are greedy rather than the whole pattern; Perl prefers local greed and immediate gratification to overall greed. To get non-greedy versions of the same quantifiers, use (??, *?, +?, {}?).

An example:

        $s1 = $s2 = "I am very very cold";
        $s1 =~ s/ve.*y //;      # I am cold
        $s2 =~ s/ve.*?y //;     # I am very cold

Notice how the second substitution stopped matching as soon as it encountered ``y ''. The *? quantifier effectively tells the regular expression engine to find a match as quickly as possible and pass control on to whatever is next in line, like you would if you were playing hot potato.
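
The same difference shows up when capturing. A small sketch with an invented HTML-ish string:

    my $html = "<b>bold</b> and <i>italic</i>";

    my ($greedy) = $html =~ /<(.+)>/;     # runs on to the last > in the string
    my ($stingy) = $html =~ /<(.+?)>/;    # stops at the first >

    print "greedy: $greedy\n";    # greedy: b>bold</b> and <i>italic</i
    print "stingy: $stingy\n";    # stingy: b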

How can I do approximate matching?

See the module String::Approx available from CPAN.

How do I efficiently match many patterns at once?

The following is super-inefficient:

    while (<FH>) {
        foreach $pat (@patterns) {
            if ( /$pat/ ) {
                # do something
            }
        }
    }

Instead, you either need to use one of the experimental Regexp extension modules from CPAN (which might well be overkill for your purposes), or else put together something like this, inspired by a routine in Jeffrey Friedl's book:

    sub _bm_build {
        my $condition = shift;
        my @regex = @_;  # this MUST not be local(); need my()
        my $expr = join $condition => map { "m/\$regex[$_]/o" } (0..$#regex);
        my $match_func = eval "sub { $expr }";
        die if $@;  # propagate $@; this shouldn't happen!
        return $match_func;
    }

    sub bm_and { _bm_build('&&', @_) }
    sub bm_or  { _bm_build('||', @_) }

    $f1 = bm_and qw{
            xterm
            (?i)window
    };

    $f2 = bm_or qw{
            \b[Ff]ree\b
            \bBSD\B
            (?i)sys(tem)?\s*[V5]\b
    };

    # feed me /etc/termcap, prolly
    while ( <> ) {
        print "1: $_" if &$f1;
        print "2: $_" if &$f2;
    }

Why does using $&, $`, or $' slow my program down?

Because once Perl sees that you need one of these variables anywhere in the program, it has to provide them on each and every pattern match. The same mechanism that handles these provides for the use of $1, $2, etc., so you pay the same price for each regex that contains capturing parentheses. But if you never use $&, etc., in your script, then regexes without capturing parens won't be penalized. So avoid $&, $', and $` if you can, but if you can't (and some algorithms really appreciate them), once you've used them once, use them at will, because you've already paid the price.
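
If all you need is the text before, within, and after a match, capturing parentheses can often stand in for $`, $&, and $'. A sketch, assuming the hypothetical pattern \s*=\s* and an invented input line:

    my $line = "key = value";
    my ($before, $matched, $after);

    # (.*?) plays the role of $`, the middle group of $&, and (.*) of $'
    if ($line =~ /^(.*?)(\s*=\s*)(.*)$/s) {
        ($before, $matched, $after) = ($1, $2, $3);
        print "before: [$before]\n";     # before: [key]
        print "match:  [$matched]\n";    # match:  [ = ]
        print "after:  [$after]\n";      # after:  [value]
    }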

What good is \G in a pattern?

The notation \G is used with the /g modifier (and ignored elsewhere) in a match or substitution, and sets an anchor to just past where the last match occurred, i.e. the pos() point.

For example, suppose you had a line of text quoted in standard mail and Usenet notation (that is, with leading > characters), and you want to change each leading > into a corresponding :. You could do so in this way:

    s/^(>+)/':' x length($1)/gem;

Or, using \G, the much simpler (and faster):

    s/\G>/:/g;

A more sophisticated use might involve a tokenizer. The following example, (courtesy of Jeffrey Friedl) did not work in 5.003 due to bugs, but does work in 5.004 or better:

    while (<>) {
      chomp;
      PARSER: {
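           # /c keeps pos() in place when a match fails, so the next
           # pattern picks up where the last successful match ended
           m/ \G( \d+\b    )/gcx    && do { print "number: $1\n";  redo; };
           m/ \G( \w+      )/gcx    && do { print "word:   $1\n";  redo; };
           m/ \G( \s+      )/gcx    && do { print "space:  $1\n";  redo; };
           m/ \G( [^\w\d]+ )/gcx    && do { print "other:  $1\n";  redo; };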
      }
    }

Of course, that could have been written as

    while (<>) {
      chomp;
      PARSER: {
           # /c stops a failed match from resetting pos()
           if ( /\G( \d+\b    )/gcx ) {
                print "number: $1\n";
                redo PARSER;
           }
           if ( /\G( \w+      )/gcx ) {
                print "word: $1\n";
                redo PARSER;
           }
           if ( /\G( \s+      )/gcx ) {
                print "space: $1\n";
                redo PARSER;
           }
           if ( /\G( [^\w\d]+ )/gcx ) {
                print "other: $1\n";
                redo PARSER;
           }
      }
    }

But then you lose the vertical alignment of the patterns.

Are Perl regexes DFAs or NFAs? Are they POSIX compliant?

While it's true that Perl's patterns resemble the DFAs (deterministic finite automata) of the egrep program, they are in fact implemented as NFAs (non-deterministic finite automata) to allow backtracking and backreferencing. And they aren't POSIX-style either, because those guarantee worst-case behavior for all cases. (It seems that some people prefer guarantees of consistency, even when what's guaranteed is slowness.) See the book ``Mastering Regular Expressions'' (from O'Reilly) by Jeffrey Friedl for all the details you could ever hope to know on these matters.

What's wrong with using grep or map in a void context?

Strictly speaking, nothing. Stylistically speaking, it's not a good way to write maintainable code. That's because you're using these constructs not for their return values but rather for their side-effects, and side-effects can be mystifying. There's no void grep that's not better written as a for (well, foreach, technically) loop.
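
For instance (a contrived sketch with invented data), counting words via a side effect reads more plainly as a loop:

    my @words = qw(apple banana apple cherry);
    my %seen;

    # Void-context grep, used only for its side effect:
    #     grep { $seen{$_}++ } @words;

    # The same work as a loop, which says what it means:
    foreach my $word (@words) {
        $seen{$word}++;
    }

    print join(",", sort keys %seen), "\n";    # apple,banana,cherry
    print "apple seen $seen{apple} times\n";   # apple seen 2 times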

How can I match strings with multi-byte characters?

This is hard, and there's no good way. Perl does not directly support wide characters. It pretends that a byte and a character are synonymous. The following set of approaches was offered by Jeffrey Friedl, who has an article in issue #5 of The Perl Journal talking about this very matter.

Let's suppose you have some weird Martian encoding where pairs of ASCII uppercase encode single Martian letters (i.e. the two bytes ``CV'' make a single Martian letter, as do the two bytes ``SG'', ``VS'', ``XX'', etc.). Other bytes represent single characters, just like ASCII.

So, the string of Martian ``I am CVSGXX!'' uses 12 bytes to encode the nine characters 'I', ' ', 'a', 'm', ' ', 'CV', 'SG', 'XX', '!'.

Now, say you want to search for the single character /GX/. Perl doesn't know about Martian, so it'll find the two bytes ``GX'' in the ``I am CVSGXX!'' string, even though that character isn't there. It's a big problem.

Here are a few ways, all painful, to deal with it:

   $martian =~ s/([A-Z][A-Z])/ $1 /g; # Make sure adjacent ``martian'' bytes
                                      # are no longer adjacent.
   print "found GX!\n" if $martian =~ /GX/;

Or like this:

   @chars = $martian =~ m/([A-Z][A-Z]|[^A-Z])/g;
   # above is conceptually similar to:     @chars = $text =~ m/(.)/g;
   #
   foreach $char (@chars) {
       # parens on print keep "last" out of its argument list
       print("found GX!\n"), last if $char eq 'GX';
   }

Or like this:

   while ($martian =~ m/\G([A-Z][A-Z]|.)/gs) {  # \G probably unneeded
       print("found GX!\n"), last if $1 eq 'GX';
   }

Or like this:

   die "sorry, Perl doesn't (yet) have Martian support )-:\n";

There are many double- (and multi-) byte encodings commonly used these days, including:

   Big Five (Chinese)
   EUC-JP (Japanese)
   GB (Chinese)
   KS (Korean)
   SJIS (Microsoft Braindamage to Japanese)
   Unicode (various)
				     
Some versions of these have 1-, 2-, 3-, and 4-byte characters, all mixed.