The standard notation for regular expressions dates back to a time when short, concise notations were preferred over lengthy, verbose ones. Such notational brevity, powerful though it is, can all too easily trigger panic in the uninitiated. In Perl, we can reduce the shock value of regex-induced punctuation overload with a few simple constructs.
The /x modifier is the most important of these. It causes whitespace in the regex to be ignored (well, except in a character class), and also allows you to use normal comments there. As you can imagine, whitespace and comments help significantly.
Another feature that can enhance legibility is selecting your own delimiters for matching or substitution. This way an unfortunate pattern afflicted with LTS (that's Leaning Toothpick Syndrome) can be written in a variety of ways:
    if ( /^\/usr\/bin\/perl\b/ ) { ... }
    if ( m(^/usr/bin/perl\b) )   { ... }
    if ( m{^/usr/bin/perl\b} )   { ... }
    if ( m#^/usr/bin/perl\b# )   { ... }
For example, contrast this apparent hiccup from your modem:
    s{<(?:[^>'"]*|".*?"|'.*?')+>}{}gs;
with its legible rewrite, derived from the striphtml program described in the FAQ section on Networking:
    s{ <                    # opening angle bracket
        (?:                 # Non-backreffing grouping paren
             [^>'"] *       # 0 or more things that are neither > nor ' nor "
                |           #    or else
             ".*?"          # a section between double quotes (stingy match)
                |           #    or else
             '.*?'          # a section between single quotes (stingy match)
        ) +                 #   all occurring one or more times
       >                    # closing angle bracket
    }{}gsx;                 # replace with nothing, i.e. delete
Ok, so it's still not quite so clear as prose, but at least now you have a chance of going back to it later and having a clue what you were trying to do.
You should also read the perlre manpage and decide which of /s and /m (or both) you might want to use: /s allows dot to match newline, and /m allows caret and dollar to match next to a newline, not just at the ends of the string. You just need to make sure you've actually got a multiline string in there.
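As a rough illustration of the difference, here is a minimal sketch. The sample string and the variable names are made up for the example.

```perl
use strict;
use warnings;

my $text = "first line\nsecond line\n";

# Without /s, dot stops at a newline, so the match cannot span two lines.
my $plain  = $text =~ /first.*second/  ? 1 : 0;

# With /s, dot also matches newline.
my $with_s = $text =~ /first.*second/s ? 1 : 0;

# Without /m, ^ anchors only at the very start of the string.
my $no_m   = $text =~ /^second/        ? 1 : 0;

# With /m, ^ also matches right after an embedded newline.
my $with_m = $text =~ /^second/m       ? 1 : 0;

print "plain=$plain s=$with_s no_m=$no_m m=$with_m\n";
```

Only the /s and /m variants match here, because the string really does contain more than one line.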
To pull out the lines between two patterns, you can use Perl's .. operator:
perl -ne 'print if /START/ .. /END/' file1 file2 ...
If you wanted text and not lines, you would use
    perl -0777 -ne 'print "$1\n" while /START(.*?)END/gs' file1 file2 ...
But if you want nested occurrences of START through END, you'll run up against the problem described in the question in this section on matching balanced text.
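Inside a script, the same two ideas might look like the sketch below. The input lines are invented, and START and END stand in for whatever patterns you really have.

```perl
use strict;
use warnings;

# Hypothetical input; substitute your own data and patterns.
my @lines = ("one\n", "START\n", "two\n", "END\n", "three\n");

# Line-oriented: in scalar context .. is false until the left pattern
# matches, then stays true through the line matching the right pattern.
my @between;
for (@lines) {
    push @between, $_ if /START/ .. /END/;
}

# Text-oriented: treat everything as one string and capture what lies
# between the markers (non-greedy, with /s so dot crosses newlines).
my $text   = join '', @lines;
my @chunks = $text =~ /START(.*?)END/gs;

print "lines: ", scalar(@between), ", chunks: ", scalar(@chunks), "\n";
```

Note that the line-oriented version keeps the START and END lines themselves, while the text-oriented version keeps only what sits between the markers.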
    # Original by Nathan Torkington, massaged by Jeffrey Friedl
    #
    sub preserve_case($$)
    {
        my ($old, $new) = @_;
        my ($state) = 0; # 0 = no change; 1 = lc; 2 = uc
        my ($i, $oldlen, $newlen, $c) = (0, length($old), length($new));
        my ($len) = $oldlen < $newlen ? $oldlen : $newlen;

        for ($i = 0; $i < $len; $i++) {
            if ($c = substr($old, $i, 1), $c =~ /[\W\d_]/) {
                $state = 0;
            } elsif (lc $c eq $c) {
                substr($new, $i, 1) = lc(substr($new, $i, 1));
                $state = 1;
            } else {
                substr($new, $i, 1) = uc(substr($new, $i, 1));
                $state = 2;
            }
        }
        # finish up with any remaining new (for when new is longer than old)
        if ($newlen > $oldlen) {
            if ($state == 1) {
                substr($new, $oldlen) = lc(substr($new, $oldlen));
            } elsif ($state == 2) {
                substr($new, $oldlen) = uc(substr($new, $oldlen));
            }
        }
        return $new;
    }

    $a = "this is a TEsT case";
    $a =~ s/(test)/preserve_case($1, "success")/gie;
    print "$a\n";
This prints:
    this is a SUcCESS case
One alphabetic character would be /[^\W\d_]/, no matter what locale you're in. Non-alphabetics would be /[\W\d_]/ (assuming you don't consider an underscore a letter).
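A quick way to convince yourself is to count both kinds of character in a test string. This is only a sketch, and the sample input is made up.

```perl
use strict;
use warnings;

my $sample = "abc_123 XYZ!";    # invented test input

# Word characters minus digits and underscore, i.e. letters.
my @letters     = $sample =~ /[^\W\d_]/g;

# Everything else: non-word characters, digits, and underscore.
my @non_letters = $sample =~ /[\W\d_]/g;

print scalar(@letters), " letters, ", scalar(@non_letters), " others\n";
```

Every character falls into exactly one of the two classes, so the two counts add up to the length of the string.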
    $string = "to die";
    $lhs = "die";
    $rhs = "sleep no more";
    $string =~ s/$lhs/$rhs/;
    # $string is now "to sleep no more"
The /o modifier locks in the regex the first time it's used. This always happens in a constant pattern, and in fact, the pattern was compiled into internal format at the same time your entire program was. Use of /o is irrelevant unless variable interpolation is used in the pattern, and if so, the regex engine will neither know nor care whether the variables change after the pattern is evaluated the very first time.
/o is often used to gain an extra measure of efficiency by not performing subsequent evaluations when you know it won't matter (because you know the variables won't change), or more rarely, when you don't want the regex to notice if they do.
For example, here's a ``paragrep'' program:
    $/ = '';  # paragraph mode
    $pat = shift;
    while (<>) {
        print if /$pat/o;
    }
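The "won't notice if they change" part is easy to trip over, so here is a small sketch of it. The words and variable names are invented for the demonstration.

```perl
use strict;
use warnings;

my @matched;
for my $pat ('cat', 'dog') {
    for my $word ('cat', 'dog') {
        # With /o the pattern is interpolated on first use (when $pat
        # is 'cat') and $pat is never consulted again afterwards.
        push @matched, $word if $word =~ /$pat/o;
    }
}
print "@matched\n";
```

Because the pattern stays locked to its first value, 'dog' should never be matched here, even on the second pass of the outer loop. Drop the /o and each pass uses the current value of $pat instead.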
perl -0777 -pe 's{/\*.*?\*/}{}gs' foo.c
will work in many but not all cases. You see, it's too simple-minded for certain kinds of C programs, in particular, those with what appear to be comments in quoted strings. For that, you'd need something like this, created by Jeffrey Friedl:
    $/ = undef;
    $_ = <>;
    s#/\*[^*]*\*+([^/*][^*]*\*+)*/|("(\\.|[^"\\])*"|'(\\.|[^'\\])*'|\n+|.[^/"'\\]*)#$2#g;
    print;
This could, of course, be more legibly written with the /x modifier, adding whitespace and comments.
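One possible /x rendering is sketched below. It is a rearrangement of the same pattern rather than anyone's official version: the wrapper sub strip_c_comments is a made-up name, the inner groups are made non-capturing so the text to keep lands in the first group, and /e is used so that an unset group becomes an empty string.

```perl
use strict;
use warnings;

sub strip_c_comments {
    my ($src) = @_;
    $src =~ s{
        /\*                          # start of a comment
        [^*]* \*+                    # body up to a run of stars
        (?: [^/*] [^*]* \*+ )*       # more body while the stars are not followed by a slash
        /                            # closing slash
      |
        (                            # group 1: anything that is not a comment
            " (?: \\. | [^"\\] )* "  # double-quoted string
          | ' (?: \\. | [^'\\] )* '  # single-quoted character constant
          | \n+                      # blank lines
          | . [^/"'\\]*              # any other run of text
        )
    }{ defined $1 ? $1 : "" }gex;
    return $src;
}

print strip_c_comments(qq{int x; /* c */ char *s = "/* not */";\n});
```

The comment between the two declarations goes away, while the comment-like text inside the string literal is left alone.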
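Perl regular expressions offer conveniences that go beyond classical regular expressions, such as backreferences (\1 and its ilk). But they are not the proper tool for every nail. You still need to use non-regex techniques to parse balanced text, such as the text enclosed between matching parentheses or braces, for example.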
An elaborate subroutine (for 7-bit ASCII only) to pull out balanced and possibly nested single chars, like ` and ', { and }, or ( and ) can be found in CPAN/authors/id/TOMC/scripts/pull_quotes.gz .
The C::Scan module from CPAN also contains such subs for internal usage, but they are undocumented.
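To give a flavor of the non-regex approach, here is a much smaller sketch that just counts nesting depth. It is not the pull_quotes routine; first_balanced is a hypothetical helper that assumes the opening and closing delimiters are two different single characters and that there is no quoting or escaping to worry about.

```perl
use strict;
use warnings;

# Return the first balanced group in $text, delimiters included,
# or nothing if no complete group is found.
sub first_balanced {
    my ($text, $open, $close) = @_;
    my $start = index($text, $open);
    return if $start < 0;

    my $depth = 0;
    for my $i ($start .. length($text) - 1) {
        my $c = substr($text, $i, 1);
        if    ($c eq $open)  { $depth++ }
        elsif ($c eq $close) { $depth-- }
        # Depth returns to zero exactly when the first group closes.
        return substr($text, $start, $i - $start + 1) if $depth == 0;
    }
    return;    # ran out of text before the group closed
}

print first_balanced("f(a, g(b, c)) + h(d)", "(", ")"), "\n";
```

The counter is what a plain pattern lacks: it remembers how many openers are still waiting for their closers.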
It's the quantifiers (?, *, +, {}) that are greedy rather than the whole pattern; Perl prefers local greed and immediate gratification to overall greed. To get non-greedy versions of the same quantifiers, use (??, *?, +?, {}?).
An example:
    $s1 = $s2 = "I am very very cold";
    $s1 =~ s/ve.*y //;      # I am cold
    $s2 =~ s/ve.*?y //;     # I am very cold
Notice how the second substitution stopped matching as soon as it encountered ``y ''. The *? quantifier effectively tells the regular expression engine to find a match as quickly as possible and pass control on to whatever is next in line, like you would if you were playing hot potato.
    while (<FH>) {
        foreach $pat (@patterns) {
            if ( /$pat/ ) {
                # do something
            }
        }
    }
Instead, you either need to use one of the experimental Regexp extension modules from CPAN (which might well be overkill for your purposes), or else put together something like this, inspired by a routine in Jeffrey Friedl's book:
    sub _bm_build {
        my $condition = shift;
        my @regex = @_;  # this MUST not be local(); need my()
        my $expr = join $condition => map { "m/\$regex[$_]/o" } (0..$#regex);
        my $match_func = eval "sub { $expr }";
        die if $@;  # propagate $@; this shouldn't happen!
        return $match_func;
    }

    sub bm_and { _bm_build('&&', @_) }
    sub bm_or  { _bm_build('||', @_) }

    $f1 = bm_and qw{
            xterm
            (?i)window
    };

    $f2 = bm_or qw{
            \b[Ff]ree\b
            \bBSD\B
            (?i)sys(tem)?\s*[V5]\b
    };

    # feed me /etc/termcap, prolly
    while ( <> ) {
        print "1: $_" if &$f1;
        print "2: $_" if &$f2;
    }
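If your Perl has the qr// operator (an assumption: it arrived in 5.005, later than the releases discussed elsewhere here), a similar effect can be had without eval. The sketch below is an alternative take, not the routine from the book; make_matcher and its 'and'/'or' argument are made-up names.

```perl
use strict;
use warnings;

# Build a closure that tests a line against precompiled patterns.
# $mode is 'and' (all must match) or 'or' (any may match).
sub make_matcher {
    my ($mode, @patterns) = @_;
    my @compiled = map { qr/$_/ } @patterns;    # compile each pattern once
    return sub {
        my ($line) = @_;
        if ($mode eq 'and') {
            for my $re (@compiled) { return 0 unless $line =~ $re }
            return 1;
        }
        for my $re (@compiled) { return 1 if $line =~ $re }
        return 0;
    };
}

my $both   = make_matcher('and', 'xterm', '(?i)window');
my $either = make_matcher('or',  '\b[Ff]ree\b', '(?i)sys(tem)?\s*[V5]\b');

print "both\n"   if $both->("xterm with a Window manager");
print "either\n" if $either->("System V release 4");
```

The patterns are still compiled only once each, and the matcher takes the line as an argument instead of relying on $_.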
\G is used with the /g modifier (and ignored elsewhere) in a match or substitution, and sets an anchor to just past where the last match occurred, i.e. the pos() point.
For example, suppose you had a line of text quoted in standard mail and Usenet notation (that is, with leading > characters), and you want to change each leading > into a corresponding :. You could do so in this way:
s/^(>+)/':' x length($1)/gem;
Or, using \G, the much simpler (and faster):
s/\G>/:/g;
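To check that the two substitutions agree, you might run both on the same line. This is only a sketch with an invented sample line; note that a > in the middle of the line is deliberately left alone by both.

```perl
use strict;
use warnings;

my $line = ">>> is x > y?";    # made-up quoted line

# Count the leading > characters and emit that many colons.
(my $with_caret = $line) =~ s/^(>+)/':' x length($1)/gem;

# \G ties each attempt to the spot where the previous match ended, so
# the substitution stops at the first character that is not a >.
(my $with_G = $line) =~ s/\G>/:/g;

print "$with_caret\n$with_G\n";
```

Both variables end up holding the line with its three leading > characters turned into colons.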
A more sophisticated use might involve a tokenizer. The following example (courtesy of Jeffrey Friedl) did not work in 5.003 due to bugs, but does work in 5.004 or better:
    while (<>) {
        chomp;
        PARSER: {
            # /c keeps pos() in place when a match fails, so the next
            # pattern is tried at the same spot instead of at the start
            m/ \G( \d+\b    )/gcx && do { print "number: $1\n"; redo; };
            m/ \G( \w+      )/gcx && do { print "word:   $1\n"; redo; };
            m/ \G( \s+      )/gcx && do { print "space:  $1\n"; redo; };
            m/ \G( [^\w\d]+ )/gcx && do { print "other:  $1\n"; redo; };
        }
    }
Of course, that could have been written as
    while (<>) {
        chomp;
        PARSER: {
            if ( /\G( \d+\b    )/gcx ) {
                print "number: $1\n";
                redo PARSER;
            }
            if ( /\G( \w+      )/gcx ) {
                print "word: $1\n";
                redo PARSER;
            }
            if ( /\G( \s+      )/gcx ) {
                print "space: $1\n";
                redo PARSER;
            }
            if ( /\G( [^\w\d]+ )/gcx ) {
                print "other: $1\n";
                redo PARSER;
            }
        }
    }
But then you lose the vertical alignment of the patterns.
for (well, foreach, technically) loop.
Let's suppose you have some weird Martian encoding where pairs of ASCII uppercase encode single Martian letters (i.e. the two bytes ``CV'' make a single Martian letter, as do the two bytes ``SG'', ``VS'', ``XX'', etc.). Other bytes represent single characters, just like ASCII.
So, the string of Martian ``I am CVSGXX!'' uses 12 bytes to encode the nine characters 'I', ' ', 'a', 'm', ' ', 'CV', 'SG', 'XX', '!'.
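Now, say you want to search for the single character /GX/. Perl doesn't know about Martian, so it'll find the two bytes ``GX'' in the ``I am CVSGXX!'' string, even though that character isn't there: it only looks that way because ``SG'' sits next to ``XX''. It's a big problem. A correct search has to find real characters such as ``SG'' without being fooled by byte pairs that straddle two characters.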
Here are a few ways, all painful, to deal with it:
    $martian =~ s/([A-Z][A-Z])/ $1 /g; # Make sure adjacent ``martian'' bytes
                                       # are no longer adjacent.
    print "found SG!\n" if $martian =~ /SG/;
Or like this:
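    @chars = $martian =~ m/([A-Z][A-Z]|[^A-Z])/g;
    # above is conceptually similar to:     @chars = $text =~ m/(.)/g;
    #
    foreach $char (@chars) {
        # print and last are separate statements: as arguments to
        # print, last would leave the loop before anything was printed
        if ($char eq 'SG') { print "found SG!\n"; last; }
    }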
Or like this:
    while ($martian =~ m/\G([A-Z][A-Z]|.)/gs) {  # \G probably unneeded
        # separate statements, so the message prints before we leave
        if ($1 eq 'SG') { print "found SG!\n"; last; }
    }
Or like this:
die "sorry, Perl doesn't (yet) have Martian support )-:\n";
There are many double- (and multi-) byte encodings commonly used these days, including:
    Big Five (Chinese)
    EUC-JP (Japanese)
    GB (Chinese)
    KS (Korean)
    SJIS (Microsoft Braindamage to Japanese)
    Unicode (various)

Some versions of these have 1-, 2-, 3-, and 4-byte characters, all mixed.