How can I match strings with multi-byte characters?

This is hard, and there's no good way. Perl does not directly support wide characters. It pretends that a byte and a character are synonymous. The following set of approaches was offered by Jeffrey Friedl, who has an article in issue #5 of The Perl Journal talking about this very matter.

Let's suppose you have some weird Martian encoding where pairs of ASCII uppercase letters encode single Martian letters (e.g. the two bytes ``CV'' make a single Martian letter, as do the two bytes ``SG'', ``VS'', ``XX'', etc.). Other bytes represent single characters, just like ASCII.

So, the string of Martian ``I am CVSGXX!'' uses 12 bytes to encode the nine characters 'I', ' ', 'a', 'm', ' ', 'CV', 'SG', 'XX', '!'.
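The byte and character counts above can be checked with a short sketch. The sub name martian_chars is made up for illustration, and it assumes only the toy Martian encoding described here, not any real one.

```perl
use strict;
use warnings;

# Hypothetical helper: split a Martian string into its characters.
# Assumes the toy encoding above: a pair of ASCII uppercase letters is one
# Martian letter, and any other byte (including a lone uppercase letter
# such as the 'I' in the example) is a character by itself.
sub martian_chars {
    my ($martian) = @_;
    return $martian =~ m/([A-Z][A-Z]|.)/gs;
}

my $martian = "I am CVSGXX!";
my @chars   = martian_chars($martian);

# length() counts bytes here; scalar(@chars) counts Martian characters.
printf "%d bytes, %d characters\n", length($martian), scalar(@chars);
# prints "12 bytes, 9 characters"
```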

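Now, say you want to search for the single character /SG/. Perl doesn't know about Martian, so it just looks for the two bytes ``SG'' anywhere in the string. That happens to give the right answer for ``I am CVSGXX!'', but a search for /GX/ succeeds too: it matches the second byte of ``SG'' followed by the first byte of ``XX'', even though that character isn't there. It's a big problem.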
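To see a byte-level match land in the middle of a character, here is a small sketch. It searches for the bytes ``GX'', which exist in the example string only as the tail of one Martian letter and the head of the next. The helper martian_char_starts is a made-up name and assumes the toy encoding above.

```perl
use strict;
use warnings;

# Hypothetical helper: byte offsets at which a Martian character starts,
# assuming the toy encoding described above.
sub martian_char_starts {
    my ($martian) = @_;
    my %starts;
    while ($martian =~ m/\G(?:[A-Z][A-Z]|.)/gs) {
        $starts{ $-[0] } = 1;    # $-[0] is where this character began
    }
    return \%starts;
}

my $martian = "I am CVSGXX!";
my $starts  = martian_char_starts($martian);

if ($martian =~ /GX/) {
    my $offset = $-[0];          # byte offset where the plain match began
    printf "bytes GX found at offset %d, which %s the start of a character\n",
        $offset, $starts->{$offset} ? "is" : "is not";
}
# prints "bytes GX found at offset 8, which is not the start of a character"
```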

Here are a few ways, all painful, to deal with it:

   $martian =~ s/([A-Z][A-Z])/ $1 /g; # Make sure adjacent ``martian'' bytes
                                      # are no longer adjacent.
   print "found SG!\n" if $martian =~ /SG/;

Or like this:

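   @chars = $martian =~ m/([A-Z][A-Z]|[^A-Z])/g;
   # above is conceptually similar to:     @chars = $text =~ m/(.)/g;
   # (a lone uppercase byte, such as the 'I' in the example, is skipped)
   foreach $char (@chars) {
       if ($char eq 'SG') {
           # "print ..., last" would exit the loop before printing
           print "found SG!\n";
           last;
       }
   }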

Or like this:

   while ($martian =~ m/\G([A-Z][A-Z]|.)/gs) {  # \G probably unneeded
       if ($1 eq 'SG') {    # print, then stop; "print ..., last" never prints
           print "found SG!\n";
           last;
       }
   }

Or like this:

   die "sorry, Perl doesn't (yet) have Martian support )-:\n";
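If you need this in more than one place, the \G loop can be wrapped in a sub. The following is only a sketch under the toy Martian encoding; martian_index is a hypothetical name, not something Perl provides.

```perl
use strict;
use warnings;

# Hypothetical helper: return the 0-based character index of $want in
# $martian, or -1 if that Martian character does not occur.
sub martian_index {
    my ($martian, $want) = @_;
    my $i = 0;
    # Consume exactly one Martian character per iteration, so a match can
    # never start in the middle of a two-byte letter.
    while ($martian =~ m/\G([A-Z][A-Z]|.)/gs) {
        return $i if $1 eq $want;
        $i++;
    }
    return -1;
}

print martian_index("I am CVSGXX!", "SG"), "\n";   # prints 6
print martian_index("I am CVSGXX!", "GX"), "\n";   # prints -1
```

Returning an index rather than a true/false value lets the caller tell "found at the first character" apart from "not found".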

There are many double- (and multi-) byte encodings commonly used these days, including:

   Big Five (Chinese)
   EUC-JP (Japanese)
   GB (Chinese)
   KS (Korean)
   SJIS (Microsoft Braindamage to Japanese)
   Unicode (various)
				     
Some versions of these have 1-, 2-, 3-, and 4-byte characters, all mixed.
