What is the correct syntax for filtering the MIDDLE DOT Unicode character with a Perl regular expression?

Time: 2013-12-01 00:32:22

Tags: regex perl unicode

I am trying to work out the correct syntax to filter the MIDDLE DOT Unicode character (U+00B7) out of a string while preserving the original string:

     $_ =~ s/test_of_character (.*[^\x{00b7}])/$1/gi;

From the code above, I am not sure how to preserve the original string before removing the middle dots from it.

3 answers:

Answer 0 (score: 5)

To remove all Unicode MIDDLE DOT characters from a string, you can write either of the following:

s/\N{MIDDLE DOT}//g

tr/\N{MIDDLE DOT}//d
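As a minimal sketch of the two forms above (the sample string and variable names are made up for illustration), both delete every middle dot. The transliteration form also returns the number of characters it removed, which can be handy for logging:

```perl
use strict;
use warnings;
use charnames ':full';   # only needed for \N{NAME} before Perl 5.16

# Hypothetical sample text; \N{U+00B7} is MIDDLE DOT.
my $with_s  = "a\N{U+00B7}b\N{U+00B7}c";
my $with_tr = $with_s;

# Substitution: delete every MIDDLE DOT.
$with_s =~ s/\N{MIDDLE DOT}//g;

# Transliteration with /d: delete every MIDDLE DOT and
# return how many characters were removed.
my $removed = ($with_tr =~ tr/\N{MIDDLE DOT}//d);
```

Both variables end up holding the same dot-free text. For a single fixed character, either form is a reasonable choice.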

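I'm not clear what you mean by "preserve the original string", but if you want to leave $_ unchanged and remove the MIDDLE DOT characters from a copy, you can write either of the following (the second form uses the /r modifier, which requires Perl 5.14 or later):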

(my $modified = $_) =~ s/\N{MIDDLE DOT}//g

my $modified = s/\N{MIDDLE DOT}//gr

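Answer 1 (score: 3)

If you are using Perl with Unicode, you should read the manual pages, for example:

The first of these shows that you can write a Unicode code point such as U+00B7 using the notation: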

\N{U+00B7}

You can also use the Unicode character name:

\N{MIDDLE DOT}
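To check that the two notations above name the same character, a small sketch such as the following can help. It uses the standard `charnames` pragma to look up the official name of a code point; the variable names are illustrative only:

```perl
use strict;
use warnings;
use charnames ':full';   # provides charnames::viacode

my $by_code = "\N{U+00B7}";       # written as a code point
my $by_name = "\N{MIDDLE DOT}";   # written as a character name

# Both spellings yield the same one-character string.
my $same = ($by_code eq $by_name);

# Look up the official Unicode name for the code point.
my $name = charnames::viacode(0xB7);

printf "U+%04X %s\n", ord($by_code), $name;
```

The same lookup is a quick way to identify an unfamiliar character found in input data before deciding whether to filter it.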

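The rest is basic regex handling. If you need to preserve the original string, you can use the /r modifier on the substitution, provided your Perl is modern enough (it was added in Perl 5.14.0). Alternatively (for older versions of Perl), you can copy the string and edit the copy, as shown with $altans below.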

#!/usr/bin/env perl
use strict;
use warnings;
use feature 'unicode_strings';
use utf8;

binmode(STDOUT, ":utf8");

my $string = "This is some text with a ·•· middle dot or four \N{U+00B7}\N{MIDDLE DOT} in it";

print "string = $string\n";

my $answer = ($string =~ s/\N{MIDDLE DOT}//gr);
my $altans;

($altans = $string) =~ s/\N{U+00B7}//g;

# Fix grammar!
$answer =~ s/\ba\b/no/;
$answer =~ s/ or four //;

print "string = $string\n";
print "answer = $answer\n";
print "altans = $altans\n";

Output:

string = This is some text with a ·•· middle dot or four ·· in it
string = This is some text with a ·•· middle dot or four ·· in it
answer = This is some text with no • middle dot in it
altans = This is some text with a • middle dot or four  in it

Note that the "big middle dot" is U+2022, BULLET.


ikegami points out in a comment:

Note that \x{00B7} and \xB7 will match the same character as \N{U+00B7}.

Indeed, that is the case, as this extension of the code above shows:

#!/usr/bin/env perl
use strict;
use warnings;
use feature 'unicode_strings';
use utf8;

binmode(STDOUT, ":utf8");

my $string = "This is some text with a ·•· middle dot or four \N{U+00B7}\N{MIDDLE DOT} in it";

print "string = $string\n";

my $answer = ($string =~ s/\N{MIDDLE DOT}//gr);
my $altans;

($altans = $string) =~ s/\N{U+00B7}//g;

# Fix grammar!
$answer =~ s/\ba\b/no/;
$answer =~ s/ or four //;

print "string = $string\n";
print "answer = $answer\n";
print "altans = $altans\n";

my $extan1 = $string;
$extan1 =~ s/\xB7//g;
print "extan1 = $extan1\n";

my $extan2 = $string;
$extan2 =~ s/\x{00B7}//g;
$extan2 =~ s/\x{0065}//g;
$extan2 =~ s/\x{2022}//g;
print "extan2 = $extan2\n";

Output:

string = This is some text with a ·•· middle dot or four ·· in it
string = This is some text with a ·•· middle dot or four ·· in it
answer = This is some text with no • middle dot in it
altans = This is some text with a • middle dot or four  in it
extan1 = This is some text with a • middle dot or four  in it
extan2 = This is som txt with a  middl dot or four  in it

This is Perl: TMTOWTDI (There's More Than One Way To Do It)!

Answer 2 (score: 0)

Here is a general answer that uses your own regex, slightly modified:

$_ =~ s/([^\x{00b7}]*+)\x{00b7}+/$1/g;

The inverse (preferred) equivalent, which matches the dots themselves instead of the text before them, is:

$_ =~ s/\x{00b7}+//g;
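A short sketch, using a made-up sample string, suggests that the two substitutions above give the same result. Each one works on a copy, so the input stays intact. Note that the possessive quantifier `*+` in the first form needs Perl 5.10 or later:

```perl
use strict;
use warnings;

# Hypothetical sample containing three middle dots (U+00B7).
my $input = "one\x{00b7}\x{00b7}two\x{00b7}three";

# Form 1: capture the run of non-dots, then drop the dots that follow it.
(my $captured = $input) =~ s/([^\x{00b7}]*+)\x{00b7}+/$1/g;

# Form 2: delete runs of dots directly.
(my $direct = $input) =~ s/\x{00b7}+//g;

# Both copies now hold "onetwothree"; $input is unchanged.
print "same result\n" if $captured eq $direct;
```

The second form does less work and is easier to read, which is presumably why the answer prefers it.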