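I want to match comments in Perl. A # inside a string is not a comment. Below is an example; every string and every comment needs to be captured so that it can be highlighted later.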
# this is a comment, should be matched.
# # "I am not a string" . 'because I am inside a comment'
my $string = " #I am not a comment, because I am quoted";
my $another_string = "I am a multiline string with # on
each line #, have fun!";
my $descap_string = "I am a \ escaped \" \"string"; # and some comments;
my $sescap_string = 'I am a \ escaped \' \'string'; # and some comments;
my $empty_d ="";
my $empty_s ='';
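I have tried many things, but I could not find a solution that handles all of these cases.
Answer 0 (score: 2)
To do this, you only need to rely on the fact that the source is scanned in order. Basically, come up with a regex for each kind of quoted string and one for comments, and put them in an alternation (an "or" list) inside a single regex. Whichever construct starts first consumes its whole span, so a # inside a string, or a quote inside a comment, is never examined on its own.
Here is what I am talking about: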
use strict;
use warnings;
my $dquo_re = qr{...};
my $squo_re = qr{...};
my $comment_re = qr{...};
my $src = do {local $/; <DATA>};
while ($src =~ /($dquo_re)|($squo_re)|($comment_re)/g) {
    if (defined $1) {
        print "Double quote found: $1\n";
    } elsif (defined $2) {
        print "Single quote found: $2\n";
    } elsif (defined $3) {
        print "Comment found: $3\n";
    }
}
__DATA__
# this is a comment, should be matched.
# "I am not a string" . 'because I am inside a comment'
my $string = " #I am not a comment, because I am quoted";
my $another_string = "I am a multiline string with # on
each line #, have fun!";
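Update
Since you have shown your own work and come up with your own solution, I will show three regexes that match most single-quoted strings, double-quoted strings, and comments.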
my $dquo_re = qr{"(?:(?>[^"\\]+)|\\.)*"};
my $squo_re = qr{'(?:(?>[^'\\]+)|\\.)*'};
my $comment_re = qr{(?<!\$)#.*};
Output:
Comment found: # this is a comment, should be matched.
Comment found: # "I am not a string" . 'because I am inside a comment'
Double quote found: " #I am not a comment, because I am quoted"
Double quote found: "I am a multiline string with # on
each line #, have fun!"
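Since the goal in the question is to highlight the matches later, it may help to record where each match sits rather than only printing it. The following is a minimal sketch under that assumption: it reuses the three patterns above unchanged, and the helper name find_spans and its record layout are made up for illustration.

```perl
use strict;
use warnings;

# The three patterns from the answer above, unchanged.
my $dquo_re    = qr{"(?:(?>[^"\\]+)|\\.)*"};
my $squo_re    = qr{'(?:(?>[^'\\]+)|\\.)*'};
my $comment_re = qr{(?<!\$)#.*};

# Hypothetical helper: returns a list of [kind, start, length, text] records.
sub find_spans {
    my ($src) = @_;
    my @spans;
    while ($src =~ /($dquo_re)|($squo_re)|($comment_re)/g) {
        my $kind = defined $1 ? 'dquo' : defined $2 ? 'squo' : 'comment';
        # $-[0] and $+[0] hold the start and end offsets of the whole match.
        my ($start, $end) = ($-[0], $+[0]);
        push @spans, [ $kind, $start, $end - $start, substr($src, $start, $end - $start) ];
    }
    return @spans;
}

my $code = qq{my \$s = " # not a comment"; # real comment\n};
printf "%-7s at %2d (+%d): %s\n", @$_ for find_spans($code);
```

A highlighter could then wrap each recorded span in whatever markup it uses, working from the last span to the first so that earlier offsets stay valid.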
That said, the most thorough approach is to use PPI:
use strict;
use warnings;
use PPI;
my $src = do {local $/; <DATA>};
# Load a document
my $doc = PPI::Document->new( \$src );
my $matches = $doc->find(sub{
    grep {$_[1]->isa("PPI::Token::$_")} qw(Comment Quote)
});
for (@$matches) {
    if ($_->isa('PPI::Token::Comment')) {
        print "Comment: ", $_->content;
    } elsif ($_->isa('PPI::Token::Quote')) {
        print "Quote: ", $_->content, "\n";
    }
}
__DATA__
# this is a comment, should be matched.
# "I am not a string" . 'because I am inside a comment'
my $string = " #I am not a comment, because I am quoted";
my $another_string = "I am a multiline string with # on
each line #, have fun!";
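Answer 1 (score: 1)
I eventually realized that this may be too hard, if not impossible, to do with a regex alone, so I started working on an ordinary script instead.
It turns out to be quite simple with the index and substr functions.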
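As a minimal sketch of the idea (not the script itself): index returns the offset of the next occurrence of a substring at or after a given position, or -1 when there is none, and substr extracts a slice by offset. The helper name next_delim below is hypothetical and exists only for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: find whichever of ', " or # occurs first at or
# after $from. Returns (char, offset), or an empty list if none is left.
sub next_delim {
    my ($text, $from) = @_;
    my ($best_char, $best_pos);
    for my $char ("'", '"', '#') {
        my $pos = index $text, $char, $from;   # -1 means "not found"
        next if $pos == -1;
        ($best_char, $best_pos) = ($char, $pos)
            if !defined $best_pos || $pos < $best_pos;
    }
    return defined $best_pos ? ($best_char, $best_pos) : ();
}

my $line = q{print "hi"; # greet};
my ($char, $pos) = next_delim($line, 0);
# Show the delimiter found and the rest of the line from that offset.
print "$char at $pos: ", substr($line, $pos), "\n";
```

Once the first delimiter is known, the scanner only has to find where that construct ends and then resume the search from there.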
Here is version 3 of my code (thanks to Miller for pointing out some bugs):
#!/usr/bin/env perl
use strict;
use warnings;
my $src = do {local $/; <DATA>};
my @strings = ();
my @comments = ();
my $off_set = 0;
my $end_index = 0;
while (my ($char, $start_index) = &next_char($off_set)) {
    last if ($char eq "" && $start_index == -1);
    if ($char eq '#') {
        &capture_comment($start_index);
    } elsif (($char eq '"') || ($char eq "'")) {
        &capture_string($char, $start_index, $end_index);
    }
}
print "[Strings]\n";
foreach my $item (@strings) {
    print "$item\n";
}
print "[Comments]\n";
foreach my $item (@comments) {
    print "$item";
}
sub capture_comment($) {
    my $start_index = shift;
    # A '#' directly after '$' is the $#array last-index sigil, not a comment.
    my $char_before = $start_index > 0 ? substr($src, $start_index-1, 1) : '';
    if ($char_before ne "\$") {
        $end_index = index $src, "\n", $start_index + 1;
        # A comment on the final line may have no trailing newline.
        $end_index = length($src) - 1 if $end_index == -1;
        push @comments, substr($src, $start_index, $end_index-$start_index+1);
        $off_set = $end_index + 1;
    } else {
        $off_set = $start_index + 1;
    }
}
sub capture_string($ $ $) {
    my $quote = shift;
    my $start_index = shift;
    my $end_index = shift;
    $end_index = index ($src, $quote, $start_index+1);
    CHECK_BACKSLASH:
    my $char_before = substr $src, $end_index-1, 1;
    # print "\$char_before is $char_before\n";
    if ($char_before eq '\\') {
        # print "There is a \\ before $quote\n";
        # print "end index before checking backslash $end_index \n";
        if (&odd_number_backslash($char_before, $start_index, $end_index) == 1) {
            # print "end index after checking backslash $end_index \n";
            $end_index = index $src, $quote, $end_index + 1;
            # print "end index after checking backslash and another index $end_index \n";
            goto CHECK_BACKSLASH;
        }
    }
    push @strings, substr($src, $start_index, $end_index-$start_index+1);
    $off_set = $end_index + 1;
}
sub odd_number_backslash($ $ $) {
    my $char_before = shift;
    my $start_index = shift;
    my $end_index = shift;
    my $count = 0;
    if ($char_before eq '\\') {
        my $ts = substr $src, $start_index, $end_index-$start_index;
        # print "\$ts is $ts\n";
        while ($count <= length $ts) {
            if (chop $ts eq '\\') {
                $count++;
            } else {
                last;
            }
        }
        # print "\$count is $count\n";
        return ($count % 2);
    } else {
        # print "else \$count is $count\n";
        return 1;
    }
}
sub next_char {
    my %has;
    my $position = shift;
    my $s_index = index $src, "'", $position;
    my $d_index = index $src, '"', $position;
    my $c_index = index $src, '#', $position;
    return ("", -1) if ($s_index == -1 &&
                        $d_index == -1 &&
                        $c_index == -1);
    $has{$s_index} = "'" if ($s_index >= 0);
    $has{$d_index} = '"' if ($d_index >= 0);
    $has{$c_index} = '#' if ($c_index >= 0);
    my @sorted_keys = sort { $a <=> $b} keys %has;
    # print "Next char is $has{$sorted_keys[0]}, and position is $sorted_keys[0]\n";
    return ($has{$sorted_keys[0]}, $sorted_keys[0]);
}
__DATA__
my $string = "this is a \" string";
my $windows_path = "C:\\somewhere\\not\\important\\"; # and a comment " yep
# this is a comment, should be matched.
# # "I am not a string" . 'because I am inside a comment'
my $string = " #I am not a comment, because I am quoted";
my $another_string = "I am a multiline string with # on
each line #, have fun!";
my @list = (0..99);
print $#list;
my $descap_string = "I am a \ escaped \" \"string"; # and some comments after double;
my $sescap_string = 'I am a \ escaped \' \'string'; # and some comments after single;
my $sescap_string = 'I am a \ escaped \' \'\'\'\'\\'; # and some ' comments by Miller;
my $windows_path = "C:\\somewhere\\not\\important\\"; # and a comment ", yep
my @array = (1..12);
my $empty_d ="";
my $empty_s ='';
Output
[Strings]
"this is a \" string"
"C:\\somewhere\\not\\important\\"
" #I am not a comment, because I am quoted"
"I am a multiline string with # on
each line #, have fun!"
"I am a \ escaped \" \"string"
'I am a \ escaped \' \'string'
'I am a \ escaped \' \'\'\'\'\\'
"C:\\somewhere\\not\\important\\"
""
''
[Comments]
# and a comment " yep
# this is a comment, should be matched.
# # "I am not a string" . 'because I am inside a comment'
# and some comments after double;
# and some comments after single;
# and some ' comments by Miller;
# and a comment ", yep
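The part of capture_string that is easiest to get wrong is deciding whether a candidate closing quote is escaped: it is escaped exactly when it is preceded by an odd number of consecutive backslashes. A compact sketch of that check on its own, assuming a standalone helper (is_escaped is a made-up name and is not part of the script above):

```perl
use strict;
use warnings;

# Hypothetical helper: true if the character at offset $pos in $text is
# preceded by an odd number of consecutive backslashes (i.e. is escaped).
sub is_escaped {
    my ($text, $pos) = @_;
    my $before = substr $text, 0, $pos;
    my ($slashes) = $before =~ /(\\*)\z/;   # trailing run of backslashes
    return length($slashes) % 2;
}

my $bs = "\\";   # a single backslash character
for my $n (0 .. 3) {
    # Build x, then $n backslashes, then a double quote.
    my $text  = 'x' . ($bs x $n) . '"';
    my $state = is_escaped($text, length($text) - 1) ? 'escaped' : 'closing';
    printf "%-6s quote is %s\n", $text, $state;
}
```

This is why a string such as "C:\\somewhere\\" in the test data ends at the quote right after the doubled backslash, while \" inside a string does not end it.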