Question

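After a lot of searching, on Google and elsewhere, I'm posting a new question. I'm using TextWrangler to try to write a regular expression that gives me the shortest match for a multi-line pattern.

Basically,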

ہے\tVM

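is the string I'm looking for (an Arabic word separated from its part-of-speech tag by a tab). The tricky part is that I want to find every individual sentence that contains this string. Here is what I have so far: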

/(<Sentence id='\d+'>(?:[^<]|<(?!\/Sentence>))*ہے\tVM(?:[^<]|<(?!\/Sentence>))*<\/Sentence>)/

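The files I'm looking at are encoded in CML, so part of my question is whether any of you know of a CML parser for the Mac.

Another obvious option is to write a Perl script. Again, I'd appreciate any pointers toward a simple solution.

My current script is: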

use open ':encoding(utf8)';
use Encode;
binmode(STDOUT, ":utf8");
binmode(STDIN, ":utf8");

my $word = Encode::decode_utf8("ہے");

my @files = glob("*.posn");

foreach my $file (@files) {
    open FILE, "<$file" or die "Error opening file $file ($!)";
    my $file = do {local $/; <FILE>};
    close FILE or die $!;
    if ($file =~ /(<Sentence id='\d+'>(?:[^<]|<(?!\/Sentence>))*$word\tVM(?:[^<]|<(?!\/Sentence>))*<\/Sentence>)/g) {
            print STDOUT "$1\n\n\n\n";
            push(@matches, "$1\n\n");
            }
}

open(OUTPUT, ">matches.txt");
print OUTPUT "@matches";
close(OUTPUT);

Answer 1

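Your input probably contains more than one occurrence of the string, so you need to search for all of them. An `if` with `/g` evaluates the match in scalar context and therefore only ever captures the first sentence in each file.

I believe your code should look like this: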

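use strict;
use warnings;
use open ':encoding(utf8)';
use Encode;

binmode(STDOUT, ":utf8");
binmode(STDIN,  ":utf8");

my $word = Encode::decode_utf8("ہے");
my @files = glob("*.posn");
my @matches = ();

foreach my $file (@files) {
  # Slurp the whole file so that the pattern can span several lines
  open my $fh, '<', $file or die "Error opening file $file ($!)";
  my $content = do { local $/; <$fh> };
  close $fh or die $!;
  # In list context, /g returns every match instead of only the first one
  my @occurrences = $content =~ /<Sentence id='\d+'>(?:[^<]|<(?!\/Sentence>))*$word\tVM(?:[^<]|<(?!\/Sentence>))*<\/Sentence>/g;
  print STDOUT "$_\n\n\n\n" for (@occurrences);
  push (@matches, "$_\n\n") for (@occurrences);
}

open (my $out, '>', 'matches.txt') or die "Error opening matches.txt ($!)";
# Print the list itself; interpolating "@matches" would insert a space between entries
print $out @matches;
close($out) or die $!;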

Read more about regular expressions here.
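To make the scalar-versus-list-context difference concrete, here is a minimal sketch that runs the same pattern against an inline string instead of the `.posn` files. The sample sentences and the filler tokens (`w1`, `NN`, and so on) are made up for illustration and are not taken from the real data. `\Q...\E` is added around the word as a precaution, which the original pattern does not do.

```perl
use strict;
use warnings;
use utf8;                              # this source contains a literal non-ASCII word
binmode(STDOUT, ':encoding(UTF-8)');

# Hypothetical sample data: three sentences, two of which contain the target token.
my $text = join "\n",
    "<Sentence id='1'>\nw1\tNN\nہے\tVM\n</Sentence>",
    "<Sentence id='2'>\nw2\tNN\nw3\tVM\n</Sentence>",
    "<Sentence id='3'>\nہے\tVM\nw4\tNN\n</Sentence>";

my $word = "ہے";
# Same pattern as above; the (?:[^<]|<(?!/Sentence>))* parts cannot run past a closing tag.
my $re = qr{<Sentence id='\d+'>(?:[^<]|<(?!/Sentence>))*\Q$word\E\tVM(?:[^<]|<(?!/Sentence>))*</Sentence>};

# Scalar context, as in the question: the condition is tested once per file.
my $seen_by_if = 0;
$seen_by_if++ if $text =~ /$re/g;
pos($text) = undef;                    # clear the search position left behind by /g

# List context: every matching sentence is returned at once.
my @all = $text =~ /$re/g;

# Alternative: stay in scalar context but loop until the match fails.
my @via_loop;
while ($text =~ /($re)/g) {
    push @via_loop, $1;
}

printf "if: %d, list: %d, loop: %d\n", $seen_by_if, scalar(@all), scalar(@via_loop);
print "$_\n\n" for @all;
```

The `while` form is useful when the files are large and you would rather handle each sentence as it is found than collect them all first.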

Regex search across multiple lines