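How can I find mixed-case strings with a Perl regex?

Time: 2009-12-08 15:11:59

Tags: regex perl ack

I am trying to filter thousands of files, looking for those that contain string constants with mixed case. Such strings may be padded with spaces, but may not themselves contain spaces. So the following (each containing an uppercase character) are matches: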

"  AString "   // leading and trailing spaces together allowed
"AString "     // trailing spaces allowed
"  AString"    // leading spaces allowed
"newString03"  // numeric chars allowed
"!stringBIG?"  // non-alphanumeric chars allowed
"R"            // Single UC is a match

But these are not:

"A String" // not a match because it contains an embedded space
"Foo bar baz" // does not match due to multiple whitespace interruptions
"a_string" // not a match because there are no UC chars

I still want to match lines that contain both kinds of string:

"ABigString", "a sentence fragment" // need to catch so I find the first case...

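I want to use a Perl regexp, preferably driven by the ack command-line tool. Obviously, \w and \W are not going to work. It seems that \S should match the non-space characters. I can't seem to figure out how to embed the "at least one uppercase character per string" requirement...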

ack --match '\"\s*\S+\s*\"'

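is the closest I've gotten. I need something to replace the \S+ that captures the "at least one uppercase (ASCII) character, anywhere in the non-whitespace string" requirement.

This is straightforward to program in C/C++ (and yes, in Perl, procedurally, without resorting to regexes); I'm just trying to figure out whether there is a regex that can do the same job.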

2 answers:

Answer 0: (score: 7)

The following pattern passes all of your tests:

qr/
  "      # opening double quote

  (?!    # filter out strings with internal spaces
     [^"]*   # zero or more non-quotes
     [^"\s]  # neither a quote nor whitespace
     \s+     # internal whitespace
     [^"\s]  # another non-quote, non-whitespace character
  )

  [^"]*  # zero or more non-quote characters
  [A-Z]  # at least one uppercase letter
  [^"]*  # followed by zero or more non-quotes
  "      # and finally the trailing quote
/x

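Using this test program, which contains the above pattern without /x (and therefore without the whitespace and comments), as input to ack-grep (as ack is called on Ubuntu):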

#! /usr/bin/perl

my @tests = (
  [ q<"  AString ">   => 1 ],
  [ q<"AString ">     => 1 ],
  [ q<"  AString">    => 1 ],
  [ q<"newString03">  => 1 ],
  [ q<"!stringBIG?">  => 1 ],
  [ q<"R">            => 1 ],
  [ q<"A String">     => 0 ],
  [ q<"a_string">     => 0 ],
  [ q<"ABigString", "a sentence fragment"> => 1 ],
  [ q<"  a String  "> => 0 ],
  [ q<"Foo bar baz">  => 0 ],
);

my $pattern = qr/"(?![^"]*[^"\s]\s+[^"\s])[^"]*[A-Z][^"]*"/;
for (@tests) {
  my($str,$expectMatch) = @$_;
  my $matched = $str =~ /$pattern/;
  print +($matched xor $expectMatch) ? "FAIL" : "PASS",
        ": $str\n";
}

produces the following output:

$ ack-grep '"(?![^"]*[^"\s]\s+[^"\s])[^"]*[A-Z][^"]*"' try
  [ q<"  AString ">   => 1 ],
  [ q<"AString ">     => 1 ],
  [ q<"  AString">    => 1 ],
  [ q<"newString03">  => 1 ],
  [ q<"!stringBIG?">  => 1 ],
  [ q<"R">            => 1 ],
  [ q<"ABigString", "a sentence fragment"> => 1 ],
my $pattern = qr/"(?![^"]*[^"\s]\s+[^"\s])[^"]*[A-Z][^"]*"/;
  print +($matched xor $expectMatch) ? "FAIL" : "PASS",

With the C shell and its derivatives, you have to escape the bang (!):

% ack-grep '"(?\![^"]*[^"\s]\s+[^"\s])[^"]*[A-Z][^"]*"' ...

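I wish I could preserve the highlighting of the matches here, but that doesn't seem to be allowed.

Note that escaped double quotes (\") will severely confuse this pattern.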
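To make the escaped-quote caveat concrete, here is a minimal sketch. It feeds the pattern a single literal that has embedded spaces and escaped quotes, then compares the result with a procedural check that honors backslash escapes. The helper has_mixed_case_literal is a hypothetical name introduced only for illustration; it is not part of the pattern above, and it assumes C-style literals in which a backslash escapes the next character.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The single-regex pattern from this answer.
my $pattern = qr/"(?![^"]*[^"\s]\s+[^"\s])[^"]*[A-Z][^"]*"/;

# Hypothetical helper (an assumption, not from the answer): walk the
# double-quoted literals on a line, treating backslash + any character
# as part of the literal, and test each literal's body on its own.
sub has_mixed_case_literal {
    my ($line) = @_;
    while ($line =~ /"((?:[^"\\]|\\.)*)"/g) {
        my $body = $1;
        $body =~ s/^\s+|\s+$//g;        # leading/trailing padding is allowed
        next if $body =~ /\s/;          # embedded whitespace: not a match
        return 1 if $body =~ /[A-Z]/;   # at least one ASCII uppercase letter
    }
    return 0;
}

# One literal containing embedded spaces and two escaped quotes.
my $tricky = q<"a \"B\" c">;

printf "pattern: %s\n", $tricky =~ $pattern ? "match" : "no match";
printf "helper:  %s\n", has_mixed_case_literal($tricky) ? "match" : "no match";
```

On this input the pattern reports a match, because it treats the escaped quote before B as the start of a new string, while the helper sees one literal with embedded spaces and rejects it. The helper also counts an uppercase letter inside an escape sequence as uppercase, so it is only a rough approximation.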

Answer 1: (score: 0)

You can add the requirement with a character class, for example:

ack --match "\"\s*\S+[A-Z]\S+\s*\""

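I'm assuming ack matches one line at a time. The \S+\s*\" part can match more than one closing quote on a line: it would match the whole of "alfa"", not just "alfa".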
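As posted, the pattern needs at least one non-space character on each side of the uppercase letter, so a single-character string such as "R" does not match, and \S can also consume quote characters. The sketch below compares it with an assumed variant (not from this answer) that makes both runs optional and excludes quotes from them. The test strings are taken from the question.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Pattern as posted in this answer.
my $posted = qr/"\s*\S+[A-Z]\S+\s*"/;

# Assumed variant (an illustration, not from the answer): the runs around
# the uppercase letter may be empty and may not contain quotes or whitespace.
my $variant = qr/"\s*[^"\s]*[A-Z][^"\s]*\s*"/;

# Return 1 if the string matches the regex, 0 otherwise.
sub check {
    my ($re, $str) = @_;
    return $str =~ $re ? 1 : 0;
}

my @tests = (
    [ q<"  AString ">   => 1 ],
    [ q<"AString ">     => 1 ],
    [ q<"  AString">    => 1 ],
    [ q<"newString03">  => 1 ],
    [ q<"!stringBIG?">  => 1 ],
    [ q<"R">            => 1 ],
    [ q<"A String">     => 0 ],
    [ q<"a_string">     => 0 ],
    [ q<"ABigString", "a sentence fragment"> => 1 ],
    [ q<"  a String  "> => 0 ],
    [ q<"Foo bar baz">  => 0 ],
);

for my $t (@tests) {
    my ($str, $expect) = @$t;
    printf "expected=%d posted=%d variant=%d  %s\n",
        $expect, check($posted, $str), check($variant, $str), $str;
}
```

If the variant suits your data, it can be passed to ack in the same way as the pattern in the question, for example ack --match '"\s*[^"\s]*[A-Z][^"\s]*\s*"'. Like any line-based regex, it cannot tell an opening quote from a closing one, so text between two separate strings can still produce a false match.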