Find repeated tagged substrings

Time: 2019-09-23 20:19:20

Tags: regex perl

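I have a file whose lines consist of fields that are:

  • separated by alphanumeric tags that start with a special character ('%' in the example below)
  • the tag text ends with a space
  • the field content ends with ','
  • the field content never contains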

Example line:

%a astuff,%b bstuff,%t this,%u that,%v this,%t that,%x the other,%xx only once,%q the other,%z the other,%c cstuff

The tag set matters for the search. Here is my example tag set:

%t,%u,%v,%w,%x,%xx,%y,%z
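A tag set like this can be turned into a regex programmatically. The following is only a sketch: `build_tag_regex` is a hypothetical helper name, not something from the post, and it assumes the tags are plain strings. Sorting longest first means `xx` is tried before `x`, and `quotemeta` protects against metacharacters in the marker or tags.

```perl
use strict;
use warnings;

# Hypothetical helper: build a regex matching the marker followed by
# any one of the given tag names.
sub build_tag_regex {
    my ($marker, @tags) = @_;
    # Longest tags first so 'xx' is tried before 'x'; quotemeta escapes
    # any regex metacharacters in the tag names.
    my $alt = join '|',
              map  { quotemeta($_) }
              sort { length($b) <=> length($a) or $a cmp $b } @tags;
    return qr/\Q$marker\E(?:$alt)/;
}

my $tag_re = build_tag_regex('%', qw(t u v w x xx y z));

# A tag is in the set if the field starts with marker + tag + space.
for my $field ('%t this', '%xx only once', '%q the other') {
    my $verdict = $field =~ /^$tag_re / ? 'in the set' : 'not in the set';
    print "$field: $verdict\n";
}
```

Requiring the space after the tag is what keeps `%xx` from being accepted merely because it begins with `%x`.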

I want to find the contents of fields whose tag is in the set and whose contents are repeated in a later field that is also tagged from the set. Here is my attempt, which fails:

my $tagmrkr='%';
my $line='%a astuff,%b bstuff,%t this,%u that,%v this,%t that,%x the other,%xx only once,%q the other,%z the other,%c cstuff';

my $searchtags = qr/t|u|v|w|x|xx|y|z/; # excludes q

print qq/The line:$line\n\n/;
for ($line =~ m/
    $tagmrkr$searchtags\ ([^\,]*,)
    .*?
    $tagmrkr$searchtags\ \1
    /gx) {
        print qq/First field contents:$1\n/;
        print qq/Entire match:$&\n/;
        print qq/\n/;
        }

I expected:

The line:%a astuff,%b bstuff,%t this,%u that,%v this,%t that,%x the other,%xx only once,%q the other,%z the other,%c cstuff

First field contents:this,
Entire match:%t this,%u that,%v this,

First field contents:the other,
Entire match:%x the other,%xx only once,%q the other,%z the other,

I got:

The line:%a astuff,%b bstuff,%t this,%u that,%v this,%t that,%x the other,%xx only once,%q the other,%z the other,%c cstuff

First field contents:the other,
Entire match:%x the other,%xx only once,%q the other,%z the other,

First field contents:the other,
Entire match:%x the other,%xx only once,%q the other,%z the other,

Question 1:
Why is the first match's $1 replaced by the value from the second match?

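Question 2: What should I change to get what I want (below) rather than what I expected?

What I want is to be able to rewind the match so that repeated fields are found even when the matches overlap, that is, when the first occurrence of the second match comes before the second occurrence of the first match. In fact, for my immediate purposes, all I need is the repeated field contents.

That is, I would like three matches in the example: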

$&
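If only the repeated field contents are needed, one alternative to a backreference is to split the line and count contents in a hash. This is a sketch of a different technique, not the approach from the post: `repeated_contents` is a hypothetical name, and it assumes that a field's content never contains ',' and that tags consist of word characters.

```perl
use strict;
use warnings;

# Hypothetical helper: return the contents that appear in more than one
# field whose tag is in the wanted set, in order of first repetition.
sub repeated_contents {
    my ($line, $marker, $wanted) = @_;
    my (%seen, @repeated);
    for my $field (split /,/, $line) {
        # Tag = word characters after the marker; content = text after the space.
        my ($tag, $content) = $field =~ /^\Q$marker\E(\w+) (.*)$/ or next;
        next unless $wanted->{$tag};
        # Report each content once, the second time it is seen.
        push @repeated, $content if ++$seen{$content} == 2;
    }
    return @repeated;
}

my %wanted = map { $_ => 1 } qw(t u v w x xx y z);
my $sample = '%a astuff,%b bstuff,%t this,%u that,%v this,%t that,'
           . '%x the other,%xx only once,%q the other,%z the other,%c cstuff';

print "$_\n" for repeated_contents($sample, '%', \%wanted);
```

Because every field is examined independently, overlapping repeats need no special handling here; the trade-off is that the span between the two occurrences is not reported.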

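2 answers:

Answer 0: (score: 3)

One way to allow overlaps is to assert the presence of the rest of the phrase with a lookahead. That part is then not consumed, so the engine resumes from before it and can match it again.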

use warnings;
use strict;
use feature 'say';

my $s = q(%a astuff,%b bstuff,%t this,%u that,%v this,%t that,)
      . q(%x the other,%xx only once,%q the other,%z the other,%c cstuff); 

my $m = qr/%/;
my $t = qr/(?:t|u|v|w|x|xx|y|z)/; 

while ($s =~ / $m$t \s ([^,]+) , (?=(.*?$m$t\s\g{1},?)) /gx) { 
    say "capture: $1";
    say "  whole: $1,$2";
}

Prints

capture: this
  whole: this,%u that,%v this,
capture: that
  whole: that,%v this,%t that,
capture: the other
  whole: the other,%xx only once,%q the other,%z the other,
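The non-consuming behaviour of a lookahead is easier to see on a smaller input. The sketch below is an illustration only (`adjacent_pairs` is a made-up name): it collects every pair of adjacent letters, including pairs that overlap.

```perl
use strict;
use warnings;

# Collect every pair of adjacent word characters, overlaps included.
sub adjacent_pairs {
    my ($text) = @_;
    my @pairs;
    # Only the first character is consumed; the second is captured inside
    # the lookahead, so the next attempt starts on that character.
    while ($text =~ /(\w)(?=(\w))/g) {
        push @pairs, "$1$2";
    }
    return @pairs;
}

print join(' ', adjacent_pairs('abcd')), "\n";   # ab bc cd
```

Without the lookahead, `/(\w)(\w)/g` would consume both characters and yield only `ab` and `cd`.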

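Answer 1: (score: 0)

Using a global match in a for loop returns all matches at once and then iterates over them, so the match variables are already set to the last successful match before the iteration starts. Using a global regex match in a while loop evaluates it in scalar context instead, so the match variables are correct on each iteration.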
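The difference between the two contexts can be reproduced with a short string. This is a minimal sketch with made-up input, recording `$&` inside each kind of loop:

```perl
use strict;
use warnings;

my $text = 'a1 b2 c3';

# List context: every match is found before the loop body runs, so $&
# already holds the last successful match on each iteration.
my @in_for;
for my $capture ($text =~ /(\w)\d/g) {
    push @in_for, "$capture:$&";
}

# Scalar context: one match per iteration, so $1 and $& stay in step.
my @in_while;
while ($text =~ /(\w)\d/g) {
    push @in_while, "$1:$&";
}

print "for:   @in_for\n";     # for:   a:c3 b:c3 c:c3
print "while: @in_while\n";   # while: a:a1 b:b2 c:c3
```

In the for loop the list of captures is still correct, because it was built up front; only the match variables read inside the body are stale.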

You can get all three matches by resetting pos $line on each iteration, for example like this:

while ($line =~ m/
      $tagmrkr$searchtags\ ([^\,]*,)
      .*?
      $tagmrkr$searchtags\ \1
   /gx) {
    pos $line = $-[0] + 1;
    print qq/First field contents:$1\n/;
    print qq/Entire match:$&\n/;
    print qq/\n/;
}

Output

The line:%a astuff,%b bstuff,%t this,%u that,%v this,%t that,%x the other,%xx only once,%q the other,%z the other,%c cstuff

First field contents:this,
Entire match:%t this,%u that,%v this,

First field contents:that,
Entire match:%u that,%v this,%t that,

First field contents:the other,
Entire match:%x the other,%xx only once,%q the other,%z the other,