Question

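I have a tab-delimited file in which the last fifteen fields consist of 0s and 1s. What I need is to print the lines that do not have more than five consecutive zeros or more than five consecutive ones among those fifteen fields, which are split into groups of five fields.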

File:

abadenguísimo   abadenguísimo   adjective   n/a n/a singular    n/a masculine   1   1   1   1   1   0   0   0   0   0   0   0   0   0   0
abalaustradísimo    abalaustradísimo    adjective   n/a n/a singular    n/a masculine   1   1   1   1   1   0   0   0   0   0   0   0   0   0   0
abiertísimas    abiertísimo adjective   n/a n/a plural  n/a feminine    1   1   1   1   1   0   0   0   0   0   0   0   0   0   0
abellacadísimo  abellacadísimo  adjective   n/a n/a singular    n/a masculine   1   0   1   1   1   0   0   1   0   0   1   0   0   0   0
cansonísimos    cansonísimo adjective   n/a n/a plural  n/a masculine   0   1   1   1   0   0   0   0   1   0   0   0   0   0   1

Output:

abellacadísimo  abellacadísimo  adjective   n/a n/a singular    n/a masculine   1   0   1   1   1   0   0   1   0   0   1   0   0   0   0
cansonísimos    cansonísimo adjective   n/a n/a plural  n/a masculine   0   1   1   1   0   0   0   0   1   0   0   0   0   0   1

I tried:

BEGIN {
    FS = "\t"
}
{
    a = 0;
    b = 0;
    c = 0;

    num[A] = "";
    num[B] = "";
    num[C] = "";

    for (i = 9; i <= 13; i++)
        num[A] = num[A] "" $i;
    for (j = 14; j <= 18; j++)
        num[B] = num[B] "" $j;
    for (k = 19; k <= 23; k++)
        num[C] = num[C] "" $k;

    if ((num[A] != "00000") && (num[A] != "11111")) {
        a = 1;
    }
    if (num[B] != "00000") {
        b = 1;
    }
    if (num[C] != "00000") {
        c = 1;
    }
    if ((a == 1) || (b == 1) || (c == 1)) {
        print;
    }
}

In the end I think I found a solution, though I don't know why the other code didn't work for me.

BEGIN {
    FS = "\t"
    cont = 0;
}

{
    a = 0;
    b = 0;
    c = 0;

    sum1 = $9 + $10 + $11 + $12 + $13;
    sum2 = $14 + $15 + $16 + $17 + $18;
    sum3 = $19 + $20 + $21 + $22 + $23;

    if ((sum1 > 0) && (sum1 < 5)) {
        a = 1;
    }
    if (sum2 > 0) {
        b = 1;
    }
    if (sum3 > 0) {
        c = 1;
    }

    if ((a == 1) || (b == 1) || (c == 1)) {
        cont++;
        print;
    }
}

END {
    print "Total: " NR;
    print "OK: " cont;
}
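
A likely reason the first script misbehaves: in awk, the unquoted A, B and C are uninitialized variables, so num[A], num[B] and num[C] all refer to the same array element, num[""]. The three loops then append all fifteen fields to that single string, which never equals "00000" or "11111", so every line is printed. A minimal sketch of the same string-comparison idea with three separate variables follows. The function name filter_by_strings and the variables g1, g2, g3 are illustrative, and it assumes the 0/1 fields are fields 9 to 23, as in the sample.

```shell
# Sketch: string comparison per group of five, using plain variables
# instead of array elements indexed by uninitialized names.
filter_by_strings() {
  awk -F'\t' '
  {
    g1 = g2 = g3 = ""
    for (i = 9;  i <= 13; i++) g1 = g1 $i   # fields 9-13
    for (i = 14; i <= 18; i++) g2 = g2 $i   # fields 14-18
    for (i = 19; i <= 23; i++) g3 = g3 $i   # fields 19-23

    # Same conditions as the first attempt.
    if ((g1 != "00000" && g1 != "11111") || g2 != "00000" || g3 != "00000") {
      print
    }
  }'
}
```

It reads from standard input, for example: filter_by_strings < file. Quoting the subscripts in the original script (num["A"], num["B"], num["C"]) should have the same effect.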

Answer 1

If you translate your requirement from English into a regular expression and hand it to grep, it will do what you want:

grep -vE '(1\s+){6,}|(0\s+){6,}' file

You can adjust the \s+, for example by changing it to \t or whatever else fits your needs.

Update

awk -F'\t' '{s=NF-15+1
            c=i=0
            while(++c<=3){
                    x=i?i:s 
                    t=0
                    for(i=x;i<x+5;i++) t+=$i+0
                    if(t==0||t==5) next
            }
            print
    }' file

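This gives you the expected output. It checks for "more than four consecutive zeros/ones" rather than five: since each group has at most 5 elements (columns), ">5" could never happen.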
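
To see what that check works with, here is a small sketch that prints the three group sums next to the first field of each line. The function name group_sums is illustrative, and it assumes, like the script above, that the last fifteen tab-separated fields are the 0/1 columns. A sum of 0 or 5 marks a group that is all zeros or all ones.

```shell
# Sketch: print the first field followed by the sum of each group of five.
group_sums() {
  awk -F'\t' '
  {
    s = NF - 15 + 1                 # index of the first 0/1 field
    out = $1
    for (g = 0; g < 3; g++) {
      t = 0
      for (i = s + 5 * g; i < s + 5 * (g + 1); i++) t += $i + 0
      out = out " " t
    }
    print out
  }'
}
```

Run as group_sums < file. For the abellacadísimo row of the sample (1 0 1 1 1, 0 0 1 0 0, 1 0 0 0 0) it prints "abellacadísimo 4 1 1", so no group sums to 0 or 5 and the line is kept.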

Answer 2

awk 4

awk 'split($0,t,/(1 +){6,}|(0 +){6,}/)<2' file

awk 3.1

awk --posix 'split($0,t,/(1 +){6,}|(0 +){6,}/)<2' file

Update

awk '{for(i=9;i<=NF;i++){a[$i];if(++c==5){l=length(a);delete a;c=0;if(l>1){print;break}}}}' file

Answer 3

The following ERE in grep works on your input data, matching lines in which each of the three groups of five has identical contents:

egrep -v '(\s+[01])\1\1\1\1(\s+[01])\2\2\2\2(\s+[01])\3\3\3\3' file

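Since your question is tagged awk, though, let's express this in awk.

We can't do the same thing in awk, because awk traditionally does not support backreferences in regular expressions. So doing this programmatically, as your script does, is probably the answer. Your solution concatenates fields and compares strings. I think I would use arithmetic instead: the sum of five fields is a number from zero to five. A value of zero or five means "skip"; anything else means "print".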

#!/usr/bin/awk -f

{

  # Count back from the end in groups of five, until we hit a field
  # that is neither "0" nor "1"...
  group=0;
  start=NF;
  while (start >= 5 && $start ~ /^[01]$/) {
    group++;
    sum[group]=0;    # reset, so sums do not carry over from earlier lines
    for(i=start;i>start-5;i--) { sum[group]+=$i; }
    start=i;
  }

  # Step through every group, adding a condition to a counter.
  # At the end of the loop, if found > 0, then we've found a line
  # that does not have the pattern specified.
  found=0;
  for (; group > 0; group--) {
    found+=(sum[group] > 0 && sum[group] < 5);
  }

}

# If found > 0, print the line.
found

AWK: print lines that match a pattern
