
Question

I am trying to remove lines from a comma-delimited file where the APPID is the same and the "Category" column holds the same category. Input:

1,APPID,ID2,ID3,5,6,7,8,9,Category,
5002 , APP-1 ,,,,,,,, Cell ,
5002 , APP-1 ,,,,,,,, Cell ,
5002 , APP-2 ,,,,,,,, Cell ,
5002 , APP-2 ,,,,,,,, Enzyme ,
5002 , APP-3 ,,,,,,,, Cell ,
5002 , APP-3 ,,,,,,,, Biochemical ,
5002 , APP-4 ,,,,,,,, Enzyme ,
5002 , APP-4 ,,,,,,,, Enzyme ,
5002 , APP-4 ,,,,,,,, Enzyme ,
5002 , APP-4 ,,,,,,,, Cell ,

Desired output:

1,APPID,ID2,ID3,5,6,7,8,9,Category,
5002 , APP-2 ,,,,,,,, Cell ,
5002 , APP-2 ,,,,,,,, Enzyme ,
5002 , APP-3 ,,,,,,,, Cell ,
5002 , APP-3 ,,,,,,,, Biochemical ,
5002 , APP-4 ,,,,,,,, Enzyme ,
5002 , APP-4 ,,,,,,,, Enzyme ,
5002 , APP-4 ,,,,,,,, Enzyme ,
5002 , APP-4 ,,,,,,,, Cell ,

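The "APP-1" lines are removed because their column 2 is the same and their "Category" values are both "Cell".

The "APP-2" lines are kept because one of their "Category" values is "Cell" and the other is "Enzyme".

The same applies to "APP-3", whose "Category" column contains heterogeneous categories.

(Update) The "APP-4" lines are kept because their "Category" column contains heterogeneous categories. We want to keep the repeated "5002 , APP-4 ..." lines for now; they will be handled by the next script. This step is about quickly removing tens of thousands of data points whose "Category" values are homogeneous (when their APPID is the same), so that the arrays in the next script do not blow up.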

My attempt so far does not seem to work (based on this question: removal of redundant lines based on value in last column):

  awk -F " ," '!a[$1,$2,$3,$4,$5,$6,$7,$8,$9]++' input

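Each file is about a million lines long, and there are about 400 files to process in total, so execution speed seems critical here. Can any guru enlighten me? Thanks!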

Answer 1

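def killDups(infilepath, outfilepath):
    # First pass: remember the line index of each (APPID, category) pair.
    # A pair seen more than once is marked None, so all its lines are dropped.
    data = {}
    with open(infilepath) as infile:
        infile.readline()  # skip the header
        for i, line in enumerate(infile):
            cols = [col.strip() for col in line.strip().split(',')]
            # every line ends with a trailing comma, so the category is cols[-2]
            appid, cat = cols[1], cols[-2]
            cats = data.setdefault(appid, {})
            cats[cat] = None if cat in cats else i

    whitelist = set()
    for v in data.values():
        whitelist.update(i for i in v.values() if i is not None)

    # Second pass: copy the header, then only the whitelisted lines.
    with open(infilepath) as infile, open(outfilepath, 'w') as outfile:
        outfile.write(infile.readline())
        for i, line in enumerate(infile):
            if i in whitelist:
                outfile.write(line)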
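
The question's update asks to keep every line of an APPID whose categories differ, repeats included. A minimal two-pass sketch of that rule follows. It assumes the layout of the sample input (APPID in the second column, category in the second-to-last column because of the trailing comma). The name keep_heterogeneous is hypothetical, and dropping an APPID that occurs on only one line is an assumption about the intended behavior.

```python
def keep_heterogeneous(infilepath, outfilepath):
    """Keep only lines whose APPID appears with more than one category."""

    def appid_and_category(line):
        cols = [col.strip() for col in line.rstrip('\n').split(',')]
        # Assumption: every line ends with a trailing comma,
        # so the category is the second-to-last field.
        return cols[1], cols[-2]

    # Pass 1: collect the distinct categories seen for each APPID.
    categories = {}
    with open(infilepath) as infile:
        infile.readline()  # skip the header
        for line in infile:
            appid, cat = appid_and_category(line)
            categories.setdefault(appid, set()).add(cat)

    # Pass 2: copy the header, then every line of a heterogeneous APPID.
    with open(infilepath) as infile, open(outfilepath, 'w') as outfile:
        outfile.write(infile.readline())
        for line in infile:
            appid, _ = appid_and_category(line)
            if len(categories[appid]) > 1:
                outfile.write(line)
```

Only one small set per APPID is held in memory, never the lines themselves, which may matter for million-line files. It would be called as keep_heterogeneous('input.csv', 'output.csv'), where both file names are placeholders.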

Answer 2

$ awk -F, '
  { key=$2 FS $(NF-1); nr2key[NR]=key; key2val[key]=$0; cnt[key]++ }
  END {
      for (i=1;i<=NR;i++) {
          key=nr2key[i]
          if (cnt[key] == 1) {
              print key2val[key]
          }
      }
  }
  ' file
1,APPID,ID2,ID3,5,6,7,8,9,Category,
5002 , APP-2 ,,,,,,,, Cell ,
5002 , APP-2 ,,,,,,,, Enzyme ,
5002 , APP-3 ,,,,,,,, Cell ,
5002 , APP-3 ,,,,,,,, Biochemical ,

Answer 3

Here is another way with awk:

awk -F, '
!patt[$2,$(NF-1)]++ { lines[$2,$(NF-1)] = $0 } 
END {
    for (line in lines)
      if (patt[line] == 1)
        print lines[line]
}' file | sort -t, -nk1,2
1,APPID,ID2,ID3,5,6,7,8,9,Category,
5002 , APP-2 ,,,,,,,, Cell ,
5002 , APP-2 ,,,,,,,, Enzyme ,
5002 , APP-3 ,,,,,,,, Biochemical ,
5002 , APP-3 ,,,,,,,, Cell ,

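If the pair of columns is not yet in the patt array, the whole line is stored in the lines array under the same key.
In the END block, we iterate over the lines array. If the count for that key in the patt array is 1, the line is printed.
To restore the order, pipe the output to sort.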
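
The counting logic described above can also be sketched in Python for comparison. This is an illustration only: unique_pairs is a hypothetical name, and the field positions are assumed from the sample input. Because Python dicts preserve insertion order (Python 3.7+), no separate sort step is needed here.

```python
def unique_pairs(lines):
    """Return the lines whose (APPID, category) pair occurs exactly once."""
    count = {}  # plays the role of patt: occurrences of each pair
    first = {}  # plays the role of lines: first line seen for each pair
    for line in lines:
        cols = line.split(',')
        key = (cols[1], cols[-2])  # same fields as $2 and $(NF-1)
        count[key] = count.get(key, 0) + 1
        first.setdefault(key, line)
    # dicts keep insertion order, so the input order is preserved
    return [line for key, line in first.items() if count[key] == 1]


sample = [
    "1,APPID,ID2,ID3,5,6,7,8,9,Category,",
    "5002 , APP-1 ,,,,,,,, Cell ,",
    "5002 , APP-1 ,,,,,,,, Cell ,",
    "5002 , APP-2 ,,,,,,,, Cell ,",
    "5002 , APP-2 ,,,,,,,, Enzyme ,",
    "5002 , APP-3 ,,,,,,,, Cell ,",
    "5002 , APP-3 ,,,,,,,, Biochemical ,",
]
for line in unique_pairs(sample):
    print(line)
```

As in the awk version, the header is treated like any other line and survives because its key occurs once.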

Note: for a more elegant way with vanilla awk, see Ed Morton's solution.

If you have GNU awk, then (similar logic, but using its built-in sorting):

gawk -F, '
BEGIN { PROCINFO["sorted_in"] = "@ind_num_desc" }
!patt[$2,$(NF-1)]++ {
    lines[$2,$(NF-1)] = $0
}
END {
    for (line in lines)
      if (patt[line] == 1)
        print lines[line]
}' file

If you can use perl:

perl -F, -lane'                        
    print and next if $.==1;        # print the header
    $key = "@F[1,-1]";              # form the key using two columns
    $h{$key} or push @rec, $key;    # if key is not in hash push to array (for order)
    push @{$h{$key}}, $_            # create hash of arrays
}{                                  # In the END block ...
    print @{$h{$_}} for grep { @{$h{$_}} == 1 } @rec   # print line whose array count is 1
' file 
1,APPID,ID2,ID3,5,6,7,8,9,Category,
5002 , APP-2 ,,,,,,,, Cell ,
5002 , APP-2 ,,,,,,,, Enzyme ,
5002 , APP-3 ,,,,,,,, Cell ,
5002 , APP-3 ,,,,,,,, Biochemical ,

Update:

perl -F, -lane'
    print and next if $.==1;
    $seen{$F[1],$F[-1]}++ or push @rec, [$F[1], $F[-1]];
    push @{$h{$F[1]}{$F[-1]}}, $_
}{
    for (@rec) {
        next if keys %{$h{$_->[0]}} == 1;
        print join "\n", @{$h{$_->[0]}{$_->[1]}};
    }
' file
1,APPID,ID2,ID3,5,6,7,8,9,Category,
5002 , APP-2 ,,,,,,,, Cell ,
5002 , APP-2 ,,,,,,,, Enzyme ,
5002 , APP-3 ,,,,,,,, Cell ,
5002 , APP-3 ,,,,,,,, Biochemical ,
5002 , APP-4 ,,,,,,,, Enzyme ,
5002 , APP-4 ,,,,,,,, Enzyme ,
5002 , APP-4 ,,,,,,,, Enzyme ,
5002 , APP-4 ,,,,,,,, Cell ,

Answer 4

Here is a GNU Awk solution that keeps keys whose values are heterogeneous overall, even when some of those values are repeated, as in APP-4:

BEGIN {
    FS=","
    OFS=","
}
{
    key[NR]=$2
    count[$2]++
    v=$(NF-1)
    val[NR]=v
    val_count[$2][v]++
    line[NR]=$0
}
END {
    for(i=1;i<=NR;i++) {
        k=key[i]
        j=val[i]
        if(count[k] > 1) {
            if(val_count[k][j] == count[k]) {
                continue
            }else{
                print line[i]
            }
        }else{
            print line[i]
        }
    }
}

You can save this as an Awk file named hetero.awk and run the script from the shell like this:

gawk -f hetero.awk file

Output:

1,APPID,ID2,ID3,5,6,7,8,9,Category,
5002 , APP-2 ,,,,,,,, Cell ,
5002 , APP-2 ,,,,,,,, Enzyme ,
5002 , APP-3 ,,,,,,,, Cell ,
5002 , APP-3 ,,,,,,,, Biochemical ,
5002 , APP-4 ,,,,,,,, Enzyme ,
5002 , APP-4 ,,,,,,,, Enzyme ,
5002 , APP-4 ,,,,,,,, Enzyme ,
5002 , APP-4 ,,,,,,,, Cell ,

Or, for a quick-and-dirty approach, you can put the following in a shell script:

gawk -F, -v OFS=, '{
    key[NR]=$2
    count[$2]++
    v=$(NF-1)
    val[NR]=v
    val_count[$2][v]++
    line[NR]=$0
}END{
    for(i=1;i<=NR;i++) {
        k=key[i]
        j=val[i]
        if(count[k] > 1) {
            if(val_count[k][j] == count[k]) {
                continue
            }else{
                print line[i]
            }
        }else{
            print line[i]
        }
    }
}' file

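As a general practice, I prefer to use only Awk one-liners in my bash scripts.

Note that this uses arrays of arrays (val_count[$2][v]), a GNU Awk feature that is not available in other awk variants such as mawk.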
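
The val_count[$2][v] syntax in the script is a gawk array of arrays. Where that is unavailable, the nested counter can be flattened to a composite key (in POSIX awk, val_count[$2,v] joins the subscripts with SUBSEP). The sketch below illustrates the same flattening in Python with tuple keys. The name hetero_lines is hypothetical, and the field positions are assumed from the sample input.

```python
from collections import Counter


def hetero_lines(lines):
    """Mirror hetero.awk using flat counters instead of nested arrays."""
    rows = []              # (appid, category, original line) in input order
    count = Counter()      # lines per APPID, like count[$2]
    val_count = Counter()  # lines per (APPID, category), like val_count[$2][v]
    for line in lines:
        cols = line.split(',')
        appid, cat = cols[1], cols[-2]
        rows.append((appid, cat, line))
        count[appid] += 1
        val_count[appid, cat] += 1
    # Drop a line only when its APPID repeats and every one of those
    # lines carries the same category.
    return [
        line
        for appid, cat, line in rows
        if not (count[appid] > 1 and val_count[appid, cat] == count[appid])
    ]
```

Like the Awk script, this keeps an APPID that occurs on a single line and holds every input line in memory until the end.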

Remove lines based on category in column