Counting duplicates and conditional filtering in awk or Python

Time: 2013-12-15 04:44:37

Tags: python python-2.7 awk

I have a dataset like this:

a  b  2.7
a  b  9.4
a  b  6.9
x  l  0.004
y  m  0.5

It contains many duplicates.

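I need to remove the duplicates in column 2 and collapse them into one row, keeping the lowest column 3 value among the duplicates. If a row has no duplicates, it should be printed as is. In other words, when column 2 is the same, print the row with the lowest column 3. Desired output: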

3 a b 2.7
1 x  l  0.004
1 y  m  0.5

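So far I have sorted the file to get the duplicate counts (shown as column 1 above), but I cannot work out how to get the lowest column 3 value. I would like to do this in awk or Python. Please help!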

sort -k2,2nr myfile.txt| less
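For comparison, the counting step on its own can be sketched in Python with `collections.Counter`. This is only an illustration: `rows` is the sample data from above hard-coded as strings, and the variable names are assumptions rather than part of the original attempt.

```python
from collections import Counter

# Sample rows from the question, hard-coded for illustration.
rows = [
    "a  b  2.7",
    "a  b  9.4",
    "a  b  6.9",
    "x  l  0.004",
    "y  m  0.5",
]

# Count how many times each column 2 value occurs.
counts = Counter(line.split()[1] for line in rows)

for col2, n in sorted(counts.items()):
    print("{} {}".format(n, col2))
```

This gives the counts only; the lowest column 3 value still has to be tracked separately.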


3 answers:

Answer 0: (score: 1)

In Python:

summary = {}

# ** If order is important, use collections.OrderedDict **
#
#import collections
#summary = collections.OrderedDict()

with open('dataset.txt') as f:
    for line in f:
        col1, col2, value = line.split()
        value = float(value)
        if col2 not in summary:
            summary[col2] = [0, col1, value] # count, col1, col3
        else:
            if value < summary[col2][2]:  # compare with the stored minimum (index 2), not col1
                summary[col2][1] = col1
                summary[col2][2] = value
        summary[col2][0] += 1

# items() and print(...) with a single argument work on Python 2.7 and Python 3.
for col2, s in summary.items():
    print('{0[0]} {0[1]} {1} {0[2]}'.format(s, col2))
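The `float(value)` conversion matters here: if column 3 is left as text, comparisons are lexicographic rather than numeric. A small sketch with made-up values (they are not from the dataset) shows the difference.

```python
# Hypothetical values, chosen only to show the pitfall.
values = ["9.4", "10.2", "2.7"]

# Compared as strings, "10.2" comes first because '1' < '2' < '9'.
min_as_text = min(values)

# Compared as numbers, 2.7 is the smallest.
min_as_number = min(values, key=float)

print("{} {}".format(min_as_text, min_as_number))
```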

In awk:

awk '{if (!($2 in min) || $3<min[$2]) {min[$2]=$3; col1[$2]=$1} cnt[$2]++} \
     END{for (i in cnt) print cnt[i]" "col1[i]" "i" "min[i]}' dataset.txt

Answer 1: (score: 1)

You can use itertools.groupby like this:

from itertools import groupby
from operator import itemgetter

with open("Input.txt") as inFile:
    lines = [line.split() for line in inFile]

getCol2 = itemgetter(1)
# groupby only merges adjacent rows, so sort on column 2 first.
for col2, grp in groupby(sorted(lines, key=getCol2), getCol2):
    grp = list(grp)
    # Pick the row with the numerically lowest column 3 value.
    res = [len(grp)] + min(grp, key=lambda row: float(row[2]))
    print(" ".join(map(str, res)))

Output:

3 a b 2.7
1 x l 0.004
1 y m 0.5
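`itertools.groupby` only merges adjacent rows that share a key, which is why the input is sorted on column 2 before grouping. A minimal sketch with hypothetical, deliberately unsorted rows (the names `rows` and `get_col2` are assumptions for illustration):

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical rows where the "b" entries are not adjacent.
rows = [["a", "b", "2.7"], ["x", "l", "0.004"], ["a", "b", "9.4"]]
get_col2 = itemgetter(1)

# Without sorting, "b" shows up as two separate groups.
unsorted_keys = [key for key, _ in groupby(rows, get_col2)]

# After sorting on column 2, each key appears exactly once.
sorted_keys = [key for key, _ in groupby(sorted(rows, key=get_col2), get_col2)]

print(unsorted_keys)
print(sorted_keys)
```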

Answer 2: (score: 0)

Gawk

awk --version | head -1

GNU Awk 4.1.0, API: 1.0 (GNU MPFR 3.1.2, GNU MP 4.3.2)

awk '{str=$1 FS $2;if (!(str in min) || $3<min[str]) min[str]=$3;sum[str]++} 
    END {for (i in sum) print sum[i],i,min[i]}' myfile.txt
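Note that this version keys on columns 1 and 2 together (`$1 FS $2`) rather than on column 2 alone, so rows that share column 2 but differ in column 1 would stay separate. A rough Python equivalent, assuming the sample data from the question is hard-coded as `rows` (the names here are illustrative, not from the answer):

```python
# Sample rows from the question, hard-coded for illustration.
rows = [
    "a  b  2.7",
    "a  b  9.4",
    "a  b  6.9",
    "x  l  0.004",
    "y  m  0.5",
]

summary = {}  # (col1, col2) -> [count, lowest col3]
for line in rows:
    col1, col2, value = line.split()
    value = float(value)
    key = (col1, col2)
    if key not in summary:
        summary[key] = [0, value]
    elif value < summary[key][1]:
        summary[key][1] = value
    summary[key][0] += 1

for (col1, col2), (count, lowest) in sorted(summary.items()):
    print("{} {} {} {}".format(count, col1, col2, lowest))
```

On the sample data both keying choices give the same three rows, since every column 2 value there is paired with a single column 1 value.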