Question

I have a dataset like this:

a  b  2.7
a  b  9.4
a  b  6.9
x  l  0.004
y  m  0.5

It contains many duplicates.

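I need to remove the duplicates in column 2 and collapse them, taking the lowest column 3 value among the duplicates. If a row has no duplicates, print it as is. In other words, when column 2 is the same, print the lowest column 3 value. Desired output: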

3 a b 2.7
1 x  l  0.004
1 y  m  0.5

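So far I have sorted the file to get the count of duplicates (shown as column 1 above), but I can't work out how to get the lowest column 3 value. I would like to do this in awk or Python. Please help!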

sort -k2,2nr myfile.txt| less

GENEART.

Answer 1

In Python:

summary = {}

# ** If order is important, use collections.OrderedDict **
#
#import collections
#summary = collections.OrderedDict()

with open('dataset.txt') as f:
    for line in f:
        col1, col2, value = line.split()
        value = float(value)
        if col2 not in summary:
            summary[col2] = [0, col1, value] # count, col1, col3
        else:
            if value < summary[col2][2]:  # compare with the stored minimum (index 2), not col1
                summary[col2][1] = col1
                summary[col2][2] = value
        summary[col2][0] += 1

for col2, s in summary.items():
    print('{0[0]} {0[1]} {1} {0[2]}'.format(s, col2))

In awk:

awk '{if (!($2 in min) || $3<min[$2]) {min[$2]=$3; col1[$2]=$1} cnt[$2]++} \
     END{for (i in cnt) print cnt[i]" "col1[i]" "i" "min[i]}' dataset.txt

Answer 2

You can use itertools.groupby like this:

with open("Input.txt") as inFile:
    lines = [line.split() for line in inFile]
from itertools import groupby
from operator import itemgetter
getCol2 = itemgetter(1)
for col2, grp in groupby(sorted(lines, key = getCol2), getCol2):
    grp = list(grp)
    res = [len(grp)] + min(grp, key=lambda row: float(row[2]))  # row with the lowest col 3
    print(" ".join(map(str, res)))

Output

3 a b 2.7
1 x l 0.004
1 y m 0.5
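
Note that groupby only merges rows that sit next to each other, which is why the input is sorted by column 2 before grouping. A minimal sketch of the difference, using made-up rows (not the question's file) in which the two "b" rows are not adjacent:

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical rows for illustration: the two "b" rows are separated by an "l" row.
rows = [["a", "b", "2.7"], ["x", "l", "0.004"], ["a", "b", "9.4"]]
col2 = itemgetter(1)

# Without sorting, "b" comes out as two separate groups.
unsorted_keys = [key for key, _ in groupby(rows, col2)]

# After sorting by column 2, each key appears exactly once.
sorted_keys = [key for key, _ in groupby(sorted(rows, key=col2), col2)]

print(unsorted_keys)  # ['b', 'l', 'b']
print(sorted_keys)    # ['b', 'l']
```

If the file is already sorted on column 2, the sorted() call can probably be dropped.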

Answer 3

Gawk

awk --version | head -1

GNU Awk 4.1.0, API: 1.0 (GNU MPFR 3.1.2, GNU MP 4.3.2)

awk '{str=$1 FS $2;if (!(str in min) || $3<min[str]) min[str]=$3;sum[str]++} 
    END {for (i in sum) print sum[i],i,min[i]}' myfile.txt
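
The awk command above groups on columns 1 and 2 together rather than on column 2 alone. For the sample data the result is the same, but it would differ if one column 2 value appeared with different column 1 values. For comparison, here is a rough Python sketch of the same composite-key idea. The function name summarize and the in-memory sample list are assumptions made for illustration; the rows are copied from the question.

```python
def summarize(lines):
    """Return (count, col1, col2, lowest col3) for each distinct (col1, col2) pair."""
    stats = {}  # (col1, col2) -> [count, lowest col3 kept as text]
    for line in lines:
        col1, col2, value = line.split()
        key = (col1, col2)
        if key not in stats:
            stats[key] = [1, value]
        else:
            stats[key][0] += 1
            # Compare numerically, but keep the original text for printing.
            if float(value) < float(stats[key][1]):
                stats[key][1] = value
    return [(count, col1, col2, low)
            for (col1, col2), (count, low) in stats.items()]


# Sample rows copied from the question.
sample = ["a  b  2.7", "a  b  9.4", "a  b  6.9", "x  l  0.004", "y  m  0.5"]
for row in summarize(sample):
    print(*row)
```

Keeping column 3 as text avoids any change in how the numbers are printed. On Python 3.7+ the rows come out in first-seen order, since dicts preserve insertion order.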

Count duplicates and conditional filtering in awk or Python

3 answers: