I have a dataset as follows:
a b 2.7
a b 9.4
a b 6.9
x l 0.004
y m 0.5
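As shown, there are many duplicates.
I need to remove the duplicates in column 2 and collapse them, keeping the lowest column 3 value among the duplicates. If a row has no duplicates, print it as is. In other words, when col 2 is the same, print the row with the lowest col 3. Desired output: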
3 a b 2.7
1 x l 0.004
1 y m 0.5
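So far I have sorted the file to get the duplicate counts (shown as col 1), but I cannot work out how to get the lowest col 3 value. I would like to do this in awk or Python. Please help!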
sort -k2,2nr myfile.txt| less
Answer 0 (score: 1)
In Python:
summary = {}
# ** If order is important, use collections.OrderedDict **
# (plain dicts also keep insertion order in Python 3.7+)
#
#import collections
#summary = collections.OrderedDict()
with open('dataset.txt') as f:
    for line in f:
        col1, col2, value = line.split()
        value = float(value)
        if col2 not in summary:
            summary[col2] = [0, col1, value]  # count, col1, col3
        elif value < summary[col2][2]:  # compare with the stored minimum (index 2)
            summary[col2][1] = col1
            summary[col2][2] = value
        summary[col2][0] += 1  # count every row, including the first one
for col2, s in summary.items():
    print('{0[0]} {0[1]} {1} {0[2]}'.format(s, col2))
In awk:
awk '{if (!($2 in min) || $3<min[$2]) {min[$2]=$3; col1[$2]=$1} cnt[$2]++} \
END{for (i in cnt) print cnt[i]" "col1[i]" "i" "min[i]}' dataset.txt
Answer 1 (score: 1)
You can use itertools.groupby like this:
from itertools import groupby
from operator import itemgetter

with open("Input.txt") as inFile:
    lines = [line.split() for line in inFile]

getCol2 = itemgetter(1)
# groupby only merges adjacent rows, so sort by column 2 first
for col2, grp in groupby(sorted(lines, key=getCol2), getCol2):
    grp = list(grp)
    # pick the row whose column 3 is numerically lowest
    res = [len(grp)] + min(grp, key=lambda row: float(row[2]))
    print(" ".join(map(str, res)))
Output
3 a b 2.7
1 x l 0.004
1 y m 0.5
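One thing to watch when picking the lowest column 3 value: the fields read from the file are strings, and strings compare character by character rather than numerically. A small sketch of the difference (the sample values here are made up for illustration, they are not from the question's data):

```python
# Hypothetical column 3 values as they come out of line.split(): all strings.
values = ["9.4", "10.2", "2.7e1"]

# Plain min() compares the text, so "10.2" sorts before "9.4".
as_text = min(values)

# Converting with float() for the comparison gives the numeric minimum.
as_number = min(values, key=float)

print(as_text, as_number)
```

So it is safer to convert column 3 with float() (or pass key=float) before comparing.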
Answer 2 (score: 0)
Gawk
awk --version | head -1
GNU Awk 4.1.0, API: 1.0 (GNU MPFR 3.1.2, GNU MP 4.3.2)
awk '{str=$1 FS $2;if (!(str in min) || $3<min[str]) min[str]=$3;sum[str]++}
END {for (i in sum) print sum[i],i,min[i]}' myfile.txt
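Note that this command groups on columns 1 and 2 together, not on column 2 alone. For the sample data in the question both give the same result, but they can differ when the same column 2 value appears with different column 1 values. A minimal Python sketch of the difference, assuming whitespace-separated three-column rows (the function name `collapse`, the `key_cols` parameter, and the extra "c b 1.0" row are hypothetical, for illustration only):

```python
def collapse(lines, key_cols=(1,)):
    """Group rows on the given column indexes (0 = col 1, 1 = col 2).

    Returns one (count, col1, col2, lowest col3) tuple per group,
    in order of first appearance.
    """
    groups = {}  # plain dicts keep insertion order in Python 3.7+
    for line in lines:
        col1, col2, raw = line.split()
        value = float(raw)
        key = tuple((col1, col2)[i] for i in key_cols)
        if key not in groups:
            groups[key] = [1, col1, col2, value]
        else:
            entry = groups[key]
            entry[0] += 1
            if value < entry[3]:
                # keep col1 of the row that holds the minimum
                entry[1], entry[3] = col1, value
    return [tuple(entry) for entry in groups.values()]


# Made-up rows: "b" appears in column 2 with two different column 1 values.
sample = ["a b 2.7", "a b 9.4", "c b 1.0", "x l 0.004"]

by_col2 = collapse(sample, key_cols=(1,))    # group on column 2 only
by_both = collapse(sample, key_cols=(0, 1))  # group on columns 1 and 2

for row in by_col2:
    print(*row)
```

Grouping on column 2 alone merges the "a b" and "c b" rows into one group of three, while grouping on both columns keeps them apart, so which one you want depends on how the data is meant to be collapsed.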