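I have been trying to print the most frequent line for each key, with duplicates removed, from a large file of tab-delimited key-value pairs whose first field contains many distinct values.
Sample input: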
a|gofortheeyeboo 0.61
a|gofortheeyeboo 0.81
a|gofortheeyeboo 0.81
a|gofortheeyeboo 0.81
a|gofortheeyeboo 0.81
a|gofortheeyeboo 0.81
a|gofortheeyeboo 0.91
a|gofortheeyeboo-gone 0.07
a|gofortheeyeboo-gone 0.07
a|gofortheeyeboo-abouttogone 0.61
a|gofortheeyeboo-abouttogone 0.12
b|attaack-attack 0.07
Desired output for the distinct keys:
a|gofortheeyeboo 0.81
a|gofortheeyeboo-gone 0.07
a|gofortheeyeboo-abouttogone 0.61
a|gofortheeyeboo-abouttogone 0.12
b|attaack-attack 0.07
So far I have only managed to get the maximum value of the second tab-delimited field for each key, with duplicates removed:
awk -F '\t' '{ if (l[$1] <= $2) l[$1] = $2} END {for (i in l) print i"\t"l[i];}'
Output of the command above, which is not what I want:
a|gofortheeyeboo 0.91
a|gofortheeyeboo-abouttogone 0.61
b|attaack-attack 0.07
a|gofortheeyeboo-gone 0.07
Answer 0: (score: 1)
sort input | uniq -c | sort -nr | \
awk 's[$2] == $1 { print $2,$3} !s[$2] { print $2,$3; s[$2]=$1; }'
Answer 1: (score: -1)
import sys
from functools import reduce  # built in on Python 2, in functools on Python 3

# keys maps each key to a dict of {value: occurrence count}
keys = {}
for line in sys.stdin:
    line = line.strip()
    k, v = line.split('\t')
    if k in keys:
        if v in keys[k]:
            keys[k][v] += 1
        else:
            keys[k][v] = 1
    else:
        keys[k] = {v: 1}

for k in keys:
    items = keys[k].items()
    # Some pair occurred more than once
    if any(map(lambda x: x[1] > 1, items)):
        # Pick the most frequent value (on a tie, only one value is kept)
        freq = reduce(
            lambda acc, e: acc if acc[1] > e[1] else e,
            items
        )[0]
        print('{0}\t{1}'.format(k, freq))
    # No pair occurred more than once
    else:
        # Print every pair
        for v in items:
            print('{0}\t{1}'.format(k, v[0]))
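The script above prints a single value for a key whenever some value occurs more than once, so two values tied at a count above one would be collapsed into one line. The following is a minimal sketch of a tie-aware variant, not part of the original answer: the function name `most_frequent` is hypothetical, it assumes every non-blank line holds exactly one tab between key and value, and it relies on dictionaries keeping insertion order (Python 3.7+).

```python
from collections import Counter, defaultdict


def most_frequent(lines):
    """Return (key, value) pairs whose count is the highest for their key."""
    counts = defaultdict(Counter)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Assumption: exactly one tab separates the key from the value
        key, value = line.split("\t", 1)
        counts[key][value] += 1

    result = []
    for key, counter in counts.items():
        top = max(counter.values())
        # Keep every value tied for the top count, in first-seen order
        result.extend((key, v) for v, c in counter.items() if c == top)
    return result
```

It could be driven from standard input with something like `for key, value in most_frequent(sys.stdin): print(key + "\t" + value)`. Like the script above, it holds one counter per key in memory, so for very large inputs the sort-based pipeline may still be the more practical choice.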