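I have a long list of numbers (a single column with 5 million rows), and they are not unique. I want to find the few thousand values that occur most frequently. Any ideas on how to do this easily? I could use Excel, a Python script, or anything else.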
Answer 0 (score: 6)
In Bash:
sort filename | uniq -c | sort -nr
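For readers more comfortable in Python, the three stages of that pipeline can be mirrored roughly as shown here. This is only an illustrative sketch: the function name `count_like_pipeline` and the sample data are made up for this example, and the order of items that share the same count may differ from what `sort -nr` prints.

```python
from itertools import groupby


def count_like_pipeline(lines):
    """Mimic `sort | uniq -c | sort -nr` (hypothetical helper name)."""
    # sort: bring identical lines next to each other
    ordered = sorted(lines)
    # uniq -c: count each run of identical adjacent lines
    counted = [(len(list(group)), value) for value, group in groupby(ordered)]
    # sort -nr: highest count first
    counted.sort(key=lambda pair: pair[0], reverse=True)
    return counted


# Made-up sample data, one string per input line
sample = ["1", "2", "3", "3", "1", "1"]
for count, value in count_like_pipeline(sample):
    print(count, value)
```

The grouping step only works because the input is sorted first, which is the same reason `uniq -c` needs `sort` in front of it.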
Answer 1 (score: 2)
Here is one way to do it in Python, using csv.reader and collections.Counter:
import csv
from collections import Counter
from itertools import chain
from io import StringIO
mystr = StringIO("""1
2
3
3
1
1""")
# replace mystr with open('file.csv', 'r')
with mystr as fin:
    # define lazy reader object
    reader = csv.reader(fin)
    # flatten, convert to int, feed to Counter object
    c = Counter(map(int, chain.from_iterable(reader)))

# calculate 2 most common items, return number and counts
print(c.most_common(2))
# Output: [(1, 3), (3, 2)]
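Because the question involves a single column of about 5 million rows, a variant that reads one line at a time and returns only the top N may also be handy. This is a minimal sketch, assuming one integer per line; the helper name `top_n_from_lines` and the file name `numbers.txt` in the usage comment are placeholders, not part of the answer above.

```python
from collections import Counter
from io import StringIO


def top_n_from_lines(lines, n):
    """Return the n most frequent integers from an iterable of text lines (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            counts[int(line)] += 1
    return counts.most_common(n)


# Small in-memory stand-in for the real file
demo = StringIO("1\n2\n3\n3\n1\n1\n")
print(top_n_from_lines(demo, 2))

# With a real file (placeholder name), the file object is itself an iterable of lines:
# with open('numbers.txt') as fin:
#     print(top_n_from_lines(fin, 1000))
```

Only the counts are held in memory, so the footprint depends on the number of distinct values and not on the number of rows.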
Answer 2 (score: 2)
Tom's approach, in Python:
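import operator
import sys

# maps each distinct line to the number of times it was seen
d = dict()
for filename in sys.argv[1:]:
    # 'with' closes the file even if reading fails
    with open(filename, 'r') as file:
        for line in file.read().splitlines():
            if line not in d:
                d[line] = 1
            else:
                d[line] += 1

print("Item,Count")
# sorted in ascending order of count, so the most frequent items are printed last
for line in sorted(d.items(), key=operator.itemgetter(1)):
    print(line[0] + "," + str(line[1]))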
Usage:
python linesorter.py filename1.txt filename2.txt filename_...