Question

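I have a long list of numbers (a single column with 5 million rows), and they are not unique. I'd like to find the few thousand that occur most frequently. Any ideas on how to do this easily? I can use Excel, a Python script, or anything else.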

Answer 1

In Bash:

sort filename | uniq -c | sort -nr
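
If you only want the top of that list, piping the result through head (for example `| head -n 1000`) keeps just the first lines. For anyone curious what the pipeline does step by step, here is a rough Python sketch of the same idea: sort the lines so duplicates sit next to each other, count each run (the `uniq -c` step), then order by count, highest first (the `sort -nr` step). The function name pipeline_counts is made up for illustration, ties may come out in a different order than with sort, and it holds everything in memory, so treat it as an explanation rather than a replacement.

```python
from itertools import groupby

def pipeline_counts(lines, top=None):
    """Mimic `sort | uniq -c | sort -nr` on an iterable of strings."""
    ordered = sorted(lines)                      # sort: duplicates become adjacent
    counted = [(sum(1 for _ in run), value)      # uniq -c: count each run of equal lines
               for value, run in groupby(ordered)]
    counted.sort(key=lambda pair: pair[0], reverse=True)  # sort -nr: by count, descending
    return counted if top is None else counted[:top]      # optional `head -n top`

if __name__ == "__main__":
    # same sample data as a six-line file containing 1, 2, 3, 3, 1, 1
    for count, value in pipeline_counts(["1", "2", "3", "3", "1", "1"], top=2):
        print(count, value)
```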

Answer 2

Here is one way to do it in Python, using csv.reader and collections.Counter:

import csv
from collections import Counter
from itertools import chain
from io import StringIO

mystr = StringIO("""1
2
3
3
1
1""")

# for a real file, replace mystr with open('file.csv', 'r', newline='')
with mystr as fin:
    # define lazy reader object
    reader = csv.reader(fin)
    # flatten, convert to int, feed to Counter object
    c = Counter(map(int, chain.from_iterable(reader)))

# calculate 2 most common items, return number and counts
print(c.most_common(2))

[(1, 3), (3, 2)]
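
To run this against the real file and keep only the top few thousand values, something along these lines might work. It is a sketch under assumptions: the file has one integer per line and no header row, and the names top_values, write_counts, values.csv and top_values.csv are placeholders, not anything from the question.

```python
import csv
from collections import Counter

def top_values(path, n):
    """Return the n most common values in a single-column file, with counts."""
    counts = Counter()
    with open(path, newline="") as fin:
        for row in csv.reader(fin):   # rows are read lazily, one at a time
            if row:                   # skip blank lines
                counts[int(row[0])] += 1
    return counts.most_common(n)

def write_counts(pairs, out_path):
    """Write (value, count) pairs to a CSV file with a header row."""
    with open(out_path, "w", newline="") as fout:
        writer = csv.writer(fout)
        writer.writerow(["value", "count"])
        writer.writerows(pairs)

if __name__ == "__main__":
    import sys
    # hypothetical usage: python top_values.py values.csv 5000
    if len(sys.argv) == 3:
        write_counts(top_values(sys.argv[1], int(sys.argv[2])), "top_values.csv")
```

Memory use here grows with the number of distinct values rather than the number of rows, so 5 million rows should be manageable as long as the values repeat a fair amount.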

Answer 3

Tom's approach, in Python:

import operator
import sys

d = dict()

# count how often each line occurs across all files given on the command line
for filename in sys.argv[1:]:
    with open(filename, 'r') as file:
        for line in file.read().splitlines():
            if line not in d:
                d[line] = 1
            else:
                d[line] += 1

# print in ascending order of count, so the most frequent items come last
print("Item,Count")
for line in sorted(d.items(), key=operator.itemgetter(1)):
    print(line[0] + "," + str(line[1]))

Usage:

python linesorter.py filename1.txt filename2.txt filename_...

Frequency of millions of data values in a CSV column
