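Frequency of millions of data values in a CSV column

Time: 2018-07-14 22:01:37

Tags: python excel csv histogram frequency

I have a long list of numbers (a single column with 5 million rows), and they are not unique. I want to see which few thousand of them occur most frequently. Any ideas on how to do this easily? I'm open to using Excel, a Python script, or anything else.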

3 answers:

Answer 0 (score: 6)

In Bash:

sort filename | uniq -c | sort -nr

Answer 1 (score: 2)

Here is one way to do it in Python, using csv.reader and collections.Counter:

import csv
from collections import Counter
from itertools import chain
from io import StringIO

mystr = StringIO("""1
2
3
3
1
1""")

# replace mystr with open('file.csv', 'r')
with mystr as fin:
    # define lazy reader object
    reader = csv.reader(fin)
    # flatten, convert to int, feed to Counter object
    c = Counter(map(int, chain.from_iterable(reader)))

# calculate 2 most common items, return number and counts
print(c.most_common(2))

[(1, 3), (3, 2)]
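
For the actual 5-million-row file, the same idea could be wrapped in a small function. This is only a sketch: the name top_values and the path argument are hypothetical, and it assumes one value per row with no header line. Unlike the snippet above, it keeps the values as strings rather than converting them to int.

```python
import csv
from collections import Counter
from itertools import chain


def top_values(path, n):
    """Return the n most frequent values in a CSV file as (value, count) pairs."""
    # newline='' is the mode the csv module recommends for file objects
    with open(path, newline='') as fin:
        reader = csv.reader(fin)
        # Flatten the rows lazily and count each cell; values stay as strings
        counts = Counter(chain.from_iterable(reader))
    # most_common returns pairs ordered from highest to lowest count
    return counts.most_common(n)
```

Calling top_values('file.csv', 5000) would then return the 5000 most frequent values, most frequent first.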

Answer 2 (score: 2)

Tom's approach, in Python:

import operator
import sys

# Map each distinct line to the number of times it occurs
d = dict()

for filename in sys.argv[1:]:
    file = open(filename, 'r')
    for line in file.read().splitlines():
        if line not in d:
            d[line] = 1
        else:
            d[line] += 1
    file.close()

# Sort by count in ascending order, so the most frequent items are printed last
print("Item,Count")
for line in sorted(d.items(), key=operator.itemgetter(1)):
    print(line[0] + "," + str(line[1]))

Usage:

python linesorter.py filename1.txt filename2.txt filename_...
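
Since the question only asks for the few thousand most frequent values, a full sort of every distinct item is not strictly needed. As a rough sketch (the helper name most_frequent is hypothetical, not part of the script above), heapq.nlargest can pick the top n entries from a dictionary of counts such as d:

```python
import heapq
from operator import itemgetter


def most_frequent(counts, n):
    """Return the n (item, count) pairs with the highest counts, largest first."""
    # nlargest keeps only n candidates at a time instead of sorting everything
    return heapq.nlargest(n, counts.items(), key=itemgetter(1))
```

For example, most_frequent(d, 5000) would give the 5000 most common lines in descending order of count.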