Sorting and organizing letter frequencies - Python

Time: 2013-09-04 01:06:33

Tags: python sorting text for-loop find-occurrences

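I am trying to find a way to count the occurrences of each letter in a text file and then display the letters from highest to lowest frequency. This is what I have so far; please help me get past this mental block.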

def me():
    info= input("what file would you like to select?")
    filehandle= open(info,"r")
    data=filehandle.read()
    case = data.upper()
    s=('ABCDEFGHIJKLMNOPQRSTUVWXYZ')
    for i in range(26):
        print(s[i],case.count(s[i]))



me()

5 answers:

Answer 0 (score: 2)

Python has a nice class for this in its standard library: collections.Counter

In [8]: from collections import Counter

In [9]: with open('Makefile', 'r') as f:
   ...:     raw = Counter(f.read())
   ...:

In [10]: raw
Out[10]: Counter({' ': 61, 'e': 46, 'p': 38, 'a': 29, '\n': 27, 'c': 27, 'n': 27, 'l': 26, 'd': 25, '-': 22, 's': 22, 'y': 22, 't': 20, 'i': 18, 'o': 18, 'r': 17, '.': 16, 'u': 13, '\t': 12, 'm': 12, 'b': 11, 'x': 10, 'h': 9, '/': 8, ':': 8, '_': 7, "'": 6, ';': 5, '\\': 5, 'f': 5, '*': 3, 'v': 3, '{': 3, '}': 3, 'k': 2, 'H': 1, 'O': 1, 'N': 1, 'P': 1, 'Y': 1, 'g': 1})

This is the Makefile from the pandas library, BTW. To sort the characters by frequency in descending order, do the following:

In [22]: raw.most_common()
Out[22]:
[(' ', 61),
 ('e', 46),
 ('p', 38),
 ('a', 29),
 ('\n', 27),
 ('c', 27),
 ('n', 27),
 ('l', 26),
 ('d', 25),
 ('-', 22),
 ('s', 22),
 ('y', 22),
 ('t', 20),
 ('i', 18),
 ('o', 18),
 ('r', 17),
 ('.', 16),
 ('u', 13),
 ('\t', 12),
 ('m', 12),
 ('b', 11),
 ('x', 10),
 ('h', 9),
 ('/', 8),
 (':', 8),
 ('_', 7),
 ("'", 6),
 (';', 5),
 ('\\', 5),
 ('f', 5),
 ('*', 3),
 ('v', 3),
 ('{', 3),
 ('}', 3),
 ('k', 2),
 ('H', 1),
 ('O', 1),
 ('N', 1),
 ('P', 1),
 ('Y', 1),
 ('g', 1)]

I deliberately did not use your exact data, so that you can try adapting my solution to your problem.
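If only the top few entries matter, most_common() also accepts an optional count. A minimal sketch on a made-up string (the sample word and the variable names are just for illustration, not from the question):

```python
from collections import Counter

# Count every character in a small sample string.
counts = Counter("banana")

# Passing n returns only the n most frequent (item, count) pairs.
top_two = counts.most_common(2)
print(top_two)  # [('a', 3), ('n', 2)]
```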

Answer 1 (score: 2)

This is exactly what collections.Counter and its most_common() method are for:

import collections
import string

def me():
    info = input("what file would you like to select? ")
    # The with block closes the file once it has been read.
    with open(info, "r") as filehandle:
        data = filehandle.read().upper()
    char_counter = collections.Counter(data)
    for char, count in char_counter.most_common():
        if char in string.ascii_uppercase:
            print(char, count)

me()

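Counter is a dictionary subclass that counts the occurrences of distinct items (in this case, characters). char_counter.most_common() gives us all (character, count) pairs, sorted from most to least common.

We are only interested in letters, so we check whether each character is in string.ascii_uppercase, which is simply the string of letters from A to Z.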
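A variation on the same idea is to filter out non-letters before counting, so the Counter never sees spaces or punctuation. This is only a sketch: the function name letter_frequencies is made up here, and it takes a string rather than a file name so it is easy to try out.

```python
from collections import Counter
import string


def letter_frequencies(text):
    """Return (letter, count) pairs for A-Z, most frequent first."""
    # Uppercase first so 'a' and 'A' are counted together,
    # then keep only the characters A-Z.
    letters = (ch for ch in text.upper() if ch in string.ascii_uppercase)
    return Counter(letters).most_common()


if __name__ == "__main__":
    for letter, count in letter_frequencies("Hello, World!"):
        print(letter, count)
```

Letters that never occur are simply absent from the result, unlike the original loop over all 26 letters, which prints a zero for them.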

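Answer 2 (score: 0)

This looks very, very good. I hope you are using this site properly. Still, I am glad you came to the right place, and I will try to help you, at least this once.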

>>> from collections import defaultdict
>>> d = defaultdict(int)
>>> input_txt = "Now you are just somebody that I used to know"
>>> for letter in input_txt:
...     d[letter] += 1
... 
>>> import operator
>>> sorted_d = sorted(d.items(), key=operator.itemgetter(1), reverse=True)
>>> sorted_d
[(' ', 9), ('o', 6), ('t', 4), ('u', 3), ('e', 3), ('s', 3), ('w', 2), ('y', 2), ('a', 2), ('d', 2), ('N', 1), ('r', 1), ('j', 1), ('m', 1), ('b', 1), ('h', 1), ('I', 1), ('k', 1), ('n', 1)]

Answer 3 (score: 0)

You could do something along these lines:

d={}
with open('/usr/share/dict/words') as f:
    for line in f:
        for word in line.split():
            word=word.strip()
            for c in word:
                d[c]=d.setdefault(c,0)+1

for k, v in sorted(d.items(), key=lambda t: t[1], reverse=True):
    print(k, v)

For the standard Unix words file, this prints:

e 234413
i 200536
a 196995
o 170062
r 160269
...
Y 139
X 92
Q 77
- 2            
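Note that the output above counts upper and lower case separately and includes the hyphen. If you want letters only, with case folded, the same plain-dict approach could be adjusted roughly like this (count_letters is a made-up name, and it accepts any iterable of lines, such as an open file):

```python
import string


def count_letters(lines):
    """Count ASCII letters case-insensitively using a plain dict."""
    counts = {}
    for line in lines:
        for ch in line:
            # Skip digits, whitespace, punctuation and non-ASCII characters.
            if ch in string.ascii_letters:
                key = ch.upper()
                counts[key] = counts.get(key, 0) + 1
    return counts


if __name__ == "__main__":
    sample = ["Ab-a\n", "b\n"]
    totals = count_letters(sample)
    # Sort by count, highest first.
    for letter, count in sorted(totals.items(), key=lambda t: t[1], reverse=True):
        print(letter, count)
```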

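Answer 4 (score: 0)

Others have already given you better solutions using collections.Counter, but your code is already close; you just cannot print sorted output on the fly. You can save the counts in a list, sort it, and then print: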

def me():
    info = input("what file would you like to select?")
    filehandle = open(info,"r")
    data = filehandle.read()
    case = data.upper()
    s = ('ABCDEFGHIJKLMNOPQRSTUVWXYZ')
    result = []
    for i in range(26):
        result.append((s[i], case.count(s[i])))
    return result

result = me()
for letter, count in sorted(result, key=lambda x: x[1], reverse=True):
    print(letter, count)

Still using your logic, you can make the function more readable:

import string

def me():
    info = input("what file would you like to select?")
    filehandle = open(info,"r")
    data = filehandle.read()
    case = data.upper()
    result = []
    for letter in string.ascii_uppercase:  # string.uppercase exists only in Python 2
        result.append((letter, case.count(letter)))
    return result
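One detail worth deciding is what happens when two letters have the same count. As a sketch (the helper names letter_counts and by_frequency are made up, and the function takes a string instead of prompting for a file), you can sort by descending count and break ties alphabetically:

```python
import string


def letter_counts(text):
    """Return a (letter, count) pair for each of A-Z, in alphabetical order."""
    upper = text.upper()
    return [(letter, upper.count(letter)) for letter in string.ascii_uppercase]


def by_frequency(pairs):
    """Sort pairs by count (highest first), then by letter for ties."""
    # Negating the count gives descending order without reverse=True,
    # which leaves the letters in ascending order within a tie.
    return sorted(pairs, key=lambda pair: (-pair[1], pair[0]))


if __name__ == "__main__":
    for letter, count in by_frequency(letter_counts("banana bread")):
        print(letter, count)
```

Because letter_counts always returns all 26 letters, letters that never occur appear at the end with a count of 0.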