Question

I have a file containing Unicode strings. I am counting the words in it and sorting them with a Counter object.

Here is my code:

import io
from collections import Counter

with io.open('1.txt', 'r', encoding='utf8') as infh:
    words = infh.read()
    Counter(words)
    print Counter(words).most_common(10000)

Here is my 1.txt file:

വാര്‍ത്തകള്‍ വാര്‍ത്തകള്‍ വാര്‍ത്തകള്‍  വാര്‍ത്തകള്‍    വാര്‍ത്തകള്‍   വാര്‍ത്തകള്‍   വാര്‍ത്തകള്‍ വാര്‍ത്തകള്‍

It gives me character counts instead of word counts, like this:

[(u'\u0d4d', 63), (u'\u0d24', 42), (u'\u200d', 42), (u'\n', 26), (u' ', 21), (u'\u0d30', 21), (u'\u0d33', 21), (u'\u0d35', 21), (u'\u0d15', 21), (u'\u0d3e', 21)]

What is wrong with my code?

Answer 1

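Counter takes an iterable of elements in its constructor. infh.read() returns a single string, and iterating over a string yields individual characters, which is why you get character counts. Instead, you need to pass it a list of words: Counter(infh.read().split()).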
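To make the difference concrete, here is a minimal sketch. It uses an in-memory string as a stand-in for the contents of 1.txt (the `word` and `text` names are assumptions for illustration, not part of the question's code) and the `print(...)` call form.

```python
# -*- coding: utf-8 -*-
from collections import Counter

# Hypothetical sample standing in for the contents of 1.txt
word = u"വാര്‍ത്തകള്‍"
text = word + u" " + word + u"  " + word + u"\n"

# Iterating over a string yields characters, so this counts characters
char_counts = Counter(text)

# split() with no arguments splits on any run of whitespace,
# so this counts whole words
word_counts = Counter(text.split())

print(word_counts.most_common(10))
```

Calling split() with no arguments also absorbs the repeated spaces and newlines in the file, so no empty strings end up in the counts.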

If you want to write the counts to a file later:

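# words is the list of words, e.g. infh.read().split()
with open('file.txt', 'wb') as f:
    for word, count in Counter(words).most_common(10000):
        # The file is opened in binary mode, so encode each line explicitly
        f.write(u'{} {}\n'.format(word, count).encode('utf8'))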
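As an alternative to binary mode, the output file can be opened in text mode with an explicit encoding, the same way the question opens its input. The following is only a sketch: the word list, the temporary directory, and the file name `counts.txt` are hypothetical values chosen so the example runs on its own.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile
from collections import Counter

# Hypothetical word list, e.g. the result of infh.read().split()
words = [u"വാര്‍ത്തകള്‍", u"വാര്‍ത്തകള്‍", u"abc"]

# Hypothetical output location inside a temporary directory
out_path = os.path.join(tempfile.mkdtemp(), "counts.txt")

# In text mode with an encoding, io.open encodes each string on write
with io.open(out_path, "w", encoding="utf8") as f:
    for word, count in Counter(words).most_common():
        f.write(u"{} {}\n".format(word, count))

# Read the file back to check that the round trip preserved the text
with io.open(out_path, "r", encoding="utf8") as f:
    lines = f.read().splitlines()
```

With this approach the encoding is stated once when the file is opened, rather than at every write call.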
