I have two Python files that count words and their frequencies.
# File 1: read the whole file into memory, then count
import io
import collections
import codecs
from collections import Counter
with io.open('JNb.txt', 'r', encoding='utf8') as infh:
    words = infh.read().split()
    with open('e1.txt', 'a') as f:
        for word, count in Counter(words).most_common(10):
            f.write(u'{} {}\n'.format(word, count).encode('utf8'))
# File 2: read the file line by line
import io
import collections
import codecs
from collections import Counter
with io.open('JNb.txt', 'r', encoding='utf8') as infh:
    for line in infh:
        words = line.split()
        with open('e1.txt', 'a') as f:
            for word, count in Counter(words).most_common(10):
                f.write(u'{} {}\n'.format(word, count).encode('utf8'))
No output is produced.
The code contains no syntax errors.
Output:
താത്കാലിക 1
- 1
ഒഴിവ് 1
അധ്യാപക 1
വാര്ത്തകള് 1
ആലപ്പുഴ 1
ഇന്നത്തെപരിപാടി 1
വിവാഹം 1
അമ്പലപ്പുഴ 1
The actual file contains 100 such words.
I'm not printing anything; I'm writing to a file (e1).
Update: I tried another version and got results:
import io
import collections
import codecs
from collections import Counter
with io.open('JNb.txt', 'r', encoding='utf8') as infh:
    words = infh.read().split()
    with open('file.txt', 'wb') as f:
        for word, count in Counter(words).most_common(10000000):
            f.write(u'{} {}\n'.format(word, count).encode('utf8'))
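With 4 GB of RAM it can count files of up to 2 GB.
What is the problem here?
Answer 0: (score: 2)
I coded up the task; here is my solution.
I tested the program on a 5.1 GB text file. It finished in about 20 minutes on an MBP6.2.
If anything is confusing or you have suggestions, let me know. Good luck.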
from collections import Counter
import io
import sys
cnt = Counter()
if len(sys.argv) < 2:
    print("Provide an input file as argument")
    sys.exit()
try:
    # Read line by line so the whole file never has to fit in memory
    with io.open(sys.argv[1], 'r', encoding='utf-8') as f:
        for line in f:
            for word in line.split():
                cnt[word] += 1
except FileNotFoundError:
    print("File not found")
# Print count, share of all words, and the word for the 30 most common words
with sys.stdout as f:
    total_word_count = sum(cnt.values())
    for word, count in cnt.most_common(30):
        f.write('{: < 6} {:<7.2%} {}\n'.format(
            count, count / total_word_count, word))
Output:
~ python countword.py CSW07.txt
79619 4.58% [n]
63717 3.67% a
56783 3.27% of
42341 2.44% to
40156 2.31% the
39295 2.26% [v]
38231 2.20% [n
36592 2.11% -S]
35250 2.03% or
17113 0.98% in
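Answer 1: (score: 0)
You are counting the words of each line separately. Try reading the whole file, splitting it into words, and making a single Counter call.
Edit: if you don't have enough memory to read the whole file, but enough to store all the distinct words: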
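import io
from collections import Counter
def count(file):
    cnt = Counter()
    # Iterate over the file object so only one line is in memory at a time
    # (readlines() would load the whole file at once)
    with io.open(file, 'r', encoding='utf-8') as f:
        for line in f:
            # split() with no argument also drops the trailing newline
            for word in line.split():
                cnt[word] += 1
    return cnt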
Now take the returned counter and write whatever data you want to a file.
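As a rough sketch of that last step, the helper below writes a counter to a UTF-8 file, most frequent words first. The name `write_counts`, its `limit` parameter, and the demo file name are made up for illustration, and the small hand-built counter only stands in for the result of `count('JNb.txt')`.

```python
import io
import os
import tempfile
from collections import Counter


def write_counts(counter, out_path, limit=None):
    """Write 'word count' lines to out_path as UTF-8, most frequent first."""
    # most_common(None) returns every entry, sorted by descending count.
    with io.open(out_path, 'w', encoding='utf-8') as out:
        for word, n in counter.most_common(limit):
            out.write(u'{} {}\n'.format(word, n))


# Hypothetical demo data standing in for count('JNb.txt').
sample = Counter([u'ആലപ്പുഴ', u'വിവാഹം', u'ആലപ്പുഴ'])
demo_path = os.path.join(tempfile.mkdtemp(), 'counts_demo.txt')
write_counts(sample, demo_path)
```

Opening the output with an explicit encoding avoids calling `.encode('utf8')` on every line by hand.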
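Answer 2: (score: 0)
You need to read each line, split it into words, and then update a single counter. Otherwise you are just counting each line separately. This works even if the file is very large, because you process it line by line and store only the distinct words.
Try this version instead: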
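import collections
import io
c = collections.defaultdict(int)
with io.open('somefile.txt', encoding='utf-8') as f:
    for line in f:
        if len(line.strip()):
            # split() without arguments handles tabs, repeated spaces and the trailing newline
            for word in line.split():
                c[word] += 1
with io.open('out.txt', 'w', encoding='utf-8') as f:
    # items() works on Python 2 and 3; iteritems() exists only on Python 2
    for word, count in c.items():
        f.write(u'{} {}\n'.format(word, count))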
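To make the difference between per-line counting and one accumulated counter concrete, here is a small sketch on made-up input; the three sample lines are invented for illustration and are not taken from the question's file.

```python
from collections import Counter

# Invented sample input, one string per line of a file.
lines = [u'a b a', u'b c', u'a']

# Per-line counting: each Counter only ever sees one line,
# so a word that appears once per line always gets count 1.
per_line = [Counter(line.split()) for line in lines]

# Accumulated counting: one Counter updated with every line's words.
total = Counter()
for line in lines:
    total.update(line.split())

print(per_line[2].most_common())  # [('a', 1)]
print(total.most_common(2))       # [('a', 3), ('b', 2)]
```

Only the accumulated counter reflects the frequencies over the whole file, and it still holds just the distinct words in memory.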