I am trying to count the words whose length is between 1 and 5. The file is about 4 GB, and in the end I get a memory error.
import os
files = os.listdir('C:/Users/rram/Desktop/')
for file_name in files:
    file_path = "C:/Users/rram/Desktop/"+file_name
    f = open (file_path, 'r')
    text = f.readlines()
    update_text = ''
    wordcount = {}
    for line in text:
        arr = line.split("|")
        word = arr[13]
        if 1<=len(word)<6:
            if word not in wordcount:
                wordcount[word] = 1
            else:
                wordcount[word] += 1
        update_text+= '|'.join(arr)
    print (wordcount) #print update_text
    print 'closing', file_path, '\t', 'total files' , '\n\n'
    f.close()
In the end, I get a MemoryError at this line:
text = f.readlines()
Can you help me optimize it?
Answer 0 (score: 3)
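As suggested in the comments, you should read the file line by line instead of loading the whole file into memory at once.
For example: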
count = 0
with open('words.txt', 'r') as f:
    for line in f:
        for word in line.split():
            if 1 <= len(word) <= 5:
                count = count + 1
print(count)
Edit:
If you only want to count the words in the 14th column and split on "|" instead:
count = 0
with open('words.txt', 'r') as f:
    for line in f:
        iterator = 0  # index of the current field within the line
        for word in line.split("|"):
            if 1 <= len(word) <= 5 and iterator == 13:
                count = count + 1
            iterator = iterator + 1
print(count)
Note that you should avoid writing this
arr = line.split("|")
word = arr[13]
because a line may contain fewer than 14 fields, in which case arr[13] raises an IndexError.
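
If you need a per-word tally like the wordcount dictionary in the question rather than a single total, the same line-by-line approach can be sketched as below. This is only an illustration under some assumptions: the function name count_short_words and its parameters are hypothetical, short lines are skipped rather than reported, and the trailing newline is stripped so that it is not counted as part of the last field.

```python
from collections import Counter


def count_short_words(path, column=13, sep="|", min_len=1, max_len=5):
    """Tally the values of one column whose length is in [min_len, max_len].

    Hypothetical helper: the name and parameters are illustrative only.
    """
    counts = Counter()
    with open(path, "r") as f:
        for line in f:  # the file object yields one line at a time
            # Assumption: drop the newline so it is not part of the last field.
            fields = line.rstrip("\n").split(sep)
            if len(fields) <= column:
                continue  # short line: skip it instead of raising IndexError
            word = fields[column]
            if min_len <= len(word) <= max_len:
                counts[word] += 1
    return counts
```

Calling count_short_words(file_path) for each file in the directory keeps only one line and the tally in memory at a time, so memory use depends on the number of distinct short words rather than on the file size.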