I have some fairly large text files (> 2 GB) that I want to process word by word. The files are space-separated text files with no line breaks (all the words are on a single line). I want to take each word, test whether it is a dictionary word (using enchant), and if so, write it to a new file.
Here is the code I have right now:
import enchant

d = enchant.Dict("en_US")  # the dictionary used by d.check below

with open('big_file_of_words', 'r') as in_file:
    with open('output_file', 'w') as out_file:
        words = in_file.read().split(' ')  # reads the whole >2 GB file into memory at once
        for word in words:
            if d.check(word):
                out_file.write("%s " % word)
I looked at lazy method for reading big file in python, which suggests using yield to read the data in chunks, but I am worried that chunks of a predetermined size will split words down the middle. Basically, I want the chunks to be only approximately the specified size and to split only on spaces. Any suggestions?
Answer 0 (score: 5):
Combine the last word of one chunk with the first word of the next:
def read_words(filename):
    last = ""
    with open(filename) as inp:
        while True:
            buf = inp.read(10240)
            if not buf:
                break
            # The chunk may end in the middle of a word, so hold the last
            # (possibly partial) word back and prepend it to the next chunk.
            words = (last + buf).split()
            last = words.pop()
            for word in words:
                yield word
    if last:
        yield last  # emit the final held-back word (if the file was not empty)
with open('output.txt', 'w') as output:
    for word in read_words('input.txt'):
        if d.check(word):  # the enchant dictionary from the question
            output.write("%s " % word)
Answer 1 (score: 1):
You can do something along the lines of the answers to the related questions on re and mmap, for example:
import mmap
import re

with open('big_file_of_words', 'rb') as in_file, open('output_file', 'w') as out_file:
    mf = mmap.mmap(in_file.fileno(), 0, access=mmap.ACCESS_READ)
    # the mmap is bytes-like, so the pattern must be bytes too
    for match in re.finditer(rb'\w+', mf):
        word = match.group().decode()
        # do something with the word, e.g. the enchant check from the
        # question (assumes the dictionary d defined there):
        if d.check(word):
            out_file.write("%s " % word)
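The advantage of mmap here is that the operating system pages the file in on demand, so the multi-gigabyte file is never loaded into memory all at once, while re.finditer scans it one match at a time. Note that \w+ applied to raw bytes only matches ASCII word characters; text in other encodings would need to be decoded first.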
Answer 2 (score: 0):
Chunk the file with a generator and yield the words one at a time; a word that spans two chunks is handled as well. (The snippet below is wrapped in a generator function, with a name of our choosing, words_from, so that yield and return are valid; input_file is the open file object.)
def words_from(input_file):
    line = ''
    while True:
        word, space, line = line.partition(' ')
        if space:
            # A word was found
            yield word
        else:
            # A word was not found; read a chunk of data from the file
            next_chunk = input_file.read(1000)
            if next_chunk:
                # Add the chunk to our line
                line = word + next_chunk
            else:
                # No more data; yield the last word and return
                yield word.rstrip('\n')
                return
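For completeness, a minimal sketch of wiring this generator up to the question's enchant check (words_from is the hypothetical wrapper name introduced above; file names are taken from the question). The empty-string guard is there because runs of consecutive spaces make the generator yield empty words, which enchant's check() rejects:

import enchant

d = enchant.Dict("en_US")

with open('big_file_of_words') as input_file, open('output_file', 'w') as out_file:
    for word in words_from(input_file):
        if word and d.check(word):  # skip empty strings from runs of spaces
            out_file.write("%s " % word)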