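I am working with a very large text file (500MB+). My code produces the right output, but I get a lot of duplicates. What I want to do is check whether an entry is already in the output file before writing it. I'm sure it is just one line in an if statement, but I don't know Python well and can't figure out the syntax. Any help would be appreciated.
Here is the code: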
authorList = ['Shakes.','Scott']
with open('/Users/Adam/Desktop/Poetrylist.txt','w') as output_file:
    with open('/Users/Adam/Desktop/2e.txt','r') as open_file:
        the_whole_file = open_file.read()
        for x in authorList:
            start_position = 0
            while True:
                start_position = the_whole_file.find('<A>'+x+'</A>', start_position)
                if start_position < 0:
                    break
                end_position = the_whole_file.find('</W>', start_position)
                output_file.write(the_whole_file[start_position:end_position+4])
                output_file.write("\n")
                start_position = end_position + 4
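Answer 0 (score: 1)
I suggest you simply keep track of which author data you have already seen, and only write it out if you have not seen it before. You can use a dict to keep track.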
authorList = ['Shakes.','Scott']
already_seen = {} # dict to keep track of what has been seen
with open('/Users/Adam/Desktop/Poetrylist.txt','w') as output_file:
    with open('/Users/Adam/Desktop/2e.txt','r') as open_file:
        the_whole_file = open_file.read()
        for x in authorList:
            start_position = 0
            while True:
                start_position = the_whole_file.find('<A>'+x+'</A>', start_position)
                if start_position < 0:
                    break
                end_position = the_whole_file.find('</W>', start_position)
                author_data = the_whole_file[start_position:end_position+4]
                if author_data not in already_seen:
                    output_file.write(author_data + "\n")
                    already_seen[author_data] = True
                start_position = end_position + 4
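For what it's worth, the same idea can be pulled out into a function that works on a string in memory, which makes it easy to try on a small sample first. This is only a sketch: the function name `extract_unique_records`, the `end_tag` parameter and the sample text are made up for illustration, and it uses a set instead of a dict because only membership matters. It also stops when a closing `</W>` is missing, since `find` returns -1 in that case.

```python
def extract_unique_records(text, authors, end_tag='</W>'):
    """Return each distinct '<A>author</A> ... </W>' span once, in order of first appearance."""
    seen = set()    # records already collected
    records = []
    for author in authors:
        start_tag = '<A>' + author + '</A>'
        pos = 0
        while True:
            pos = text.find(start_tag, pos)
            if pos < 0:
                break
            end = text.find(end_tag, pos)
            if end < 0:
                break  # unterminated record: stop rather than slice with -1
            record = text[pos:end + len(end_tag)]
            if record not in seen:
                seen.add(record)
                records.append(record)
            pos = end + len(end_tag)
    return records


# Hypothetical sample data, not taken from the real 2e.txt
sample = '<A>Scott</A><W>one</W><A>Shakes.</A><W>two</W><A>Scott</A><W>one</W>'
unique = extract_unique_records(sample, ['Shakes.', 'Scott'])
```

Writing the result out is then just a loop over `unique` with `output_file.write(record + "\n")`.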
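Answer 1 (score: 0)
Create a list containing every string you are going to write. Before you append an item to it, first check whether that item is already in the list.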
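A minimal sketch of that check might look like the following. The helper name `append_if_new` and the sample strings are hypothetical, chosen only to show the pattern.

```python
def append_if_new(items, candidate):
    """Append candidate to items only if it is not already there; return True if appended."""
    if candidate in items:
        return False
    items.append(candidate)
    return True


# Hypothetical strings standing in for the extracted records
to_write = []
for text in ['alpha', 'beta', 'alpha', 'gamma', 'beta']:
    append_if_new(to_write, text)
```

Note that `in` on a list scans the whole list each time, so with a 500MB+ input and many records a set would probably be noticeably faster for the membership test.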
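Answer 2 (score: 0)
My understanding is that, when writing to output_file, you want to skip the lines in open_file that contain an author's name. If that is what you intend, do it like this.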
authorList = ['Shakes.','Scott']
with open('/Users/Adam/Desktop/Poetrylist.txt','w') as output_file:
    with open('/Users/Adam/Desktop/2e.txt','r') as open_file:
        for line in open_file:
            skip = 0
            for author in authorList:
                if author in line:
                    skip = 1
            if not skip:
                output_file.write(line)
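The skip flag can also be written with `any()`, which some people find easier to read. This is a sketch of the same line filter as a generator; the name `lines_without_authors` and the sample lines are assumptions for illustration.

```python
def lines_without_authors(lines, authors):
    """Yield only the lines that contain none of the author names."""
    for line in lines:
        if not any(author in line for author in authors):
            yield line


# Hypothetical input lines; a real file object can be passed the same way
sample_lines = ['<A>Scott</A> Marmion\n', 'an anonymous line\n', '<A>Shakes.</A> Hamlet\n']
kept = list(lines_without_authors(sample_lines, ['Shakes.', 'Scott']))
```

Because it is a generator and a file object iterates line by line, passing `open_file` directly should keep memory use low even for a very large file.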
Answer 3 (score: 0)
I think you should process the file with the proper tool for handling text: regular expressions.
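import re

# DOTALL lets '.' cross newlines, so a record may span several lines
regx = re.compile(r'<A>(.+?)</A>.*?<W>.*?</W>', re.DOTALL)

# Text mode, so that str patterns and '\n' also work on Python 3
with open('/Users/Desktop/2e.txt','r') as open_file,\
     open('/Users/Desktop/Poetrylist.txt','w') as output_file:
    remain = ''   # unmatched tail carried over from the previous chunk
    seen = set()  # author names already written
    while True:
        chunk = open_file.read(65536) # 65536 == 16 x 16 x 16 x 16
        if not chunk:
            break
        text = remain + chunk
        last_end = 0  # stays 0 when this chunk yields no match
        for mat in regx.finditer(text):
            # Keyed on the author name; use mat.group() to deduplicate whole records
            if mat.group(1) not in seen:
                output_file.write(mat.group() + '\n')
                seen.add(mat.group(1))
            last_end = mat.end()
        remain = text[last_end:]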
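To check that records split across a chunk boundary are still found, the chunked scan can be tried on an in-memory stream with a deliberately tiny chunk size. This is a sketch under a few assumptions: the generator name `unique_records` and the sample text are made up, `io.StringIO` stands in for the real file, and duplicates are keyed on the whole record text rather than on the author name.

```python
import io
import re

# Same pattern shape as above; DOTALL so a record may span several lines
RECORD = re.compile(r'<A>(.+?)</A>.*?<W>.*?</W>', re.DOTALL)


def unique_records(stream, chunk_size=65536):
    """Yield each distinct record found in stream, reading it chunk by chunk."""
    remain = ''   # unmatched tail carried over to the next chunk
    seen = set()  # full record texts already yielded
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        text = remain + chunk
        last_end = 0
        for mat in RECORD.finditer(text):
            record = mat.group()
            if record not in seen:
                seen.add(record)
                yield record
            last_end = mat.end()
        remain = text[last_end:]


# Hypothetical sample; chunk_size=10 forces every record to straddle chunks
sample = io.StringIO(
    '<A>Scott</A><W>one</W>\n<A>Scott</A><W>one</W>\n<A>Scott</A><W>two</W>\n'
)
records = list(unique_records(sample, chunk_size=10))
```

One thing to keep in mind is that `remain` keeps growing while no complete record is found, so input with very long stretches of non-matching text would still use a lot of memory.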