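Hello, I am learning Python and, out of curiosity, I wrote a program to remove the extra words from a file. I compare the text in 'text1.txt' and 'text2.txt', and based on the text in text1 I remove the extra words from text2.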
#!/bin/python
text1 = open('text1.txt', 'r')
text2 = open('text2.txt', 'r')
t_l1 = text1.readlines()
t_l2 = text2.readlines()
# printing to check if the file contents were read properly.
print(' Printing the file 1 contents:')
w_t1 = []
for i in range(len(t_l1)):
    # extend (not assign) so that words from every line are kept
    w_t1.extend(t_l1[i].split(' '))
for j in range(len(w_t1)):
    print(w_t1[j])
# printing to see if the contents were read properly.
print('File 2 contents:')
w_t2 = []
for i in range(len(t_l2)):
    w_t2.extend(t_l2[i].split(' '))
for j in range(len(w_t2)):
    print(w_t2[j])
print('comparing and deleting the excess variables.')
# I put all words of file1 in list w_t1 and file2 in list w_t2. Now I am checking if
# each word in w_t1 is the same as the word in the same place of w_t2; if not, I am
# deleting that word in w_t2 and continuing the while loop.
w = []  # collects the extra words removed from w_t2
i = 1
while i <= len(w_t1) and i <= len(w_t2):
    if w_t1[i-1] == w_t2[i-1]:
        print(w_t1[i-1])
        i += 1
    else:
        w.append(str(w_t2[i-1]))
        # delete by position; remove() would delete the first equal word instead
        del w_t2[i-1]
        # i is left unchanged so the next word of w_t2 is compared
print('The extra words are: ' + str(w) + '\n')
print(w)
print('The original words are: ' + str(w_t2) + '\n')
print('The extra values are: ')
for item in w:
    print(item)
# opening the file out.txt to write the output.
out = open('out.txt', 'w')
out.write(str(w))
# I am closing the files
text1.close()
text2.close()
out.close()
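Say the text1.txt file contains the words "Happy birthday dear friend" and text2.txt contains the words "Happy claps birthday to you my dear Best friend".
The program should output the extra words in text2.txt, which are "claps, to, you, my, Best".
The program above works fine, but what if I have to do this for files containing millions of words or millions of lines? Checking every word one by one does not seem like a good idea. Does Python have any predefined functions for this?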
P.S.: If this is a bad question, please bear with me; I am still learning Python. Pretty soon I will stop asking questions like this.
Answer 0 (score: 3)
This looks like a 'set' problem. First add the words to set structures:
textSet1 = set()
with open('text1.txt', 'r') as text1:
    for line in text1:
        # split() with no argument also strips the trailing newline
        for word in line.split():
            textSet1.add(word)
textSet2 = set()
with open('text2.txt', 'r') as text2:
    for line in text2:
        for word in line.split():
            textSet2.add(word)
Then simply apply the set difference operation:
textSet2.difference(textSet1)
which gives you this result:
set(['claps', 'to', 'you', 'my', 'Best'])
You can get a list from the previous structure this way:
list(textSet2.difference(textSet1))
['claps', 'to', 'you', 'my', 'Best']
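To make the set difference concrete, here is a minimal in-memory sketch written for Python 3. The two sentences are assumed sample inputs chosen to match the extra words listed in the question, and the names `text1_words`, `text2_words`, `extra` and `extra_sorted` are illustrative only. Sets are unordered, so the result is sorted purely to get a reproducible display order.

```python
# Assumed sample inputs; the real data would come from text1.txt and text2.txt.
text1_words = "Happy birthday dear friend".split()
text2_words = "Happy claps birthday to you my dear Best friend".split()

# Words that occur in text2 but not in text1 (equivalent to .difference()).
extra = set(text2_words) - set(text1_words)

# Sets have no defined order, so sort only for a stable display.
extra_sorted = sorted(extra)
print(extra_sorted)
```

Note that a set drops duplicates and the original word order; if either matters, a list-based pass over the second file is the better fit.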
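Then, as you can read here, you should not worry about large file sizes, because with the given loader:
When the next line is read, the previous one will be garbage collected unless you have stored a reference to it somewhere else.
More information about lazy file loading is available here.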
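The line-by-line behaviour described above can be sketched with a small generator, written for Python 3. The helper name `iter_words` and the temporary demo file are assumptions made for illustration; in practice the path would point at the large text file.

```python
import os
import tempfile


def iter_words(path):
    """Yield words one at a time; only the current line is held in memory."""
    with open(path, 'r') as handle:
        for line in handle:            # the file object yields lines lazily
            for word in line.split():
                yield word


# Demo on a small temporary file standing in for a large text2.txt.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write("Happy claps birthday\nto you my dear Best friend\n")
    demo_path = tmp.name

# list() materialises everything and is used here only to show the result;
# for a huge file you would loop over iter_words(path) directly.
words = list(iter_words(demo_path))
os.remove(demo_path)
print(words)
```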
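Finally, in a real problem, I would expect the first set (the bad words) to be relatively small and the second to hold a huge amount of data. If that is the case, you can avoid creating the second set: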
diff = []
with open('text2.txt', 'r') as text2:
    for line in text2:
        for word in line.split():
            # keep only the words that are missing from the first set
            if word not in textSet1:
                diff.append(word)
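As a rough sketch of how this streaming approach could be packaged for reuse (Python 3), the hypothetical function `extra_words` below takes the small reference set plus any iterable of lines, so it works with an open file as well as with the in-memory `io.StringIO` objects used here as stand-ins for text1.txt and text2.txt. The sample sentences are assumed inputs.

```python
import io


def extra_words(reference_words, lines):
    """Return the words from `lines` that are not in `reference_words`, in order."""
    reference = set(reference_words)   # small set gives fast membership tests
    extras = []
    for line in lines:                 # an open file or any iterable of lines
        for word in line.split():
            if word not in reference:
                extras.append(word)
    return extras


# In-memory stand-ins for the two files (assumed sample contents).
text1 = io.StringIO("Happy birthday dear friend\n")
text2 = io.StringIO("Happy claps birthday to you my dear Best friend\n")

reference = set(text1.read().split())
result = extra_words(reference, text2)
print(result)
```

Unlike the set difference, this keeps the extra words in their original order and keeps repeats, at the cost of a result list that grows with the number of extra words.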