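I have 100-200 text files with different names in a folder. I want to compare the text in these files with one another and save similar files together in a group.
Note: 1. The files are not identical. They are similar in that 2-3 lines of a paragraph are the same as in other files. 2. A file can belong to different groups, i.e. it can be saved in more than one group.
Can anyone help me? I am a beginner in Python.
I have tried the following code, but it does not work for me.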
file1=open("F1.txt","r")
file2=open("F2.txt","r")
file3=open("F3.txt","r")
file4=open("F4.txt","r")
file5=open("F5.txt","r")
list1=file1.readlines()
list2=file2.readlines()
list3=file3.readlines()
list4=file4.readlines()
list5=file5.readlines()
for line1 in list1:
for line2 in list2:
for line3 in list3:
for line3 in list4:
for line4 in list5:
if line1.strip() in line2.strip() in line3.strip() in line4.strip() in line5.strip():
print line1
file3.write(line1)
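Answer 0 (score: 0)
If I understand your goal correctly, you should iterate over all the text files in the directory and compare each one with every other one (in all possible pairs). The code would look something like this: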
import glob, os

nl = []  # Name list (the names of all files in the directory)
fl = []  # File list (the content of every file; each element is the list of lines of one file)

os.chdir("/libwithtextfiles")
for filename in glob.glob("*.txt"):  # Use glob to get all the files ending with '.txt'
    nl.append(filename)  # Append each file name in the directory to 'nl'
    f = open(filename, 'r')
    fl.append(f.readlines())  # Append each file's list of lines to 'fl'
    f.close()

for fname1 in nl:
    l1 = fl[nl.index(fname1)]
    if nl.index(fname1) == len(nl) - 1:  # We reached the last file
        break
    for fname2 in nl[nl.index(fname1) + 1:]:
        l2 = fl[nl.index(fname2)]
        # Here count the number of identical lines, using a counter,
        # then print it, write it to a file, or do whatever you want with it.
        # E.g. (according to what I understood from your code):
        for f1line in l1:
            for f2line in l2:
                if f1line == f2line:  # Why 'in' and not '=='?
                    # Increase some counter right here. A suggestion is a list of
                    # lists, where each inner list holds integers for one file
                    # (its index in list_of_lists matches the file's index in 'nl').
                    # The first integer is the number of identical lines between
                    # that file and the one following it (+1), the next integer is
                    # the count for the file after that (+2), and so on.
                    # Long story short: with 0-based j, list_of_lists[i][j] is the
                    # number of identical lines between file i and file i+j+1.
                    pass
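The counter idea above can be sketched as follows. This is only one possible reading of it, not the answer's exact method: it swaps the nested line loops for set intersection (so it counts distinct shared lines rather than matching line pairs), and the names `shared_line_counts`, `group_similar`, and the `min_shared` threshold are hypothetical, the threshold being chosen to reflect the "2-3 identical lines" mentioned in the question.

```python
from itertools import combinations


def shared_line_counts(contents):
    """Map each pair of file names to the number of distinct lines they share.

    `contents` (hypothetical name) maps a file name to its list of lines.
    """
    # Strip whitespace and drop blank lines so empty lines never count as matches.
    line_sets = {
        name: {line.strip() for line in lines if line.strip()}
        for name, lines in contents.items()
    }
    # Visit every unordered pair of files exactly once.
    return {
        (a, b): len(line_sets[a] & line_sets[b])
        for a, b in combinations(sorted(line_sets), 2)
    }


def group_similar(contents, min_shared=2):
    """For each file, list the files sharing at least `min_shared` lines with it.

    `min_shared` is an assumed threshold, not something the question fixes.
    A file may appear in several groups.
    """
    groups = {name: {name} for name in contents}
    for (a, b), count in shared_line_counts(contents).items():
        if count >= min_shared:
            groups[a].add(b)
            groups[b].add(a)
    # Drop groups that contain nothing but the file itself.
    return {
        name: sorted(members)
        for name, members in groups.items()
        if len(members) > 1
    }
```

Here `contents` could be built from the `nl` and `fl` lists, for example with `dict(zip(nl, fl))`, and each resulting group could then be written to its own folder or report.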
Note that your code does not use loops where they would apply: you could have a single list named l instead of line1 - line5.
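As an illustration of that point, here is a minimal sketch that reads any number of files into one list with a loop instead of using five separate variables. The helper name `read_all_lines` is hypothetical, and the file names in the usage note are simply the ones from the question.

```python
def read_all_lines(paths):
    """Return one list of lines per file, in the same order as `paths`.

    `read_all_lines` is a hypothetical helper, not part of any library.
    """
    all_lines = []
    for path in paths:
        # `with` closes each file automatically, even if reading fails.
        with open(path, "r") as handle:
            all_lines.append(handle.readlines())
    return all_lines
```

With this, `l = read_all_lines(["F1.txt", "F2.txt", "F3.txt", "F4.txt", "F5.txt"])` gives `l[0]` through `l[4]` in place of `list1` through `list5`, and the same call works unchanged for 200 files.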
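Other than that, your code is not clear at all. I assume the missing indentation (for line2 in list2: should be indented, along with everything after it) and for line3 in list3: for line3 in list4: #using line3 twice are accidental and happened while copying the code to this site. Are you comparing every line with every line in the other files?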
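You should, as my comment in the code suggests, have a counter for the number of files in which a line recurs, using one for loop with another nested inside it that iterates over the lines and compares just two files at a time rather than all five. With all five loops nested, even with only 5 files of 10 lines each, you iterate 100,000 (10**5) times, whereas with my method you only have 1000 iterations in that case, which is 100 times more efficient.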
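The iteration counts can be checked with a few lines of arithmetic. This is only an illustration of the numbers quoted above; the variable names are made up for the example.

```python
from itertools import combinations

n_files = 5
lines_per_file = 10

# Five nested loops, one per file: one line from each file in every combination.
nested_iterations = lines_per_file ** n_files

# Pairwise comparison: every unordered pair of files, every line against every line.
n_pairs = len(list(combinations(range(n_files), 2)))
pairwise_iterations = n_pairs * lines_per_file * lines_per_file

print(nested_iterations, pairwise_iterations, nested_iterations // pairwise_iterations)
# prints: 100000 1000 100
```

The gap widens quickly with more files: the nested version grows exponentially in the number of files, while the pairwise version grows only quadratically.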
Answer 1 (score: 0)
You can use this code to check for similar lines between files:
DS.blit(catImg, catImgRectObj)