Consolidating multiple text files and grouping similar files together

Time: 2016-01-15 07:33:02

Tags: python

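I have 100-200 text files with different names in a folder. I want to compare the text in the files with each other and save similar files together in a group.

Notes: 1. The files are not identical. They are similar in that 2-3 lines of a paragraph are the same as in other files. 2. A file can be kept in a different group, and it can also be kept in more than one group.

Can anyone help me? I am a beginner in Python.

I have tried the code below, but it doesn't work for me.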

file1=open("F1.txt","r")
file2=open("F2.txt","r")
file3=open("F3.txt","r")
file4=open("F4.txt","r")
file5=open("F5.txt","r")
list1=file1.readlines()
list2=file2.readlines()
list3=file3.readlines()
list4=file4.readlines()
list5=file5.readlines()
for line1 in list1:
for line2 in list2:
    for line3 in list3:
        for line3 in list4:
            for line4 in list5:
                if line1.strip() in line2.strip() in line3.strip() in line4.strip() in line5.strip():
                    print line1
                    file3.write(line1)

2 Answers:

Answer 0 (score: 0)

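If I understand your goal, you should iterate over all the text files in the directory and compare each text file with every other one (in all possible combinations). The code would look something like this: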

import glob, os
nl = [] #Name list (containing the names of all files in the directory)
fl = [] #File list (containing the content of all files in the directory, each element in this list is a list of strings - the list of lines in a file)
os.chdir("/libwithtextfiles")
for filename in glob.glob("*.txt"): #Using glob to get all the files ending with '.txt'
    nl.append(filename) #Appending all the filenames in the directory to 'nl'
    f = open(filename, 'r')
    fl.append(f.readlines()) #Appending all of the lists of line to 'fl'
    f.close()
for fname1 in nl:
    l1 = fl[nl.index(fname1)]
    if nl.index(fname1) == len(nl) - 1: #We reached the last file
        break
    for fname2 in nl[nl.index(fname1) + 1:]:
        l2 = fl[nl.index(fname2)]
        #Here compare the amount of lines identical, use a counter
        #then print it, or output to a file or do whatever you want
        #with it
        #e.g (according to what I understood from your code)
        for f1line in l1:
            for f2line in l2:
                if f1line == f2line: #Why 'in' and not '=='?
                    """
                    have some counter increase right here, a suggestion is having
                    a list of lists, where the first element is 
                    a list that contains integers
                    the first integer is the number of lines found identical 
                    between the file (index in list_of_lists is corresponding to the name in that index in 'nl') 
                    and the one following it (index in list_of_lists + 1)
                    the next integer is the number of lines identical between the same file
                    and the one following the one following it (+2 this time), etc.

                    Long story short: list_of_lists[i][j] is the number of lines identical 
                    between the 'i'th file and the 'i+j'th one.
                    """
                    pass
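
To make the counter idea concrete, here is a minimal sketch in Python 3 that counts the identical lines for every pair of files and then builds, for each file, the group of files that share enough lines with it. The function names (`read_lines`, `count_shared_lines`, `group_similar`, `group_directory`), the `min_shared` threshold of 2, and the choice to ignore blank lines, surrounding whitespace and duplicate lines within a file are all assumptions made for illustration, not part of the code above.

```python
import glob
import os
from itertools import combinations


def read_lines(path):
    """Return the set of non-blank, whitespace-stripped lines in a file."""
    with open(path, "r") as f:
        return {line.strip() for line in f if line.strip()}


def count_shared_lines(contents):
    """Map each pair of names to the number of distinct lines they share.

    `contents` maps a file name to its set of lines.
    """
    return {
        (a, b): len(contents[a] & contents[b])
        for a, b in combinations(sorted(contents), 2)
    }


def group_similar(contents, min_shared=2):
    """Build one group per file: the file plus every file sharing at least
    `min_shared` lines with it. A file may appear in several groups."""
    groups = {name: {name} for name in contents}
    for (a, b), shared in count_shared_lines(contents).items():
        if shared >= min_shared:
            groups[a].add(b)
            groups[b].add(a)
    # Keep only the groups that contain more than one file
    return {name: sorted(g) for name, g in groups.items() if len(g) > 1}


def group_directory(directory, min_shared=2):
    """Read every .txt file in `directory` and group the similar ones."""
    paths = glob.glob(os.path.join(directory, "*.txt"))
    contents = {os.path.basename(p): read_lines(p) for p in paths}
    return group_similar(contents, min_shared)
```

Calling `group_directory("/libwithtextfiles")` would return a dictionary mapping each file name to the sorted list of files in its group. Because every file gets its own group, a file that resembles two otherwise unrelated files shows up in both of their groups, which matches note 2 in the question.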

Note that your code does not use loops where they would apply; you could have a single list named l instead of line1 - line5.
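
For instance, the five separate open/readlines pairs could collapse into one loop that fills a single list. This is a minimal sketch in Python 3; the helper name `read_all` is hypothetical.

```python
def read_all(filenames):
    """Return a list with one entry per file: that file's list of lines."""
    contents = []
    for name in filenames:
        # 'with' closes each file even if reading fails
        with open(name, "r") as f:
            contents.append(f.readlines())
    return contents
```

With the file names from the question, `l = read_all(["F%d.txt" % i for i in range(1, 6)])` would put the lines of F1.txt in `l[0]`, the lines of F2.txt in `l[1]`, and so on.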

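Beyond that, your code is not clear at all. I assume the missing indentation (for line2 in list2: should be indented, along with everything after it) and for line3 in list3: for line3 in list4: #using line3 twice are accidents that happened while copying the code to this site. Are you comparing every line with every line in the other files?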

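As my comments in the code suggest, you should keep a counter of the number of files in which a line is repeated, by nesting one for loop inside another, iterating over the lines and comparing only two files at a time rather than all five. Even with 5 files of 10 lines each, five nested loops run 10**5 = 100,000 iterations, whereas with my approach you have only 1,000 iterations in that case, which is 100 times more efficient.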
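
The arithmetic behind that comparison can be checked with a few lines of Python 3; the variable names are illustrative only.

```python
files = 5
lines_per_file = 10

# Five nested loops, one per file: every combination of one line from each file
nested = lines_per_file ** files

# Pairwise approach: number of unordered file pairs, times one
# line-by-line comparison of the two files in each pair
pairs = files * (files - 1) // 2
pairwise = pairs * lines_per_file * lines_per_file

print(nested, pairwise, nested // pairwise)  # prints: 100000 1000 100
```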

Answer 1 (score: 0)

You can use this code to check for similar lines between files:

DS.blit(catImg, catImgRectObj)
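
As one possible way to check for similar (not only identical) lines between two files, the following Python 3 sketch uses the standard-library `difflib.SequenceMatcher` and treats two lines as similar when their ratio reaches a cutoff. The helper name `similar_lines` and the 0.8 cutoff are assumptions for illustration, not this answer's own code.

```python
from difflib import SequenceMatcher


def similar_lines(lines_a, lines_b, cutoff=0.8):
    """Return (line_a, line_b) pairs whose similarity ratio is >= cutoff."""
    matches = []
    for a in lines_a:
        a = a.strip()
        if not a:
            continue  # skip blank lines in the first file
        for b in lines_b:
            b = b.strip()
            # skip blank lines in the second file, then compare
            if b and SequenceMatcher(None, a, b).ratio() >= cutoff:
                matches.append((a, b))
    return matches
```

The two arguments are lists of lines, for example the results of `readlines()` on two open files. A cutoff of 1.0 keeps only lines that are identical after stripping whitespace; lower values admit lines that differ by a few characters.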