Compare multiple text files and save the common values

Time: 2018-08-07 12:33:20

Tags: python

My actual code:

import os, os.path

DIR_DAT = "dat"
DIR_OUTPUT = "output"
filenames = []

# create the output folder if it doesn't exist yet
if not os.path.exists(DIR_OUTPUT):
    os.makedirs(DIR_OUTPUT)

# keep only the keys with empty values from the different contracts
for root, dirs, files in os.walk(DIR_DAT):
    for filename in files:
        # join with root so files in subfolders of DIR_DAT are found too
        filename_input = os.path.join(root, filename)
        filename_output = os.path.join(DIR_OUTPUT, os.path.splitext(filename)[0] + ".txt")
        filenames.append(filename_output)

        with open(filename_input) as infile, open(filename_output, "w") as outfile:
            for line in infile:
                if not line.strip().split("=")[-1]:
                    outfile.write(line)

# create a single file from all contracts; note that it only holds the keys whose values are empty
with open(DIR_OUTPUT + "/all_agreements.txt", "w") as outfile:
    for fname in filenames:
        with open(fname) as infile:
            for line in infile:
                outfile.write(line)

# final file with the empty keys that the contracts have in common:
# a line is written every time it has already been seen before
with open(DIR_OUTPUT + "/all_agreements.txt") as infile, open(DIR_OUTPUT + "/results.txt", "w") as outfile:
    seen = set()
    for line in infile:
        line_lower = line.lower()
        if line_lower in seen:
            outfile.write(line)
        else:
            seen.add(line_lower)

print("Psst go check in the output folder ;)")

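The last lines of my code check whether an element exists multiple times. An element may exist once, twice, three times or four times, and as soon as it shows up more than once it is added to results.txt.

But the problem is that I only want to save it to results.txt if it exists 4 times.

Or, in the best case, compare the 4 .txt files and save only the elements common to all of them to results.txt.

But I can't figure it out.

Thanks for your help :)


To simplify things,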

with open(DIR_OUTPUT + "/all_agreements.txt") as infile, open(DIR_OUTPUT + "/results.txt", "w") as outfile:
    seen = set()
    for line in infile:
        if line in seen:
            outfile.write(line)
        else:
            seen.add(line)

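Where can I use the .count() function? I want to do something like xxx.count(line) == 4 and then save the line to results.txt.

2 answers:

Answer 0: (score: -1)

If the files are not very large, you can use set.intersection(a, b, c, d):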

data = []
for fname in filenames:
    current = set()
    with open(fname) as infile:
        for line in infile:
            current.add(line)
    data.append(current)

results = set.intersection(*data)

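You also don't need to create one big file for this.

Answer 1: (score: -1)

Not sure what the input looks like or what the expected output is...

But maybe this can inspire some ideas: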
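For completeness, here is a minimal sketch of how the intersection could be written out to results.txt. It assumes the paths come from the filenames list built in the question; the helper names common_lines and write_common_lines are made up for illustration. Line endings are stripped before comparing, on the assumption that a missing final newline should not prevent a match.

```python
def common_lines(paths):
    """Return the set of lines that appear in every file in `paths`."""
    line_sets = []
    for path in paths:
        with open(path) as infile:
            # Strip the newline so "a=" and "a=\n" compare equal (an assumption).
            line_sets.append({line.rstrip("\n") for line in infile})
    # set.intersection(*[]) raises TypeError, so handle "no files" explicitly.
    if not line_sets:
        return set()
    return set.intersection(*line_sets)


def write_common_lines(paths, output_path):
    """Write the lines shared by all files to `output_path`, sorted."""
    with open(output_path, "w") as outfile:
        for line in sorted(common_lines(paths)):
            outfile.write(line + "\n")
```

With the question's variables this might be called as write_common_lines(filenames, DIR_OUTPUT + "/results.txt"). Sets do not keep the original line order, which is why the output is sorted here.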

from io import StringIO
from collections import Counter

lines = ["""\
a=This
b=is
c=a Test
""", """\
a=This
b=is
c=a Demonstration
""", """\
a=This
b=is
c=another
d=example
""", """\
a=This
b=is
c=so much
d=fun
"""]

files = (StringIO(l) for l in lines)

C = Counter(line for f in files for line in f)

print([k for k,v in C.items() if v >= 4])
# Output: ['a=This\n', 'b=is\n']
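
Applied to real files, the same counting idea might look like the sketch below. It is only an illustration under a few assumptions: the helper name lines_in_at_least is hypothetical, lines are compared case-insensitively (as the question's line.lower() suggests), and each line is counted at most once per file so that a duplicate inside one file cannot reach the threshold on its own.

```python
from collections import Counter


def lines_in_at_least(paths, minimum):
    """Return the lines found in at least `minimum` of the files in `paths`."""
    counts = Counter()
    for path in paths:
        with open(path) as infile:
            # A set per file: a line repeated inside one file counts only once.
            unique = {line.strip().lower() for line in infile if line.strip()}
        counts.update(unique)
    return sorted(line for line, n in counts.items() if n >= minimum)
```

With four contract files, lines_in_at_least(filenames, 4) would return only the lines present in all four, which matches what xxx.count(line) == 4 was meant to express in the question.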