Optimizing a large-file comparison on specified columns

Time: 2018-07-27 06:54:51

Tags: python python-2.7

I have the following details for the file comparison.

File details:

[image: file details screenshot in the original post]

Note: every attribute of the files is dynamic, and File2 is File1 plus appended/incremental data.

Requirements:

  1. I want to compare File1 and File2 and store the differing records in a third file, Output.txt, using File2's delimiter (|) and File2's column count (80). File2 has all of File1's data plus the recently appended data.

  2. File1 and File2 have two unique columns called ID and Date, and the column indexes of these two columns may differ between the files. I want to compare on these two columns (ID, Date).

  3. If I find any other characters in the Date column, I want to store that record in an Error.txt file.

  4. The delimiter is dynamic.

  5. The number of columns is dynamic.

  6. The column indexes are dynamic.

  7. In the output file I will get the differing records, which amount to about 0.2 GB of data.

My attempt: I tried the code below, but it just keeps running and I never get a result.

from __future__ import print_function
import dateutil.parser as dparser
from dateutil.parser import parse

file1 = 'E:\File1.txt' 
file2 = 'E:\File2.txt' 
file3 = 'E:\OUTFile.txt' 
file4 = 'E:\Errors.txt'

with open(file1, 'r') as f1:
    firstline = f1.readline()
    print('File1 Header:',firstline.strip('\n'))

file1_delimiter = raw_input('Please provide the delimiter:')

with open(file2, 'r') as f2:
    firstline = f2.readline()
    print('\nFile2 Header:',firstline.strip('\n'))

file2_delimiter = raw_input('Please provide the delimiter:')

with open(file1, 'r') as f1:
    header = f1.readline()
    headerList1 = list(header.split(file1_delimiter))
    print('\n---File1, Column Index with Header---')
    for item in headerList1:
        print(headerList1.index(item),item)

    file1_header1 = input("Enter column1 number:")
    file1_header2 = input("Enter column2 number:")

with open(file2, 'r') as f2:
    header = f2.readline()
    headerList = list(header.split(file2_delimiter))
    print('\n---File2, Column Index with Header---')
    for item in headerList:
        print(headerList.index(item), item)

    file2_header1 = input("Enter column1 number:")
    file2_header2 = input("Enter column2 number:")

file_1_set1 = set()
file_2_set1 = set()
file_1_set2 = set()
file_2_set2 = set()

def is_date(string):
    try:
        parse(string)
        return True
    except ValueError:
        return False

with open(file1, 'r') as f_1:
    lines = f_1.readlines()[1:]
    f_1_result = []
    for x in lines:
        if x.split(file1_delimiter)[file1_header2]:
            file_1_set1.add(x.split(file1_delimiter)[file1_header1].strip('\n'))
            if is_date(x.split(file1_delimiter)[file1_header2]) == True:
                file_1_set2.add(str(dparser.parse(x.split(file1_delimiter)[file1_header2].strip('\n'),fuzzy=True).date()))

with open(file2, 'r') as f_2:
    lines = f_2.readlines()[1:]
    f_2_result = []
    for x in lines:
        if x.split(file2_delimiter)[file2_header2]:
            file_2_set1.add(x.split(file2_delimiter)[file2_header1].strip('\n'))
            if is_date(x.split(file2_delimiter)[file2_header2]) == True:
                file_2_set2.add(str((dparser.parse(x.split(file2_delimiter)[file2_header2].strip('\n'),fuzzy=True).date())))

with open(file2, 'r') as in_file, open(file3, 'w') as out_file, open(file4, 'w') as err:
    out_file.write(next(in_file))
    set1_diff = (file_2_set1 - file_1_set1)
    set2_diff = (file_2_set2 - file_1_set2)

    for line in in_file:
        if line.split(file2_delimiter)[file2_header2]:
            if is_date(line.split(file2_delimiter)[file2_header2]) == True:
                if line.split(file2_delimiter)[file2_header1] in set1_diff or str(dparser.parse(line.split(file2_delimiter)[file2_header2].strip('\n'),fuzzy=True).date()) in set2_diff:
                    out_file.write(line)
            else:
                err.write(line)

Sample data:

File1:

ID^MICNO^Name^Dt^MidName^Address^Permanent Address^ASID^E-mail ID^Gender^Nationality^Subscriber Details^D No^UMO No^Type^DType^S Subscriber^CType^FormA^POU No^SSAP^CD^Date
223344^^Jak . .^^MAK^HNo 123 USA^    -^^^^^^^^^^^TM^^^^^14-04-2012
56432178^^David . .^^Koustry^HNo 366 UK^    -^^^^^Ink Olk^^^^^^TOM^^^^^23-02-2015
3241567890^^Simon . .^^Plourd^HNo 233 UAE^    -^^^^^^^^^^^TMM^^^^^28-07-2016

File2:

ID^MICNO^Name^Dt^MidName^Address^Permanent Address^ASID^E-mail ID^Gender^Nationality^Subscriber Details^D No^UMO No^Type^DType^S Subscriber^CType^FormA^POU No^SSAP^CD^Date
12334^^Brod . .^^Plaku^HNo 5400 CAN^    -^^^^^^^^^^^TM^^^^^14-04-2012
56432178^^David . .^^Koustry^HNo 366 UK^    -^^^^^Ink Olk^^^^^^TOM^^^^^23-02-2015
3241567890^^Simon . .^^Plourd^HNo 233 UAE^    -^^^^^^^^^^^TMM^^^^^28-07-2017

The output file should contain:

ID^MICNO^Name^Dt^MidName^Address^Permanent Address^ASID^E-mail ID^Gender^Nationality^Subscriber Details^D No^UMO No^Type^DType^S Subscriber^CType^FormA^POU No^SSAP^CD^Date
12334^^Brod . .^^Plaku^HNo 5400 CAN^    -^^^^^^^^^^^TM^^^^^14-04-2012
3241567890^^Simon . .^^Plourd^HNo 233 UAE^    -^^^^^^^^^^^TMM^^^^^28-07-2017

2 Answers:

Answer 0 (score: 1)

Since I cannot test on 3 GB files, I am not sure this meets your requirements. I made a few improvements to your code:

  • I store the full key in a single set, meaning the (ID, Date) pair from File1
  • I read each file only once
  • I keep only one line from each input file in memory at a time

Here is my code:

from __future__ import print_function
import dateutil.parser as dparser
from dateutil.parser import parse
import csv

file1 = 'E:\File1.txt' 
file2 = 'E:\File2.txt' 
file3 = 'E:\OUTFile.txt' 
file4 = 'E:\Errors.txt'

with open(file1, 'r') as f1:
    firstline = f1.readline()
    print('File1 Header:',firstline.strip('\n'))

file1_delimiter = raw_input('Please provide the delimiter:')

with open(file2, 'r') as f2:
    firstline = f2.readline()
    print('\nFile2 Header:',firstline.strip('\n'))

file2_delimiter = raw_input('Please provide the delimiter:')

file_1_set = set()

def is_date(string):
    print(string)
    try:
        parse(string)
        return True
    except ValueError:
        return False

with open(file1, 'r') as f1, open(file2, 'r') as f2:
    rd1 = csv.reader(f1, delimiter = file1_delimiter)
    headerList1 = next(rd1)
    print('\n---File1, Column Index with Header---')
    for i, item in enumerate(headerList1):
        print(i,item)

    file1_header1 = input("Enter column1 number:")
    file1_header2 = input("Enter column2 number:")

    rd2 = csv.reader(f2, delimiter = file2_delimiter)
    headerList2 = next(rd2)
    print('\n---File2, Column Index with Header---')
    for i, item in enumerate(headerList2):
        print(i, item)

    file2_header1 = input("Enter column1 number:")
    file2_header2 = input("Enter column2 number:")

    for x in rd1:
        if x[file1_header1] and is_date(x[file1_header2]):
            file_1_set.add((x[file1_header1], dparser.parse(x[file1_header2],
                            fuzzy=True).date()))

    with open(file3, 'wb') as out_file, open(file4, 'wb') as err:
        out_wr = csv.writer(out_file, delimiter = file2_delimiter)
        err_wr = csv.writer(err, delimiter = file2_delimiter)
        out_wr.writerow(headerList2)
        f_2_result = []
        for x in rd2:
            if not is_date(x[file2_header2]):
                err_wr.writerow(x)
            elif x[file2_header1] and ((x[file2_header1], dparser.parse(x[file2_header2],
                                fuzzy=True).date()) not in file_1_set):
                out_wr.writerow(x)

From your file samples, your input files do contain field names. In that case you can simply use a DictReader and work directly with the field names from the first line.
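
For instance, a minimal sketch of that idea, reusing the file1 and file1_delimiter variables from the code above and assuming the key columns are literally named ID and Date, as in your sample header:

import csv

# Peek at the first data row using the column names picked up from the header line.
with open(file1, 'r') as f1:
    rd1 = csv.DictReader(f1, delimiter=file1_delimiter)
    for row in rd1:
        print(row['ID'], row['Date'])   # no column indexes needed
        break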

Since you say the program keeps running without producing any output, I suggest printing a dot every n lines. For a 3 GB file, one dot every 10000 lines should be an acceptable compromise between waiting too long between dots and printing too many of them. The code becomes:

from __future__ import print_function
import dateutil.parser as dparser
from dateutil.parser import parse
import csv
import sys

file1 = 'E:\File1.txt' 
file2 = 'E:\File2.txt' 
file3 = 'E:\OUTFile.txt' 
file4 = 'E:\Errors.txt'

delta = 10000          # one dot on stderr at every 10000th line

with open(file1, 'r') as f1:
    firstline = f1.readline()
    print('File1 Header:',firstline.strip('\n'))

file1_delimiter = raw_input('Please provide the delimiter:')

with open(file2, 'r') as f2:
    firstline = f2.readline()
    print('\nFile2 Header:',firstline.strip('\n'))

file2_delimiter = raw_input('Please provide the delimiter:')

# Names of the two key columns, as they appear in the file headers (ID and Date in your sample data)
hid = 'ID'
hdate = 'Date'

file_1_set = set()

def is_date(string):
    try:
        parse(string)
        return True
    except ValueError:
        return False

with open(file1, 'r') as f1, open(file2, 'r') as f2:
    rd1 = csv.DictReader(f1, delimiter = file1_delimiter)

    if hid not in rd1.fieldnames or hdate not in rd1.fieldnames:
        raise KeyError("File1 does not contain ID and Date fields")
    rd2 = csv.DictReader(f2, delimiter = file2_delimiter)
    if hid not in rd2.fieldnames or hdate not in rd2.fieldnames:
        raise KeyError("File2 does not contain ID and Date fields")

    numlig = 0
    _ = sys.stderr.write("Processing file1:")
    for x in rd1:
        if x[hid] and is_date(x[hdate]):
            file_1_set.add((x[hid], dparser.parse(x[hdate],
                            fuzzy=True).date()))
            numlig +=1
            if numlig >= delta:
                _ = sys.stderr.write('.')
                numlig = 0

    with open(file3, 'wb') as out_file, open(file4, 'wb') as err:
        out_wr = csv.DictWriter(out_file, fieldnames = rd2.fieldnames,
                                delimiter = file2_delimiter)
        err_wr = csv.DictWriter(err, fieldnames = rd2.fieldnames,
                            delimiter = file2_delimiter)
        out_wr.writeheader()
        numlig = 0
        _ = sys.stderr.write("\nProcessing file2:")
        for x in rd2:
            if not is_date(x[hdate]):
                err_wr.writerow(x)
            elif x[hid] and ((x[hid], dparser.parse(x[hdate],
                                fuzzy=True).date()) not in file_1_set):
                out_wr.writerow(x)
            numlig += 1
            if numlig >= delta:
                _ = sys.stderr.write('.')
                numlig = 0

Answer 1 (score: 0)

My answer is similar to @Serge's, but I think you can make some additional improvements.

For one thing, your code sample shows the dates as very consistently formatted strings. Since you are only interested in comparing them for equality, there is no need to put them through any (expensive) conversion, and certainly not twice. Any improvement you can gain over gigabytes of processing is worth having. Instead, I suggest something like a simple regex or even a manual check:

def is_date_simple(string):
    return (len(string) == 10 and string[2] == '-' and string[5] == '-' and
            string[:2].isdigit() and string[3:5].isdigit() and
            string[6:].isdigit())

OR

import re

...

def is_date_regex(string, date_pattern=re.compile(r'\d\d-\d\d-\d\d\d\d')):
    return date_pattern.fullmatch(string)

If the check passes, use the date as-is.
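
For instance, a quick check against dates taken from your sample data (assuming the DD-MM-YYYY layout shown there):

print(is_date_simple('14-04-2012'))   # True  -> keep the record
print(is_date_simple('14.04.2012'))   # False -> the record belongs in Errors.txt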

For another thing, you can identify the columns of interest automatically, since their names are always ID and Date.

As far as code niceness goes, I recommend taking advantage of the fact that files iterate directly over their lines. Do not use readlines on huge files! It loads the entire thing into memory instead of using Python's built-in buffering.
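
A minimal sketch of the difference, with a placeholder file name:

# readlines() materialises every line as one big list before the loop even starts;
# on a multi-GB file this can exhaust memory.
with open('big.txt', 'r') as f:
    for line in f.readlines()[1:]:
        pass

# Iterating over the file object streams it with Python's internal buffering,
# holding only one line in memory at a time.
with open('big.txt', 'r') as f:
    next(f)          # skip the header line
    for line in f:
        pass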

As a nitpick, do not do if ... == True:. True is a singleton, so the comparison should at least be the more efficient ... is True. But in an if statement, do not compare at all: if ...: is sufficient and better, since it automatically converts ... to bool.
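
Concretely, using the is_date_simple helper from above:

value = '14-04-2012'

if is_date_simple(value) == True:   # works, but the comparison is redundant
    pass

if is_date_simple(value):           # same effect, idiomatic and marginally cheaper
    pass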

Here is how you would build the initial set of keys:

def initialize(file, delim):
    """
    Creates an iterator, moves it past the header and gets the indices of
    the id and date columns.

    Returns the iterator, the raw header line, the id and date column indices.
    """
    iterator = iter(file)
    header_line = next(iterator)
    header = [x.casefold() for x in header_line.rstrip('\n').split(delim)]
    index_id = header.index('id')
    index_date = header.index('date')
    return iterator, header_line, index_id, index_date

def index(line, delim, index_id, index_date):
    line = line.rstrip('\n').split(delim)
    date = line[index_date]
    if not is_date(date):   # Use one of the implementations shown above.
        return None
    return line[index_id], date

delim1 = '^'
with open(file1, 'r') as f1:
    iterator, _, index_id, index_date = initialize(f1, delim1)
    key = lambda line: index(line, delim1, index_id, index_date)
    existing_set = set(filter(None, (key(x) for x in iterator)))  # That's it. Whole file1 indexed

Here I am assuming that your lines will not contain any quoted delimiters; your sample data suggests that this is safe. The reason for the two functions shown above is that they are equally useful for pulling the same information out of the second file:

delim2 = '|'
with open(file4, 'a') as err, open(file3, 'a') as output, open(file2, 'r') as f2:
    iterator, header, index_id, index_date = initialize(f2, delim2)
    print(header, file=output, end='')   # header already ends with '\n'
    for line in iterator:
        key = index(line, delim2, index_id, index_date)
        if not key:
            print(line, file=err, end='')      # line already ends with '\n'
        elif key not in existing_set:
            print(line, file=output, end='')

If you do not want file2's header repeated in the output file, remove the print(header, ...) line.

Note that I chose not to use the csv module for this application. That module is great when you want to parse the data in detail, but it is more expensive than simply calling split on the strings. Your CSV files are very simple, you are only trying to use two of the fields as a key, and you are also trying to cut every bit of overhead you can.
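
If you want to measure that overhead yourself, a rough timing sketch along these lines should do; the synthetic row and the repeat count are made up purely for illustration:

import csv
import timeit

# A small synthetic row (made-up data): ID, two filler columns, Date.
rows = ['223344^Jak . .^HNo 123 USA^14-04-2012'] * 100000

def with_split():
    keys = set()
    for row in rows:
        parts = row.split('^')
        keys.add((parts[0], parts[3]))
    return keys

def with_csv():
    keys = set()
    for parts in csv.reader(rows, delimiter='^'):
        keys.add((parts[0], parts[3]))
    return keys

print('split:', timeit.timeit(with_split, number=3))
print('csv:  ', timeit.timeit(with_csv, number=3))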