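Python 3.7: Performance tuning for comparing large data files

Time: 2018-06-04 10:40:50

Tags: python python-3.x

I have two CSV files of 3 GB each that I need to compare, storing the differences in a third file.

Python code: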

with open('JUN-01.csv', 'r') as f1:
    file1 = f1.readlines()

with open('JUN-02.csv', 'r') as f2:
    file2 = f2.readlines()

with open('JUN_Updates.csv', 'w') as outFile:
    outFile.write(file1[0])
    for line in file2:
        if line not in file1:
            outFile.write(line)

Execution time: 45 minutes and still running...

2 answers:

Answer 0: (score: 4)

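Not sure if it's too late, but here goes.

I see that you are loading both complete files into memory as two lists. If they are about 3 GB each, as you say, that means trying to fit 6 GB into RAM and possibly going into swap.

Moreover, even if the files load successfully, you are then attempting roughly L1 x L2 string comparisons (L1 and L2 being the line counts), because every `line not in file1` test scans the whole list.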
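
To make the cost difference concrete, here is a minimal sketch with made-up rows (the function names and sample data are illustrative and not part of the original post). Testing membership in a list scans the list each time, while a set does an average constant-time hash lookup:

```python
def new_lines_slow(old_lines, new_lines):
    # List membership: each `in` test scans old_lines, so the worst case
    # is roughly len(old_lines) * len(new_lines) string comparisons.
    return [line for line in new_lines if line not in old_lines]


def new_lines_fast(old_lines, new_lines):
    # Set membership: one pass to build the set, then an average O(1)
    # lookup for every line of the second file.
    seen = set(old_lines)
    return [line for line in new_lines if line not in seen]


old = ["id,value\n", "1,a\n", "2,b\n"]
new = ["id,value\n", "1,a\n", "2,c\n", "3,d\n"]

# Both versions give the same answer; only the amount of work differs.
assert new_lines_slow(old, new) == new_lines_fast(old, new)
print(new_lines_fast(old, new))
```

A set of full lines still keeps every line of the first file in memory, which is why hashing each line to a small value can be attractive for files of this size.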

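I ran the following code on a 1.2 GB file (3.3 million lines) and it finished in a few seconds. It uses string hashes and only loads a set of L1 32-bit integers into RAM.

The trick happens here: a set() is created after applying the hashstring function to every line in the file (except the heading, which you seem to want in the output).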

file1 = set(map(hashstring, f1))

Note that I am comparing the file against itself (f2 loads the same file as f1). Let me know if it helps.

from zlib import adler32

def hashstring(s):
    return adler32(s.encode('utf-8'))

with open('haproxy.log.1', 'r') as f1:
    heading = f1.readline()
    print(f'Heading: {heading}')
    print('Hashing')
    file1 = set(map(hashstring, f1))
    print(f'Hashed: {len(file1)}')

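with open('updates.log', 'w') as outFile:
    count = 0
    outFile.write(heading)
    with open('haproxy.log.1', 'r') as f2:
        f2.readline()  # skip f2's heading; it was already written above
        for line in f2:
            # Write only the lines whose hash was not seen in the first file
            if hashstring(line) not in file1:
                outFile.write(line)
            count += 1
            if 0 == count % 10000:
                print(f'Checked: {count}')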
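
One caveat with this approach: adler32 is a 32-bit checksum, so with millions of lines two different lines can share a value, and a genuinely new line would then be skipped without any warning. If that matters, a wider digest lowers the risk at the cost of more memory. The following is only a sketch under assumptions: the function names, the 8-byte digest size and the sample rows are illustrative choices, not something from the original answer.

```python
import hashlib
import os
import tempfile


def line_digest(line: bytes) -> bytes:
    # 8-byte BLAKE2b digest; digest_size=8 is an illustrative trade-off
    # between memory use and collision risk.
    return hashlib.blake2b(line, digest_size=8).digest()


def write_new_lines(old_path, new_path, out_path):
    """Write the header plus each line of new_path whose digest is not in old_path."""
    with open(old_path, "rb") as old_file:
        header = old_file.readline()
        seen = {line_digest(line) for line in old_file}

    written = 0
    with open(new_path, "rb") as new_file, open(out_path, "wb") as out_file:
        out_file.write(header)
        new_file.readline()  # skip the second file's header
        for line in new_file:
            if line_digest(line) not in seen:
                out_file.write(line)
                written += 1
    return written


# Small self-contained demo with made-up rows.
with tempfile.TemporaryDirectory() as tmp:
    old_path = os.path.join(tmp, "old.csv")
    new_path = os.path.join(tmp, "new.csv")
    out_path = os.path.join(tmp, "updates.csv")
    with open(old_path, "wb") as f:
        f.write(b"id,value\n1,a\n2,b\n")
    with open(new_path, "wb") as f:
        f.write(b"id,value\n1,a\n2,c\n3,d\n")
    written = write_new_lines(old_path, new_path, out_path)
    with open(out_path, "rb") as f:
        result = f.read()

print(written, result)
```

The files are opened in binary mode so each line can be hashed directly, without an encode step.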

Answer 1: (score: 0)

In case difflib helps with efficiency, try the following:

import difflib

with open('JUN_Updates.csv', 'w') as differenceFile:
    with open('JUN-01.csv', 'r') as june1File:
        with open('JUN-02.csv', 'r') as june2File:
            diff = difflib.unified_diff(
                june1File.readlines(),
                june2File.readlines(),
                fromfile='june1File',
                tofile='june2File',
            )

            lines = list(diff)[2:]
            added = [line[1:] for line in lines if line[0] == '+']
            removed = [line[1:] for line in lines if line[0] == '-']

            for line in added:
                differenceFile.write(line)
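
To show what the `[2:]` slice and the prefix checks operate on, here is a small sketch with made-up rows (the sample data and the `old`/`new` labels are illustrative). Keep in mind that this approach reads both files fully with readlines(), and the difflib documentation describes its matcher as quadratic time in the worst case, so on multi-gigabyte inputs it may well be slower than a set-based lookup.

```python
import difflib

old = ["id,value\n", "1,a\n", "2,b\n"]
new = ["id,value\n", "1,a\n", "2,c\n", "3,d\n"]

diff = list(difflib.unified_diff(old, new, fromfile="old", tofile="new"))

# The first two entries are the '--- old' / '+++ new' file headers,
# which is why the answer drops them with [2:].
file_headers, body = diff[:2], diff[2:]

# In the body, '+' marks lines only in `new`, '-' marks lines only in `old`,
# a leading space marks unchanged context, and '@@' marks a hunk header.
added = [line[1:] for line in body if line.startswith("+")]
removed = [line[1:] for line in body if line.startswith("-")]

print(added)
print(removed)
```

Unlike a set membership test, a unified diff is position-aware, so a row that merely moved to a different place in the file can show up as both removed and added.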