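Comparing two large files and combining the matching information

Time: 2017-05-24 21:49:47

Tags: python json python-3.x csv multiprocessing

I have two fairly large files, a JSON file (185,000 lines) and a CSV file (650,000 lines). I need to iterate over every dict in the JSON file, then over every part in its part_numbers, and look each part up in the CSV to get the three-letter prefix listed for it there.

For some reason I'm having a hard time getting this right. The first version of my script was far too slow, so I'm trying to speed it up.

JSON example: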

[
    {"category": "Dryer Parts", "part_numbers": ["ABC", "DEF", "GHI", "JKL", "MNO", "PQR"], "parent_category": "Dryers"},
    {"category": "Washer Parts", "part_numbers": ["ABC", "DEF", "GHI", "JKL", "MNO", "PQR"], "parent_category": "Washers"},
    {"category": "Sink Parts", "part_numbers": ["ABC", "DEF", "GHI", "JKL", "MNO", "PQR"], "parent_category": "Sinks"},
    {"category": "Other Parts", "part_numbers": ["ABC", "DEF", "GHI", "JKL", "MNO", "PQR"], "parent_category": "Others"}
]

CSV:

WCI|ABC
WPL|DEF
BSH|GHI
WCI|JKL

Each resulting dict should look like this:

{"category": "Other Parts",
 "part_numbers": ["WCIABC","WPLDEF","BSHGHI","WCIJKL"...]}

Below is an example of what I have so far, although it raises IndexError: list index out of range at if (part.rstrip() == row[1]):

import csv
import json
from multiprocessing import Pool

def find_part(item):
    data = {
        'parent_category': item['parent_category'],
        'category': item['category'],
        'part_numbers': []
    }

    for part in item['part_numbers']:
        for row in reader:
            if (part.rstrip() == row[1]):
                data['part_numbers'].append(row[0] + row[1])

    with open('output.json', 'a') as outfile:
        outfile.write('    ')
        json.dump(data, outfile)
        outfile.write(',\n')


if __name__ == '__main__':
    catparts = json.load(open('catparts.json', 'r'))
    partfile = open('partfile.csv', 'r')
    reader = csv.reader(partfile, delimiter='|')


    with open('output.json', 'w+') as outfile:
        outfile.write('[\n')

    p = Pool(50)
    p.map(find_part, catparts)

    with open('output.json', 'a') as outfile:
        outfile.write('\n]')

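3 Answers:

Answer 0: (score: 1)

I think I've found it. Your CSV reader works like many other file-access methods: you read the file sequentially and then hit EOF. When you try to do the same for the second part, the file is already at EOF, and the first read attempt returns an empty result; that has no second element.

If you want to go through all the records again, you need to reset the file position. The simplest way is to seek back to byte 0:
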
partfile.seek(0)

Another way is to close and reopen the file.

Does that get you going?
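A small sketch of the rewind described above. It uses an in-memory io.StringIO as a stand-in for partfile.csv (an assumption, so the snippet runs without any files) and the sample rows from the question. It only shows that a second pass over an exhausted reader yields no rows until the underlying file is rewound.

```python
import csv
import io

# In-memory stand-in for partfile.csv (assumption: sample rows from the question).
partfile = io.StringIO("WCI|ABC\nWPL|DEF\nBSH|GHI\nWCI|JKL\n")
reader = csv.reader(partfile, delimiter='|')

first_pass = [row for row in reader]   # reads the file through to EOF
second_pass = [row for row in reader]  # file is still at EOF: no rows left

partfile.seek(0)                       # rewind the underlying file to byte 0
third_pass = [row for row in reader]   # rows are available again

print(len(first_pass), len(second_pass), len(third_pass))
```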

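Answer 1: (score: 1)

As I said in a comment, your code (as it stands) gives me NameError: name 'reader' is not defined in the find_part() function. The fix is to move the creation of the csv.reader into the function. I also changed how the file is opened, to use a with context manager and the newline argument. This also solves the problem of a bunch of separate tasks all trying to read the same csv file at the same time.

Your approach is very inefficient because it reads the entire 'partfile.csv' file for every part in item['part_numbers']. Nevertheless, the following seems to work: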

import csv
import json
from multiprocessing import Pool

def find_part(item):
    data = {
        'parent_category': item['parent_category'],
        'category': item['category'],
        'part_numbers': []
    }

    for part in item['part_numbers']:
        with open('partfile.csv', newline='') as partfile:  # open csv in Py 3.x
            for row in csv.reader(partfile, delimiter='|'):
                if part.rstrip() == row[1]:
                    data['part_numbers'].append(row[0] + row[1])

    with open('output.json', 'a') as outfile:
        outfile.write('    ')
        json.dump(data, outfile)
        outfile.write(',\n')

if __name__ == '__main__':
    with open('catparts.json', 'r') as infile:  # file name as used in the question
        catparts = json.load(infile)

    with open('output.json', 'w+') as outfile:
        outfile.write('[\n')

    p = Pool(50)
    p.map(find_part, catparts)

    with open('output.json', 'a') as outfile:
        outfile.write(']')

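Here is a considerably more efficient version that reads the entire 'partfile.csv' file only once per find_part() call, rather than once per part number: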

import csv
import json
from multiprocessing import Pool

def find_part(item):
    data = {
        'parent_category': item['parent_category'],
        'category': item['category'],
        'part_numbers': []
    }

    with open('partfile.csv', newline='') as partfile:  # open csv for reading in Py 3.x
        partlist = [row for row in csv.reader(partfile, delimiter='|')]

    for part in item['part_numbers']:
        part = part.rstrip()
        for row in partlist:
            if row[1] == part:
                data['part_numbers'].append(row[0] + row[1])

    with open('output.json', 'a') as outfile:
        outfile.write('    ')
        json.dump(data, outfile)
        outfile.write(',\n')

if __name__ == '__main__':
    with open('catparts.json', 'r') as infile:  # file name as used in the question
        catparts = json.load(infile)

    with open('output.json', 'w+') as outfile:
        outfile.write('[\n')

    p = Pool(50)
    p.map(find_part, catparts)

    with open('output.json', 'a') as outfile:
        outfile.write(']')

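Although you could read the 'partfile.csv' data into memory in the main task and pass it as an argument to the find_part() sub-tasks, doing so would just mean the data has to be pickled and unpickled for each process. You would need to run some timing tests to determine whether that is any faster than reading it explicitly with the csv module, as shown above.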
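One possible middle ground, sketched here as an assumption rather than something taken from the answer above: let each worker process load the CSV once through the Pool initializer, so the data is neither re-read per item nor pickled per task. The names init_worker, run and _lookup are hypothetical. The sketch also swaps the list scan for a dict lookup, which assumes each part number appears at most once in the CSV, and it returns the results to the parent instead of appending to 'output.json' from every worker.

```python
import csv
from multiprocessing import Pool

# Hypothetical per-process cache: part number -> three-letter prefix.
_lookup = {}

def init_worker(csv_path):
    """Runs once in each worker process and loads the CSV into _lookup."""
    _lookup.clear()
    with open(csv_path, newline='') as partfile:
        for row in csv.reader(partfile, delimiter='|'):
            if len(row) >= 2:              # skip blank or malformed lines
                _lookup[row[1]] = row[0]   # assumes unique part numbers

def find_part(item):
    """Return a new dict with the prefix prepended to each matching part."""
    parts = [part.rstrip() for part in item['part_numbers']]
    return {
        'parent_category': item['parent_category'],
        'category': item['category'],
        'part_numbers': [_lookup[p] + p for p in parts if p in _lookup],
    }

def run(csv_path, catparts, processes=4):
    """Map find_part over catparts; each worker loads the CSV only once."""
    with Pool(processes, initializer=init_worker, initargs=(csv_path,)) as pool:
        return pool.map(find_part, catparts)
```

Called under the if __name__ == '__main__': guard as run('partfile.csv', catparts), this returns a list that the parent can write with a single json.dump call. Whether it beats the versions above would still need to be timed.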

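Also note that it would be more efficient to preprocess the data loaded from the 'catparts.json' file and strip the trailing whitespace from every part number before submitting the tasks to the Pool, because then you would not need to do part = part.rstrip() in find_part() over and over again. Again, I don't know whether doing so is worth the effort; only timing tests can determine the answer.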
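A minimal sketch of that preprocessing step, together with a tiny timing helper for the kind of test mentioned above. The names strip_part_numbers and timed are hypothetical, and the sample item with trailing whitespace is made up for illustration.

```python
import time

def strip_part_numbers(catparts):
    """Strip trailing whitespace from every part number, once, up front."""
    for item in catparts:
        item['part_numbers'] = [part.rstrip() for part in item['part_numbers']]
    return catparts

def timed(func, *args):
    """Return (result, elapsed_seconds) for a single call of func(*args)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start

# Made-up sample item with trailing whitespace on its part numbers.
sample = [{"category": "Dryer Parts",
           "part_numbers": ["ABC \n", "DEF\t"],
           "parent_category": "Dryers"}]

cleaned, seconds = timed(strip_part_numbers, sample)
print(cleaned[0]['part_numbers'], seconds)
```

The same timed() helper could wrap each variant of the script on the real files to compare them.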

Answer 2: (score: 0)

As long as all the part numbers are present in the csv, this should work.

import json

# read part codes into a dictionary
with open('partfile.csv') as fp:
    partcodes = {}
    for line in fp:
        code, number = line.strip().split('|')
        partcodes[number] = code

with open('catparts.json') as fp:
    catparts = json.load(fp)

# modify the part numbers/codes 
for cat in catparts:
    cat['part_numbers'] = [partcodes[n] + n for n in cat['part_numbers']]

# output
with open('output.json', 'w') as fp:
    json.dump(catparts, fp)
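If some part numbers might be missing from the csv, partcodes[n] would raise a KeyError. Below is a hedged variant of the same dictionary approach that skips unmatched part numbers and blank lines. Dropping them, rather than keeping them without a prefix, is an assumption about the desired output, and the in-memory sample data stands in for the real files.

```python
import io
import json

# In-memory stand-ins for partfile.csv and catparts.json (assumed sample data).
csv_text = "WCI|ABC\nWPL|DEF\n\nBSH|GHI\nWCI|JKL\n"
json_text = ('[{"category": "Other Parts", '
             '"part_numbers": ["ABC", "DEF", "GHI", "JKL", "MNO", "PQR"], '
             '"parent_category": "Others"}]')

# Read part codes into a dictionary, skipping blank or malformed lines.
partcodes = {}
for line in io.StringIO(csv_text):
    fields = line.strip().split('|')
    if len(fields) != 2:
        continue
    code, number = fields
    partcodes[number] = code

catparts = json.loads(json_text)

# Prefix the part numbers found in the csv; drop the ones that are not.
for cat in catparts:
    cat['part_numbers'] = [partcodes[n] + n
                           for n in cat['part_numbers'] if n in partcodes]

print(json.dumps(catparts))
```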