I have two fairly large files, a JSON file (185,000 lines) and a CSV (650,000 lines). I need to iterate over each dict in the JSON file, then iterate over every part in its part_numbers list and look up that part in the CSV to get its three-letter prefix.
For some reason I'm having a hard time doing this. The first version of my script was too slow, so I'm trying to speed it up.
Sample JSON:
[
{"category": "Dryer Parts", "part_numbers": ["ABC", "DEF", "GHI", "JKL", "MNO", "PQR"], "parent_category": "Dryers"},
{"category": "Washer Parts", "part_numbers": ["ABC", "DEF", "GHI", "JKL", "MNO", "PQR"], "parent_category": "Washers"},
{"category": "Sink Parts", "part_numbers": ["ABC", "DEF", "GHI", "JKL", "MNO", "PQR"], "parent_category": "Sinks"},
{"category": "Other Parts", "part_numbers": ["ABC", "DEF", "GHI", "JKL", "MNO", "PQR"], "parent_category": "Others"}
]
CSV:
WCI|ABC
WPL|DEF
BSH|GHI
WCI|JKL
The resulting dicts should look like this:
{"category": "Other Parts",
"part_numbers": ["WCIABC","WPLDEF","BSHGHI","JKLWCI"...]}
Below is an example of what I've done so far, though it fails with IndexError: list index out of range at the line if (part.rstrip() == row[1]):
import csv
import json
from multiprocessing import Pool

def find_part(item):
    data = {
        'parent_category': item['parent_category'],
        'category': item['category'],
        'part_numbers': []
    }
    for part in item['part_numbers']:
        for row in reader:
            if (part.rstrip() == row[1]):
                data['part_numbers'].append(row[0] + row[1])

    with open('output.json', 'a') as outfile:
        outfile.write(' ')
        json.dump(data, outfile)
        outfile.write(',\n')

if __name__ == '__main__':
    catparts = json.load(open('catparts.json', 'r'))
    partfile = open('partfile.csv', 'r')
    reader = csv.reader(partfile, delimiter='|')
    with open('output.json', 'w+') as outfile:
        outfile.write('[\n')
    p = Pool(50)
    p.map(find_part, catparts)
    with open('output.json', 'a') as outfile:
        outfile.write('\n]')
Answer 0 (score: 1)
I think I found it. Your CSV reader behaves like many other file-access methods: it reads the file sequentially and then hits EOF. When you try to do the same thing for the second part, the file is already at EOF, so the first read attempt returns an empty result, and an empty row has no second element.
If you want to visit all the records again, you need to reset the file position. The simplest way is to seek back to byte 0 with partfile.seek(0). Another way is to close and reopen the file.
Does that get you moving again?
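A minimal sketch of that rewind behavior, using an in-memory file in place of the real partfile.csv for illustration:

```python
import csv
import io

# stand-in for open('partfile.csv'), inlined for illustration
partfile = io.StringIO("WCI|ABC\nWPL|DEF\n")
reader = csv.reader(partfile, delimiter='|')

first_pass = [row for row in reader]   # consumes the file; reader is now at EOF
second_pass = [row for row in reader]  # empty: nothing left to read

partfile.seek(0)                       # rewind the underlying file object
third_pass = [row for row in csv.reader(partfile, delimiter='|')]

print(first_pass)   # [['WCI', 'ABC'], ['WPL', 'DEF']]
print(second_pass)  # []
print(third_pass)   # [['WCI', 'ABC'], ['WPL', 'DEF']]
```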
Answer 1 (score: 1)
As I said in a comment, your code (now) gives me NameError: name 'reader' is not defined in find_part(). The fix is to move the creation of the csv.reader into the function. I also changed how the file is opened to use a with context manager and the newline argument. This also fixes the problem of a bunch of separate tasks all trying to read the same csv file at the same time.
Your approach is very inefficient, because it reads the entire 'partfile.csv' file for every part in item['part_numbers']. Nevertheless, the following seems to work:
import csv
import json
from multiprocessing import Pool

def find_part(item):
    data = {
        'parent_category': item['parent_category'],
        'category': item['category'],
        'part_numbers': []
    }
    for part in item['part_numbers']:
        with open('partfile.csv', newline='') as partfile:  # open csv in Py 3.x
            for row in csv.reader(partfile, delimiter='|'):
                if part.rstrip() == row[1]:
                    data['part_numbers'].append(row[0] + row[1])

    with open('output.json', 'a') as outfile:
        outfile.write(' ')
        json.dump(data, outfile)
        outfile.write(',\n')

if __name__ == '__main__':
    catparts = json.load(open('carparts.json', 'r'))
    with open('output.json', 'w+') as outfile:
        outfile.write('[\n')
    p = Pool(50)
    p.map(find_part, catparts)
    with open('output.json', 'a') as outfile:
        outfile.write(']')
Here's a significantly more efficient version that reads the entire 'partfile.csv' file only once per subtask:
import csv
import json
from multiprocessing import Pool

def find_part(item):
    data = {
        'parent_category': item['parent_category'],
        'category': item['category'],
        'part_numbers': []
    }
    with open('partfile.csv', newline='') as partfile:  # open csv for reading in Py 3.x
        partlist = [row for row in csv.reader(partfile, delimiter='|')]
    for part in item['part_numbers']:
        part = part.rstrip()
        for row in partlist:
            if row[1] == part:
                data['part_numbers'].append(row[0] + row[1])

    with open('output.json', 'a') as outfile:
        outfile.write(' ')
        json.dump(data, outfile)
        outfile.write(',\n')

if __name__ == '__main__':
    catparts = json.load(open('carparts.json', 'r'))
    with open('output.json', 'w+') as outfile:
        outfile.write('[\n')
    p = Pool(50)
    p.map(find_part, catparts)
    with open('output.json', 'a') as outfile:
        outfile.write(']')
While you could read the 'partfile.csv' data into memory in the main task and pass it as an argument to the find_part() subtasks, doing so just means the data has to be pickled and unpickled for every process. You would need to run some timing tests to determine whether that's faster than explicitly reading it with the csv module, as shown above.
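One way to sketch that pass-the-data-as-an-argument variant is with functools.partial; the file contents and category data below are inlined stand-ins for the real files, and this is not benchmarked against the per-subtask read:

```python
import csv
import io
from functools import partial
from multiprocessing import Pool

def find_part(partlist, item):
    # build a number -> prefix lookup from the rows passed in by the parent
    lookup = {number: code for code, number in partlist}
    stripped = [p.rstrip() for p in item['part_numbers']]
    return {
        'parent_category': item['parent_category'],
        'category': item['category'],
        'part_numbers': [lookup[p] + p for p in stripped if p in lookup],
    }

if __name__ == '__main__':
    # stand-ins for partfile.csv and carparts.json, inlined for illustration
    partfile = io.StringIO("WCI|ABC\nWPL|DEF\nBSH|GHI\n")
    partlist = [row for row in csv.reader(partfile, delimiter='|')]
    catparts = [{'parent_category': 'Dryers', 'category': 'Dryer Parts',
                 'part_numbers': ['ABC', 'DEF']}]
    with Pool(2) as pool:
        # partlist is pickled and sent to the workers with every task
        results = pool.map(partial(find_part, partlist), catparts)
    print(results[0]['part_numbers'])  # ['WCIABC', 'WPLDEF']
```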
Also note that it would be more efficient to preprocess the data loaded from the 'carparts.json' file before submitting tasks to the Pool, stripping the trailing whitespace from each part number up front, because then you wouldn't need to do part = part.rstrip() over and over in find_part(). Again, I don't know whether doing so would be worth the effort; only timing tests can determine the answer.
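That preprocessing step is a one-liner per dict; a sketch with the data inlined for illustration:

```python
# strip trailing whitespace once, up front, instead of inside every find_part() call
catparts = [
    {'category': 'Dryer Parts', 'part_numbers': ['ABC ', 'DEF\n'],
     'parent_category': 'Dryers'},
]
for item in catparts:
    item['part_numbers'] = [p.rstrip() for p in item['part_numbers']]

print(catparts[0]['part_numbers'])  # ['ABC', 'DEF']
```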
Answer 2 (score: 0)
This should work, as long as all the part numbers are present in the csv.
import json

# read part codes into a dictionary
with open('partfile.csv') as fp:
    partcodes = {}
    for line in fp:
        code, number = line.strip().split('|')
        partcodes[number] = code

with open('catparts.json') as fp:
    catparts = json.load(fp)

# modify the part numbers/codes
for cat in catparts:
    cat['part_numbers'] = [partcodes[n] + n for n in cat['part_numbers']]

# output
with open('output.json', 'w') as fp:
    json.dump(catparts, fp)
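If a part number is ever missing from the csv, the list comprehension above raises KeyError. A hedged variant (the skip-unknown-numbers behavior is an assumption about what's wanted) filters them out instead, shown here with the dictionary inlined for illustration:

```python
partcodes = {'ABC': 'WCI', 'DEF': 'WPL'}  # built from partfile.csv as above
catparts = [{'category': 'Other Parts',
             'part_numbers': ['ABC', 'DEF', 'XYZ']}]  # 'XYZ' has no known prefix

for cat in catparts:
    # keep only the numbers whose prefix is known, instead of raising KeyError
    cat['part_numbers'] = [partcodes[n] + n for n in cat['part_numbers']
                           if n in partcodes]

print(catparts[0]['part_numbers'])  # ['WCIABC', 'WPLDEF']
```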