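I have three CSV files with the attributes Product_ID, Name, Cost, and Description. Every file contains Product_ID. I want to combine Name (file1), Cost (file2), and Description (file3) into a new CSV file that has Product_ID plus all three attributes. I need efficient code because the files contain more than 130,000 rows.
After merging all the data into the new file, I have to load it into a dictionary, e.g. with Product_ID as the key and Name, Cost, and Description as the value.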
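Answer 0: (score: 1)
It may be more efficient to read each input .csv into a dictionary before creating the aggregated result.
Here is a solution that reads each file and stores its columns in a dictionary keyed by Product_ID. I assume that every Product_ID value is present in each file and that the files include a header row. I also assume that, apart from Product_ID, no column is repeated across the files.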
import csv
from collections import defaultdict

entries = defaultdict(list)
files = ['names.csv', 'costs.csv', 'descriptions.csv']
headers = ['Product_ID']

for filename in files:
    # Python 3: newline='' lets the csv module handle line endings itself
    with open(filename, newline='') as f:   # Open each file in files
        reader = csv.reader(f)              # Create a reader to iterate csv lines
        heads = next(reader)                # Grab first line (headers)
        pk = heads.index(headers[0])        # Position of 'Product_ID' in the headers
        # Add the rest of the headers to the collected columns (skip 'Product_ID')
        headers.extend(x for i, x in enumerate(heads) if i != pk)
        for row in reader:
            # For each line, add new values (except 'Product_ID') to the
            # entries dict with the line's Product_ID value as the key
            entries[row[pk]].extend(x for i, x in enumerate(row) if i != pk)

with open('result.csv', 'w', newline='') as out:   # Open file to write csv lines
    writer = csv.writer(out)
    writer.writerow(headers)                # Write the headers first
    for key, value in entries.items():
        writer.writerow([key] + value)      # Product ID followed by its other values
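The question also asks for the merged data in a dictionary keyed by Product_ID. A minimal sketch of that step, assuming result.csv has a header row with a Product_ID column as written above; the function name load_products is hypothetical and not part of the original answer:

```python
import csv

def load_products(path):
    """Map each Product_ID to a dict of its remaining columns."""
    products = {}
    with open(path, newline='') as f:
        for row in csv.DictReader(f):
            key = row.pop('Product_ID')   # Take the key column out of the row
            products[key] = dict(row)     # Remaining columns become the value
    return products
```

If the merge and the lookup happen in the same script, the entries dictionary built during the merge already holds the same data in memory, so re-reading the file is only needed when the two steps run separately.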
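Answer 1: (score: 0)
A general solution that produces a record (possibly incomplete) for every id encountered while processing the three files needs a specialized data structure. Fortunately, that structure is just a list with a preassigned number of slots.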
# fn1, fn2, fn3 are the paths of the names, costs and descriptions files
d = {id: [name, None, None] for id, name in (line.strip().split(',') for line in open(fn1))}
for line in open(fn2):
    id, cost = line.strip().split(',')
    if id in d:
        d[id][1] = cost
    else:
        d[id] = [None, cost, None]
for line in open(fn3):
    id, desc = line.strip().split(',')
    if id in d:
        d[id][2] = desc
    else:
        d[id] = [None, None, desc]
for id in d:
    if all(d[id]):
        print(','.join([id] + d[id]))
    else:
        # For this id you do not have complete info, so you have to
        # decide on your own what you want; I have to pass
        pass
If you are sure that you do not want to process incomplete records any further, the code above can be simplified:
d = {id: [name] for id, name in (line.strip().split(',') for line in open(fn1))}
for line in open(fn2):
    id, cost = line.strip().split(',')
    if id in d:
        d[id].append(cost)
for line in open(fn3):
    id, desc = line.strip().split(',')
    if id in d:
        d[id].append(desc)
for id in d:
    if len(d[id]) == 3:
        print(','.join([id] + d[id]))
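One caveat: splitting on ',' breaks as soon as a field (a description, say) contains a quoted comma. Below is a rough sketch of the same slot-filling idea using csv.reader instead. It assumes, as the code above does, that the files have no header row and exactly two columns each. The name merge_files is hypothetical.

```python
import csv

def merge_files(paths):
    """Return {id: [value from file 0, value from file 1, ...]}, None where missing."""
    merged = {}
    for slot, path in enumerate(paths):
        with open(path, newline='') as f:
            for row in csv.reader(f):
                if not row:               # Skip blank lines
                    continue
                key, value = row          # Assumes exactly two columns per row
                # On first sight of an id, create one empty slot per file
                merged.setdefault(key, [None] * len(paths))[slot] = value
    return merged
```

Calling merge_files([fn1, fn2, fn3]) gives the same kind of dictionary as d above. To keep only complete records, filter with something like {k: v for k, v in merged.items() if None not in v}.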