file1.csv contains 2 columns: c11;c12
file2.csv contains 2 columns: c21;c22
Common column: c11 (in file1.csv) and c21 (in file2.csv)
Example:
f1.csv
a;text_a
b;text_b
f;text_f
x;text_x
f2.csv
a;path_a
c;path_c
d;path_d
k;path_k
l;path_l
m;path_m
Output (f1 + f2):
a;text_a;path_a
b;text_b;''
c;'';path_c
d;'';path_d
f;text_f;''
k;'';path_k
l;'';path_l
m;'';path_m
x;text_x;''
How can I implement this in Python?
Answer 0 (score: 3)
This is easy to do with the csv module:
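import csv

# Read each file into a dict keyed on the first (common) column.
with open('file1.csv', newline='') as f:
    r = csv.reader(f, delimiter=';')
    dict1 = {row[0]: row[1] for row in r}

with open('file2.csv', newline='') as f:
    r = csv.reader(f, delimiter=';')
    dict2 = {row[0]: row[1] for row in r}

# Union of the keys from both files, sorted to match the expected output.
keys = sorted(set(dict1) | set(dict2))

# Python 3: open in text mode with newline='' (Python 2 used 'wb').
with open('output.csv', 'w', newline='') as f:
    w = csv.writer(f, delimiter=';')
    w.writerows([key, dict1.get(key, "''"), dict2.get(key, "''")]
                for key in keys)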
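To check the approach against the sample data from the question without touching the disk, the same logic can be wrapped in a function. This is only a sketch: the name `outer_join` and its `missing` parameter are made up for illustration, and the inputs are held in `io.StringIO` buffers instead of real files.

```python
import csv
import io


def outer_join(rows1, rows2, missing="''"):
    """Full outer join of two iterables of [key, value] rows on the key.

    `outer_join` and `missing` are hypothetical names, not part of the
    csv module.
    """
    dict1 = {row[0]: row[1] for row in rows1 if row}
    dict2 = {row[0]: row[1] for row in rows2 if row}
    # Sorted union of keys, so the output order matches the question.
    return [[key, dict1.get(key, missing), dict2.get(key, missing)]
            for key in sorted(dict1.keys() | dict2.keys())]


# Sample data from the question, kept in memory.
f1 = io.StringIO("a;text_a\nb;text_b\nf;text_f\nx;text_x\n")
f2 = io.StringIO("a;path_a\nc;path_c\nd;path_d\nk;path_k\nl;path_l\nm;path_m\n")

merged = outer_join(csv.reader(f1, delimiter=';'),
                    csv.reader(f2, delimiter=';'))

out = io.StringIO()
csv.writer(out, delimiter=';', lineterminator='\n').writerows(merged)
print(out.getvalue(), end='')
```

If a key appears more than once in a file, the dict keeps only the last value for it, so this only fits data where the common column is unique within each file.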
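Answer 1 (score: 0)
To merge multiple files (even more than two) on one or more common columns, one of the best and most efficient approaches in Python is to use "brewery". You can even specify which fields to consider when merging and which fields to keep.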
import brewery
from brewery import ds
import sys
sources = [
{"file": "grants_2008.csv",
"fields": ["receiver", "amount", "date"]},
{"file": "grants_2009.csv",
"fields": ["id", "receiver", "amount", "contract_number", "date"]},
{"file": "grants_2010.csv",
"fields": ["receiver", "subject", "requested_amount", "amount", "date"]}
]
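Create a list of all fields and add the file name to store information about where each data record came from. Go through the source definitions and collect the fields: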
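# Start with "file" so each output record can store its source file name.
all_fields = ["file"]
for source in sources:
    for field in source["fields"]:
        if field not in all_fields:
            all_fields.append(field)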
out = ds.CSVDataTarget("merged.csv")
out.fields = brewery.FieldList(all_fields)
out.initialize()
for source in sources:
    path = source["file"]
    # Initialize data source: skip reading of headers
    # use XLSDataSource for XLS files
    # We ignore the fields in the header, because we have set-up fields
    # previously. We need to skip the header row.
    src = ds.CSVDataSource(path, read_header=False, skip_rows=1)
    src.fields = ds.FieldList(source["fields"])
    src.initialize()
    for record in src.records():
        # Add file reference into output - to know where the row comes from
        record["file"] = path
        out.append(record)
    # Close the source stream
    src.finalize()
cat merged.csv | brewery pipe pretty_printer
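If brewery is not available, roughly the same merge can be sketched with only the standard library's `csv.DictReader` and `csv.DictWriter`. This is a swapped-in technique, not brewery's API. The function name `merge_sources` is hypothetical, the sample rows are invented for the demonstration, and each input is assumed to start with a header row naming its fields.

```python
import csv
import io


def merge_sources(sources, out_stream):
    """Write all records from all sources into one CSV with the union of fields.

    `sources` is a list of (name, text_stream) pairs (a hypothetical layout);
    each stream is assumed to begin with a header row.
    """
    all_fields = ["file"]          # extra column: where each row came from
    parsed = []
    for name, stream in sources:
        reader = csv.DictReader(stream)
        rows = list(reader)        # reads the header, then every record
        for field in reader.fieldnames or []:
            if field not in all_fields:
                all_fields.append(field)
        parsed.append((name, rows))

    # restval fills fields that a given source does not have.
    writer = csv.DictWriter(out_stream, fieldnames=all_fields,
                            restval="", lineterminator="\n")
    writer.writeheader()
    for name, rows in parsed:
        for row in rows:
            row["file"] = name
            writer.writerow(row)
    return all_fields


# Invented sample data, only to show the shape of the result.
a = io.StringIO("receiver,amount,date\nacme,100,2008-01-05\n")
b = io.StringIO("id,receiver,amount,contract_number,date\n"
                "7,bolt,250,C-9,2009-03-01\n")
out = io.StringIO()
fields = merge_sources([("grants_2008.csv", a), ("grants_2009.csv", b)], out)
print(out.getvalue(), end="")
```

The sketch reads every source fully into memory before writing, which keeps it short but would not suit very large files; a two-pass version (headers first, then rows) would avoid that.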