我有CSV文件,其格式如下:
file1.csv
ID,NAME
001,Jhon
002,Doe
fille2.csv
ID,SCHOOLS_ATTENDED
001,my Nice School
002,His lovely school
file3.csv
ID,SALARY
001,25
002,40
ID 字段是一种用于获取记录的主键。
读取3到4个文件并获取相应数据并存储在另一个带有标题的CSV文件(ID,NAME,SCHOOLS_ATTENDED,SALARY)中的最有效方法是什么?
文件大小为数百MB(100,200 Mb)。
答案 0 :(得分:3)
数百兆不是那么多。为什么不使用csv
module和collections.defaultdict
:
import csv
from collections import defaultdict
result = defaultdict(dict)
fieldnames = {"ID"}
for csvfile in ("file1.csv", "file2.csv", "file3.csv"):
with open(csvfile, newline="") as infile:
reader = csv.DictReader(infile)
for row in reader:
id = row.pop("ID")
for key in row:
fieldnames.add(key) # wasteful, but I don't care enough
result[id][key] = row[key]
结果defaultdict
如下所示:
>>> result
defaultdict(<type 'dict'>,
{'001': {'SALARY': '25', 'SCHOOLS_ATTENDED': 'my Nice School', 'NAME': 'Jhon'},
'002': {'SALARY': '40', 'SCHOOLS_ATTENDED': 'His lovely school', 'NAME': 'Doe'}})
然后你可以把它组合成一个CSV文件(不是我最漂亮的工作,但现在还不错):
with open("out.csv", "w", newline="") as outfile:
writer = csv.DictWriter(outfile, sorted(fieldnames))
writer.writeheader()
for item in result:
result[item]["ID"] = item
writer.writerow(result[item])
out.csv
然后包含
ID,NAME,SALARY,SCHOOLS_ATTENDED
001,Jhon,25,my Nice School
002,Doe,40,His lovely school
答案 1 :(得分:0)
以下是将多个csv文件与其名称中的特定关键字组合成1个最终csv文件的工作代码。我已将default关键字设置为“file”,但如果您想要组合folder_path中的所有csv文件,则可以将其设置为空白。此代码将从您的第一个csv文件中获取标头,并将其用作最终组合csv文件中的标头。它将忽略所有其他csv文件的标头。
import glob,os
@staticmethod
def Combine_multiple_csv_files_thatContainsKeywordInTheirNames_into_one_csv_file(folder_path,keyword='file'):
#takes header only from 1st csv, all other csv headers are skipped and data is appened to final csv
fileNames = glob.glob(folder_path + "*" + keyword + "*"+".csv") # fileNames INCLUDES FOLDER_PATH TOO
with open(folder_path+"Combined_csv.csv", "w", newline='') as fout:
print('Combining multiple csv files into 1')
csv_write_file = csv.writer(fout, delimiter=',')
# a.writerows(op)
with open(fileNames[0], mode='rt') as read_file: # utf8
csv_read_file = csv.reader(read_file, delimiter=',') # CSVREADER READS FILE AS 1 LIST PER ROW. SO WHEN WRITIN TO ANOTHER CSV FILE WITH FUNCTION WRITEROWS, IT INTRODUCES ANOTHER NEW LINE '\N' CHARACTER. SO TO AVOID DOUBLE NEWLINES , WE SET NEWLINE AS '' WHEN WE OPEN CSV WRITER OBJECT
csv_write_file.writerows(csv_read_file)
for num in range(1, len(fileNames)):
with open(fileNames[num], mode='rt') as read_file: # utf8
csv_read_file = csv.reader(read_file, delimiter=',') # CSVREADER READS FILE AS 1 LIST PER ROW. SO WHEN WRITIN TO ANOTHER CSV FILE WITH FUNCTION WRITEROWS, IT INTRODUCES ANOTHER NEW LINE '\N' CHARACTER. SO TO AVOID DOUBLE NEWLINES , WE SET NEWLINE AS '' WHEN WE OPEN CSV WRITER OBJECT
next(csv_read_file) # ignore header
csv_write_file.writerows(csv_read_file)