I need to maintain a master CSV file (the child file) that is populated from different rows across several other files (the parent files). All parent files share the same layout. The child file has 2 additional columns appended at the end: a filename column (where the data came from) and a timestamp (when it was appended).
*Edit: updated code:
import glob, os, csv, time

DI = r'E:\Python\Test\MergeFilesIn'
FO = r'E:\Python\Test\MergeFilesOut\Export.txt'

olddata = set()
with open(FO) as master:
    for row in csv.reader(master, delimiter='|'):
        key = '|'.join(row[:3])
        olddata.add(key)

data = []
for input_file in glob.glob(os.path.join(DI, '*.txt')):
    with open(input_file) as finput:
        for i, row in enumerate(csv.reader(finput, delimiter='|')):
            key = '|'.join(row)
            if key not in olddata:
                to_append = "Filename" if i == 0 else input_file
                data.append(row + [to_append])
                olddata.add(key)
print(data)

with open(FO, 'w') as foutput:
    for key in olddata:
        foutput.write(key + '\n')
This code merges all the data from the files in the 'DI' folder and appends the filename to the export file. Currently, the headers get copied as well.
What I need the code to do: update the master file from all parent files with any unseen records (even ones that differ in just one column), ignoring headers. I think the key is to load all the parent files into a list (or a dict?), then compare it against the child list and take only the differences. That is what I hoped to start doing with the olddata = [] part, but I'm lost trying to figure out how to compare it against data = []. The order of the rows (apart from the header) does not matter.
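To make the comparison I'm after concrete, here is a tiny standalone sketch (made-up rows, not my real files): key each row on its data columns only, then keep the rows whose key the master has not seen.

```python
# Made-up rows (not my real layout): the master already has one record,
# a parent file offers one duplicate and one new row.
old_rows = [
    ["apple", "banana", "strawberry", "file1", "0600"],
]
new_rows = [
    ["apple", "banana", "strawberry"],   # already in master -> skip
    ["apple", "banana", "blueberry"],    # unseen -> keep
]

# Key each row on its three data columns only, ignoring filename/timestamp.
olddata = {"|".join(row[:3]) for row in old_rows}
unseen = [row for row in new_rows if "|".join(row) not in olddata]
print(unseen)
```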
Example of how the program should work:
if export file =
header1|header2|header3|filename|timestamp
and parent files =
File1:
header1|header2|header3|
apple|banana|strawberry
File2:
header1|header2|header3|
apple|banana|blueberry
File3:
header1|header2|header3|
apple|banana|strawberry
pineapple|kiwi|blackberry
Run program, exportfile=
header1|header2|header3|filename|timestamp
apple|banana|strawberry|file1|0600
apple|banana|blueberry|file2|0600
pineapple|kiwi|blackberry|file3|0600
Add new parent file:
File4:
header1|header2|header3|
apple|banana|strawberry
pineapple|kiwi|blackberry
cats|dogs|birds
Run program, exportfile=
header1|header2|header3|filename|timestamp
apple|banana|strawberry|file1|0600
apple|banana|blueberry|file2|0600
pineapple|kiwi|blackberry|file3|0600
cats|dogs|birds|file4|0700
Answer 0 (score: 1)
Well, I don't have any actual data to test with, but I think the proposed solutions share a common problem: they all try to hold the data in memory. If your master file grows large enough, you may run into trouble. I would go with checksums.
import csv
import time
import glob
import hashlib

MASTER_FILE = "master.csv"  # replace with a real name

def checksum(msg):
    hasher = hashlib.md5()
    hasher.update(str(msg).encode())
    return hasher.digest()

def unique_line(line, masterdata):
    return checksum(line) not in masterdata

def append_to_master(line, filename, master_filename):
    with open(master_filename, 'a') as master:
        csv.writer(master).writerow(line + [filename, time.ctime()])

if __name__ == "__main__":
    # load master file data
    masterdata = []
    with open(MASTER_FILE, 'r') as master:
        reader = csv.reader(master)
        next(reader)  # skip header
        # skip the last two fields containing the file name and time stamp
        masterdata = [checksum(line[:-2]) for line in reader]

    # process new data
    files_to_process = glob.glob("*.csv")  # replace with a real pattern
    for filename in files_to_process:
        with open(filename, 'r') as data:
            reader = csv.reader(data)
            next(reader)  # skip header
            for line in reader:
                if unique_line(line, masterdata):
                    append_to_master(line, filename, MASTER_FILE)
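The point of the checksum is memory: each key kept in the set is a fixed 16-byte MD5 digest no matter how wide the row gets. A quick standalone check of that property:

```python
import hashlib

def checksum(msg):
    # Same idea as above: hash the row's string form to a fixed-size digest.
    hasher = hashlib.md5()
    hasher.update(str(msg).encode())
    return hasher.digest()

short = checksum(["a", "b"])
long_ = checksum(["x" * 1000, "y" * 1000, "z" * 1000])
print(len(short), len(long_))  # both digests are 16 bytes
```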
Answer 1 (score: 0)
A few clues to get you going:

- Use a set() to keep track of what you have already seen.
- The items in the set should be strings, and should not include the filename and timestamp columns.

So, for example, loading olddata could look like this:
olddata = set()
with open(FO) as master:
for row in csv.reader(master, delimiter = '|'):
key = '|'.join(row[:3]) # your example has three headers
olddata.add(key)
And when reading the new data:
for i, row in enumerate(csv.reader(finput, delimiter = '|')):
key = '|'.join(row)
if key not in olddata:
to_append = "Filename" if i==0 else input_file
data.append(row+[to_append])
olddata.add(key)
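Putting both hints together, a complete sketch might look like the following. This is an assumption-laden illustration, not the asker's exact setup: merge_parents is a made-up name, the HHMM timestamp format is guessed from the example output, and the demo writes scratch files in a temp folder.

```python
import csv
import glob
import os
import tempfile
import time

def merge_parents(di, fo):
    # Collect keys already in the master, ignoring the
    # filename and timestamp columns.
    olddata = set()
    if os.path.exists(fo):
        with open(fo) as master:
            for row in csv.reader(master, delimiter='|'):
                olddata.add('|'.join(row[:3]))
    stamp = time.strftime('%H%M')      # e.g. '0600' (assumed format)
    with open(fo, 'a', newline='') as out:
        writer = csv.writer(out, delimiter='|')
        for path in sorted(glob.glob(os.path.join(di, '*.txt'))):
            with open(path) as fin:
                reader = csv.reader(fin, delimiter='|')
                next(reader, None)     # skip the header row
                for row in reader:
                    key = '|'.join(row[:3])
                    if key not in olddata:
                        writer.writerow(row[:3] + [path, stamp])
                        olddata.add(key)

# Demo with scratch files: parents in 'in/', master one level up.
tmp = tempfile.mkdtemp()
di = os.path.join(tmp, 'in')
os.makedirs(di)
with open(os.path.join(di, 'file1.txt'), 'w') as f:
    f.write('header1|header2|header3|\napple|banana|strawberry\n')
with open(os.path.join(di, 'file2.txt'), 'w') as f:
    f.write('header1|header2|header3|\napple|banana|strawberry\n'
            'apple|banana|blueberry\n')
master = os.path.join(tmp, 'Export.txt')
merge_parents(di, master)
with open(master) as f:
    rows = [line.rstrip('\n') for line in f]
print(rows)
```

Because unseen keys are added to olddata as files are processed, a row that appears in several parent files is only written once, tagged with the first file it was seen in.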
Answer 2 (score: -1)
Here's my take on your problem. I define a function updateMasterFile() that reads a single source file and the master file. After comparing the two, it writes any new data back to the master file.
import csv
from datetime import datetime as dt

def updateMasterFile(sourceFile, masterFile):
    # open source file and master file
    with open(sourceFile, 'r') as pFile, \
         open(masterFile, 'r') as mFile:
        # read all rows into list (newdata), ignoring the header
        newdata = [row for row in
                   csv.reader(pFile, delimiter="|")][1:]
        # read all rows into list (olddata), ignoring the header and last 2 columns
        olddata = [row[:-2] for row in
                   csv.reader(mFile, delimiter="|")][1:]
    # store any new data
    extra = [row for row in newdata if row not in olddata]
    # open master csv in append mode
    with open(masterFile, 'a', newline="") as mFile:
        # define csv writer
        writer = csv.writer(mFile, delimiter="|")
        # write all 'extra' new data to the master file
        for row in extra:
            # write back with source file name (full path) and time stamp;
            # you may want to trim the file name and define your own time stamp
            writer.writerow(row + [sourceFile,
                                   dt.strftime(dt.now(), "%y%m%d%H%M%S")])
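To drive this over every parent file you can loop with glob. The demo below is a standalone sketch (it repeats the function so it runs by itself, and uses made-up scratch files in a temp folder, not the asker's real paths):

```python
import csv
import glob
import os
import tempfile
from datetime import datetime as dt

def updateMasterFile(sourceFile, masterFile):
    with open(sourceFile, 'r') as pFile, \
         open(masterFile, 'r') as mFile:
        # source rows, header dropped
        newdata = [row for row in csv.reader(pFile, delimiter="|")][1:]
        # master rows without filename/timestamp, header dropped
        olddata = [row[:-2] for row in csv.reader(mFile, delimiter="|")][1:]
    extra = [row for row in newdata if row not in olddata]
    with open(masterFile, 'a', newline="") as mFile:
        writer = csv.writer(mFile, delimiter="|")
        for row in extra:
            writer.writerow(row + [sourceFile,
                                   dt.strftime(dt.now(), "%y%m%d%H%M%S")])

# Scratch demo: a master holding only its header, plus one parent file.
tmp = tempfile.mkdtemp()
masterFile = os.path.join(tmp, 'Export.txt')
with open(masterFile, 'w') as f:
    f.write('header1|header2|header3|filename|timestamp\n')
with open(os.path.join(tmp, 'File1.txt'), 'w') as f:
    f.write('header1|header2|header3|\napple|banana|strawberry\n')

# Run the function once per parent file.
for sourceFile in glob.glob(os.path.join(tmp, 'File*.txt')):
    updateMasterFile(sourceFile, masterFile)

with open(masterFile) as f:
    lines = [line.rstrip('\n') for line in f]
print(lines)
```

Note that the master file must already exist with its header row before the first run, since the function opens it for reading.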