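Thanks in advance for any advice. First-time poster here, so I'll do my best to include all the required information. I'm also a Python beginner who has been working through some online tutorials and doing some copy/paste coding from StackOverflow, so this is FrankenCoding... which means I may well be going about this the wrong way...
I need to compare two CSV files whose number of columns will keep changing, with only two columns that match (for example, email_address in one file and EMAIL in the other). Both files have headers, but the header names may change. File sizes can range from a few thousand rows to 2,000,000+, with possibly 100+ columns (but more likely just a handful).
The output goes to a third "results.csv" file containing all the information. It can be a merge (all unique items), a minus (remove items present in one or the other), or an intersect (items present in both).
I've searched here and found a lot of useful information, but everything I've seen assumes a fixed number of columns in the files. I've tried dict and DictReader, and I know the answer is somewhere in there, but right now I'm a bit confused. Since I haven't made any progress in days and can only spend so much time on this, I'm hoping to be pointed in the right direction.
Ideally, I'd like to learn how to do this myself, which means understanding how the data is "moving".
Extracts of the CSV files are below. I haven't included as many columns as (I think) would be necessary. The data set I have now would be matched on originalid/UID or emailaddress/email, but that won't always be the case.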
Original.csv
"originalid","emailaddress",""
"12345678","Bob@mail.com",""
"23456789","NORMA@EMAIL.COM",""
"34567890","HENRY@some-mail.com",""
"45678901","Analisa@sports.com",""
"56789012","greta@mail.org",""
"67890123","STEVEN@EMAIL.ORG",""
Compare.csv
"email","","DATEOFINVALIDATION_WITH_TIME","OPTOUTDATE_WITH_TIME","EMAIL_USERS"
"Bob@mail.com",,,"true"
"NORMA@EMAIL.COM",,,"true"
"HENRY@some-mail.com",,,"true"
"Henrietta@AWESOME.CA",,,"true"
"NORMAN@sports.CA",,,"true"
"albertina@justemail.CA",,,"true"
The data in results.csv should be all the columns from Original.csv plus all the columns from Compare.csv, except the one that was matched on (email):
"originalid","emailaddress","","DATEOFINVALIDATION_WITH_TIME","OPTOUTDATE_WITH_TIME","EMAIL_USERS"
"12345678","Bob@mail.com","",,,"true"
"23456789","NORMA@EMAIL.COM","",,,"true"
"34567890","HENRY@some-mail.com","",,,"true"
This is the result I get right now:
email,,DATEOFINVALIDATION_WITH_TIME,OPTOUTDATE_WITH_TIME,EMAIL_USERS
Bob@mail.com,,,true,"['12345678', 'Bob@mail.com', '']"
NORMA@EMAIL.COM,,,true,"['23456789', 'NORMA@EMAIL.COM', '']"
HENRY@some-mail.com,,,true,"['34567890', 'HENRY@some-mail.com', '']"
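This is where I am with the code. The print statement shows the matched data from the files on screen, but it doesn't end up in the file in that form, so I'm missing something in there.
***** Also, I'm not getting the headers from the Original.csv file, although the data is coming in.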
import csv

def get_column_from_file(filename, column_name):
    f = open(filename, 'r')
    reader = csv.reader(f)
    headers = next(reader, None)
    i = 0
    max = (len(headers))
    while i < max:
        if headers[i] == column_name:
            column_header = i
            # print(headers[i])
        i = i + 1
    return(column_header)

file_to_check = "Original.csv"
file_console = "Compare.csv"
column_to_read = get_column_from_file(file_console, 'email')
column_to_compare = get_column_from_file(file_to_check, 'emailaddress')

with open(file_console, 'r') as master:
    master_indices = dict((r[1], r) for i, r in enumerate(csv.reader(master)))

with open('Compare.csv', 'r') as hosts:
    with open('results.csv', 'w', newline='') as results:
        reader = csv.reader(hosts)
        writer = csv.writer(results)
        writer.writerow(next(reader, []))
        for row in reader:
            index = master_indices.get(row[0])
            if index is not None:
                print (row +[master_indices.get(row[0])])
                writer.writerow(row +[master_indices.get(row[0])])
Thanks for your time!
Pat
Answer 0 (score: 0)
Right now it looks like you are only calling writerow once, for the headers:
writer.writerow(next(reader, []))
As Francoisco pointed out, uncommenting the last line may solve your problem. You can do that by removing the "#" at the start of the line.
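Answer 1 (score: 0)
I like that you want to do this yourself and that you recognize the need to "understand how the data is moving". That is exactly how you should think about the problem: focus on the movement of the data rather than on the result. Some people may disagree with me, but I think it's a good philosophy because it makes future reuse much easier.
You aren't trying to build a tool that joins two CSVs together. You are organizing data (which happens to come from CSVs) according to a common reference (the email address) and writing the result out as a CSV. Because you're talking about potentially large data sets (+2,000,000 [rows] with possibly +100 columns), it's important to be aware of asymptotic run time. If you don't know what that is, I suggest reading up on Big-O notation and the asymptotic analysis of algorithms. You will probably be fine without it, though.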
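As a rough illustration of why the run time matters, here is a minimal sketch that contrasts scanning a list for every lookup with building a dictionary index once. The sample rows and the helper names (find_by_scanning, build_index) are made up for this example and are not part of the question's code.

```python
# Hypothetical sample data; names and values are illustrative only.
original_rows = [
    ["12345678", "Bob@mail.com"],
    ["23456789", "NORMA@EMAIL.COM"],
    ["34567890", "HENRY@some-mail.com"],
]
emails_to_find = ["NORMA@EMAIL.COM", "Henrietta@AWESOME.CA"]

def find_by_scanning(rows, email):
    """O(n) per lookup: walks the list until a match is found."""
    for row in rows:
        if row[1] == email:
            return row
    return None

def build_index(rows, key_index):
    """O(n) once: maps each key value to its row for O(1) average lookups."""
    return {row[key_index]: row for row in rows}

index = build_index(original_rows, 1)
scanned = [find_by_scanning(original_rows, e) for e in emails_to_find]
indexed = [index.get(e) for e in emails_to_find]
print(scanned == indexed)
```

Both approaches return the same rows, but with millions of rows on each side the scanning version does work proportional to the product of the two file sizes, while the indexed version does work proportional to their sum.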
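First, identify the key in each CSV. You have already done this: 'email' for 'Compare.csv' and 'emailaddress' for 'Original.csv'. Now build your own function that generates a dictionary from a CSV based on that key.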
import csv

def get_dict_from_csv(path_to_csv, key):
    with open(path_to_csv, 'r') as f:
        reader = csv.reader(f)
        headers, *rest = reader  # requires python3
    key_index = headers.index(key)  # find index of key
    # dictionary comprehensions are your friend, just think about what you want the dict to look like
    d = {row[key_index]: row[:key_index] + row[key_index+1:]  # +1 to skip the email entry
         for row in rest}
    headers.remove(key)
    d['HEADERS'] = headers  # add headers so you know what the information in the dict is
    return d
Now you can call this function on both CSVs.
file_console_dict = get_dict_from_csv('Compare.csv', 'email')
file_to_check_dict = get_dict_from_csv('Original.csv', 'emailaddress')
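To make the shape of such a dictionary concrete, here is a small sketch that applies the same idea to an in-memory CSV. The io.StringIO sample and the keyed_rows helper are assumptions made for this illustration, so that it runs without any files on disk.

```python
import csv
import io

# Hypothetical in-memory stand-in for a small CSV file.
sample = io.StringIO(
    '"originalid","emailaddress",""\n'
    '"12345678","Bob@mail.com",""\n'
    '"23456789","NORMA@EMAIL.COM",""\n'
)

def keyed_rows(file_obj, key):
    """Same idea as get_dict_from_csv, but reading from an open file object."""
    headers, *rest = csv.reader(file_obj)
    key_index = headers.index(key)
    # Each key maps to the rest of its row, with the key column removed.
    d = {row[key_index]: row[:key_index] + row[key_index + 1:] for row in rest}
    d['HEADERS'] = headers[:key_index] + headers[key_index + 1:]
    return d

shape = keyed_rows(sample, 'emailaddress')
for k, v in shape.items():
    print(k, v)
```

Each email address maps to the remaining columns of its row, and the special 'HEADERS' entry holds the remaining column names.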
Now you have two dictionaries keyed on the same information. Next we need a function that combines them into a single dictionary.
def combine_dicts(*dicts):
    d, *rest = dicts  # requires python3
    # iteratively pull other dicts into the first one, d
    for r in rest:
        original_headers = d['HEADERS'][:]
        new_headers = r['HEADERS'][:]
        # copy headers
        d['HEADERS'].extend(new_headers)
        # find missing keys
        s = set(d.keys()) - set(r.keys())  # keys present in d but not in r
        for k in s:
            d[k].extend(['', ] * len(new_headers))
        del r['HEADERS']  # we don't want to copy this a second time in the loop below
        for k, v in r.items():
            # use setdefault in case the key didn't exist in the first dict
            d.setdefault(k, ['', ] * len(original_headers)).extend(v)
    return d
Now that you have a single dictionary containing all the information you want, all you need to do is write it back out as a CSV.
def write_dict_to_csv(output_file, d, include_key=False):
    with open(output_file, 'w', newline='') as results:
        writer = csv.writer(results)
        # email isn't in your HEADERS, so you'll need to add it
        if include_key:
            headers = ['email',] + d['HEADERS']
        else:
            headers = d['HEADERS']
        writer.writerow(headers)
        # now remove it from the dict so we can iterate over it without including it twice
        del d['HEADERS']
        for k, v in d.items():
            if include_key:
                row = [k,] + v
            else:
                row = v
            writer.writerow(row)
That should be it. Calling all of this is just
file_console_dict = get_dict_from_csv('Compare.csv', 'email')
file_to_check_dict = get_dict_from_csv('Original.csv', 'emailaddress')
results_dict = combine_dicts(file_to_check_dict, file_console_dict)
write_dict_to_csv('results.csv', results_dict)
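You can easily see how this extends to arbitrarily many dictionaries.
You said you don't want the email included in the final CSV. That seems counterintuitive to me, so I made it an option in write_dict_to_csv() in case you change your mind.
When I run all of the above (passing include_key=True to write_dict_to_csv(), which is why the email column appears), I get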
email,originalid,,,DATEOFINVALIDATION_WITH_TIME,OPTOUTDATE_WITH_TIME,EMAIL_USERS
Bob@mail.com,12345678,,,,true
NORMA@EMAIL.COM,23456789,,,,true
HENRY@some-mail.com,34567890,,,,true
Analisa@sports.com,45678901,,,,,
greta@mail.org,56789012,,,,,
STEVEN@EMAIL.ORG,67890123,,,,,
Henrietta@AWESOME.CA,,,,,true
NORMAN@sports.CA,,,,,true
albertina@justemail.CA,,,,,true
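The output above is the "merge" case. For the "intersect" and "minus" cases mentioned in the question, one possible approach is to use set operations on the dictionary keys. The following is only a sketch under assumptions: the two small dictionaries and the helper names intersect and subtract are hypothetical, and the special 'HEADERS' entry used above would need to be handled separately.

```python
# Hypothetical miniature versions of the keyed dictionaries (no 'HEADERS' entry).
original = {
    "Bob@mail.com": ["12345678"],
    "greta@mail.org": ["56789012"],
}
compare = {
    "Bob@mail.com": ["true"],
    "NORMAN@sports.CA": ["true"],
}

def intersect(a, b):
    """Keep only keys present in both dicts, joining their values."""
    return {k: a[k] + b[k] for k in a.keys() & b.keys()}

def subtract(a, b):
    """Keep only the entries of a whose key does not appear in b."""
    return {k: a[k] for k in a.keys() - b.keys()}

print(intersect(original, compare))
print(subtract(original, compare))
```

Dictionary key views behave like sets, so & and - work on them directly without converting to set() first.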