I have a few different questions about this script. The goal is straightforward, and I've found some similar examples, but nothing I can get working.
I'm struggling with how to correctly create the header row of the output file, and how to include the entire row from the first file plus the matching key from the second. I was able to get this working with reader, but needed to switch to DictReader to avoid hard-coding column numbers, since they can change. Here is my attempt. Any help is much appreciated!
Here are some sample files:
Sample file 1: [{'LEGACY_ID':'123','Random Column':'ignore me but print me','Another Column':'ignore me'}, {'LEGACY_ID':'1234','Random Column':'ignore me but print me','Another Column':'ignore me too'} ...]
Sample file 2: [{'NEW_ID':'abc','LEGACY_ID':'123'},{'NEW_ID':'abcd','LEGACY_ID':'1234'} ...]
Sample output: [{'LEGACY_ID':'123','Random Column':'ignore me but print me','Another Column':'ignore me','NEW_ID':'abc'}, {'LEGACY_ID':'1234','Random Column':'ignore me but print me','Another Column':'ignore me','NEW_ID':'abcd'} ...]
import csv
import string

with open('legacyFile.csv', 'r') as in_leg, open('NewMapping.csv', 'r') as in_map, open('results.csv', 'wb') as out_res:
    c1 = csv.DictReader(in_leg, delimiter=',')
    c2 = csv.DictReader(in_map, delimiter=',')
    print c1.fieldnames
    print c2.fieldnames
    # set headers and write header row to output file
    File1List = list(c1)
    File2List = list(c2)
    fieldnames = (str(c1.fieldnames) + str(c2.fieldnames))
    fieldnames = string.replace(fieldnames, '][', ', ')
    print (fieldnames)
    c3 = csv.DictWriter(out_res, fieldnames=fieldnames)
    c3.writeheader()
    print ' c3 ' + c3.fieldnames
    for File1Row in File1List:
        row = 1
        found = False
        print ('ID IS ' + File1Row['ID'])
        for File2Row in File2List:
            if File1Row['ID'] == File2Row['LEGACY_ID']:
                # need to write the entire File1Row to c3, PLUS the matched ID that is found
                #c3.writerow(File1Row + File2Row['NEW_ID'])
                print ('Found New ID of ' + File2Row['NEW_ID'] + ' at row ' + str(row))
                found = True
                break
            row += 1
        if not found:
            # need to write the entire File1Row to c3, with null value for non-matching values
            print ('not found')

in_leg.close()
in_map.close()
out_res.close()
Answer 0 (score: 2)
Hopefully someone else will give an example based on your pure-Python code, but just to show you how to do this in pandas with some mock data:
import pandas as pd
df_old = pd.read_csv("legacyFile.csv")
df_new = pd.read_csv("NewMapping.csv")
df_merged = df_old.merge(df_new, left_on="ID", right_on="LEGACY_ID", how="outer")
df_merged.to_csv("combined.csv", index=False)
This code merges a DataFrame (similar to a table or an Excel sheet) that looks like
>>> df_old
ID col1 col2
0 1 a b
1 2 c d
2 3 e f
3 4 g h
with one like
>>> df_new
LEGACY_ID NEW_ID other_new_column
0 1 100 12.34
1 2 200 56.78
2 4 400 90.12
into the object
>>> df_merged
ID col1 col2 LEGACY_ID NEW_ID other_new_column
0 1 a b 1 100 12.34
1 2 c d 2 200 56.78
2 3 e f NaN NaN NaN
3 4 g h 4 400 90.12
and writes it to a csv file. Here I've kept row 3, which has no match in the NewMapping file, but we could just as easily keep only the rows that match exactly.
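If only the matched rows are wanted, the same merge can be run with how="inner"; a minimal sketch using in-memory DataFrames that mirror the mock data above (rather than the original csv files):

```python
import pandas as pd

# mock data mirroring the example tables above
df_old = pd.DataFrame({"ID": [1, 2, 3, 4],
                       "col1": list("aceg"),
                       "col2": list("bdfh")})
df_new = pd.DataFrame({"LEGACY_ID": [1, 2, 4],
                       "NEW_ID": [100, 200, 400],
                       "other_new_column": [12.34, 56.78, 90.12]})

# how="inner" keeps only rows with a match on both sides,
# so ID 3 drops out of the result
df_matched = df_old.merge(df_new, left_on="ID", right_on="LEGACY_ID", how="inner")
print(df_matched)
```

With how="inner", the row for ID 3 disappears from the output because it has no LEGACY_ID counterpart; how="left" would instead keep every legacy row and fill the missing mapping columns with NaN.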
Answer 1 (score: 0)
The csvkit package provides the csvjoin tool, which will allow you to do
csvjoin --columns LEGACY_ID file1.csv file2.csv > new.csv
to get the new csv file
Answer 2 (score: 0)
Piggybacking on DSM's example files, here is a pure-Python solution. Since it is rather verbose, it is effectively documented by the inline comments.
Given legacyFile.csv
ID col1 col2
1 a b
2 c d
3 e f
4 g h
and NewMapping.csv
LEGACY_ID NEW_ID other_new_column
1 100 12.34
2 200 56.78
4 400 90.12
The solution:
import csv

with open('legacyFile.csv', 'r') as in_leg, open('NewMapping.csv', 'r') as in_map:
    the_map_reader = csv.DictReader(in_map, delimiter='\t')
    the_map = list(the_map_reader)  # read the whole map file in-memory, to execute searches
    # construct a dict, where LEGACY_ID is the key, and the value is the number of the row in the map file
    legacy_ids = {row['LEGACY_ID']: row_number for (row_number, row) in enumerate(the_map)}
    # a simple dictionary used for output, when the map file has no such LEGACY_ID key
    missing_map_line = {key: '-' for key in the_map[0]}
    source = csv.DictReader(in_leg, delimiter='\t')
    with open('output.csv', 'wb') as out_res:
        # the output's columns are the combination of the source file's and the map file's
        writer = csv.DictWriter(out_res, delimiter='\t', fieldnames=source.fieldnames + the_map_reader.fieldnames)
        # create the header row
        writer.writeheader()
        for row in source:
            # get the number of the row in the map file, where ID == LEGACY_ID
            mapped_row_number = legacy_ids.get(row['ID'], -1)
            # if that row is present - use it, if not - the dummy line created above
            # at this step, if you don't want to output lines where the map file has no entry for this ID,
            # you could just call continue:
            # if mapped_row_number == -1: continue
            mapped_row = the_map[mapped_row_number] if mapped_row_number != -1 else missing_map_line
            # generate the resulting row
            result_line = row.copy()
            result_line.update(mapped_row)
            # and write it to the output file
            writer.writerow(result_line)
The result is:
ID col1 col2 LEGACY_ID NEW_ID other_new_column
1 a b 1 100 12.34
2 c d 2 200 56.78
3 e f - - -
4 g h 4 400 90.12
This solution is unlikely to be faster than pandas, especially on huge datasets.
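The same lookup-dict technique can be exercised without touching the filesystem; this is a minimal Python 3 sketch of the approach over in-memory strings (note that in Python 3 an output file would be opened with newline='' rather than 'wb', and the sample data here is made up to mirror the tables above):

```python
import csv
import io

# in-memory stand-ins for legacyFile.csv and NewMapping.csv
legacy_csv = "ID\tcol1\tcol2\n1\ta\tb\n2\tc\td\n3\te\tf\n4\tg\th\n"
map_csv = "LEGACY_ID\tNEW_ID\tother_new_column\n1\t100\t12.34\n2\t200\t56.78\n4\t400\t90.12\n"

source = csv.DictReader(io.StringIO(legacy_csv), delimiter='\t')
the_map = list(csv.DictReader(io.StringIO(map_csv), delimiter='\t'))

# map LEGACY_ID straight to its full row, so each lookup is O(1)
# instead of scanning the map file per source row
by_legacy_id = {row['LEGACY_ID']: row for row in the_map}
missing = {key: '-' for key in the_map[0]}

out = io.StringIO()
writer = csv.DictWriter(out, delimiter='\t',
                        fieldnames=source.fieldnames + list(the_map[0]))
writer.writeheader()
for row in source:
    merged = dict(row)                              # the entire source row...
    merged.update(by_legacy_id.get(row['ID'], missing))  # ...plus the mapped columns
    writer.writerow(merged)

print(out.getvalue())
```

Mapping LEGACY_ID directly to the row (instead of to a row number) skips the extra indexing step, but the output is the same either way.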