I have a few different questions about this script. The goal is straightforward, and I've found some similar examples, but nothing I can get working.
I'm struggling with how to correctly create the header row of the output file, and how to include the entire row from the first file plus the matching key from the second. I was able to get this working with reader, but needed to switch to DictReader to avoid hard-coding column numbers, since they can change. Here is my attempt. Any help is much appreciated!
Here are some sample files:
Sample file 1: [{'LEGACY_ID':'123','Random Column':'ignore me but print me','Another Column':'ignore me'}, {'LEGACY_ID':'1234','Random Column':'ignore me but print me','Another Column':'ignore me too'} ...]
Sample file 2: [{'NEW_ID':'abc','LEGACY_ID':'123'},{'NEW_ID':'abcd','LEGACY_ID':'1234'} ...]
Sample output: [{'LEGACY_ID':'123','Random Column':'ignore me but print me','Another Column':'ignore me','NEW_ID':'abc'}, {'LEGACY_ID':'1234','Random Column':'ignore me but print me','Another Column':'ignore me','NEW_ID':'abcd'} ...]
import csv
import string

with open('legacyFile.csv', 'r') as in_leg, open('NewMapping.csv', 'r') as in_map, open('results.csv', 'wb') as out_res:
    c1 = csv.DictReader(in_leg, delimiter=',')
    c2 = csv.DictReader(in_map, delimiter=',')
    print c1.fieldnames
    print c2.fieldnames
    # set headers and write header row to output file
    File1List = list(c1)
    File2List = list(c2)
    fieldnames = (str(c1.fieldnames) + str(c2.fieldnames))
    fieldnames = string.replace(fieldnames, '][', ', ')
    print (fieldnames)
    c3 = csv.DictWriter(out_res, fieldnames=fieldnames)
    c3.writeheader()
    print ' c3 ' + c3.fieldnames
    for File1Row in File1List:
        row = 1
        found = False
        print ('ID IS ' + File1Row['ID'])
        for File2Row in File2List:
            if File1Row['ID'] == File2Row['LEGACY_ID']:
                # need to write the entire File1Row to c3, PLUS the matched ID that is found
                #c3.writerow(File1Row + File2Row['NEW_ID'])
                print ('Found New ID of ' + File2Row['NEW_ID'] + ' at row ' + str(row))
                found = True
                break
            row += 1
        if not found:
            # need to write the entire File1Row to c3, with null value for non-matching values
            print ('not found')

in_leg.close()
in_map.close()
out_res.close()
Answer 0 (score: 2)
Hopefully someone else will give an example based on your pure-Python code, but just to show you how to do this in pandas with some mock data:
import pandas as pd
df_old = pd.read_csv("legacyFile.csv")
df_new = pd.read_csv("NewMapping.csv")
df_merged = df_old.merge(df_new, left_on="ID", right_on="LEGACY_ID", how="outer")
df_merged.to_csv("combined.csv", index=False)
This code merges a DataFrame (similar to a table or an Excel sheet) that looks like
>>> df_old
ID col1 col2
0 1 a b
1 2 c d
2 3 e f
3 4 g h
with one like
>>> df_new
LEGACY_ID NEW_ID other_new_column
0 1 100 12.34
1 2 200 56.78
2 4 400 90.12
into the object
>>> df_merged
ID col1 col2 LEGACY_ID NEW_ID other_new_column
0 1 a b 1 100 12.34
1 2 c d 2 200 56.78
2 3 e f NaN NaN NaN
3 4 g h 4 400 90.12
and writes it to a csv file. Here I've kept row 3, which has no match in the NewMapping file, but we could just as easily keep only the rows that match exactly.
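If only the matched rows are wanted, the same merge can be run with how="inner"; a minimal sketch using in-memory DataFrames that mirror the mock data above (rather than the original csv files):

```python
import pandas as pd

# mock data mirroring the example tables above
df_old = pd.DataFrame({"ID": [1, 2, 3, 4],
                       "col1": list("aceg"),
                       "col2": list("bdfh")})
df_new = pd.DataFrame({"LEGACY_ID": [1, 2, 4],
                       "NEW_ID": [100, 200, 400],
                       "other_new_column": [12.34, 56.78, 90.12]})

# how="inner" keeps only rows with a match on both sides,
# so ID 3 drops out of the result
df_matched = df_old.merge(df_new, left_on="ID", right_on="LEGACY_ID", how="inner")
print(df_matched)
```

With how="inner", the row for ID 3 disappears from the output because it has no LEGACY_ID counterpart; how="left" would instead keep every legacy row and fill the missing mapping columns with NaN.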
Answer 1 (score: 0)
The csvkit package provides the csvjoin tool, which will allow you to do
csvjoin --columns LEGACY_ID file1.csv file2.csv > new.csv
to get the new csv file
Answer 2 (score: 0)
Piggybacking on DSM's example files, here is a pure-Python solution. Since it is rather verbose, it is effectively documented by the inline comments.
Given legacyFile.csv
ID col1 col2
1 a b
2 c d
3 e f
4 g h
and NewMapping.csv
LEGACY_ID NEW_ID other_new_column
1 100 12.34
2 200 56.78
4 400 90.12
The solution:
import csv

with open('legacyFile.csv', 'r') as in_leg, open('NewMapping.csv', 'r') as in_map:
    the_map_reader = csv.DictReader(in_map, delimiter='\t')
    the_map = list(the_map_reader)  # read the whole map file in-memory, to execute searches
    # construct a dict, where LEGACY_ID is the key, and the value is the number of the row in the map file
    legacy_ids = {row['LEGACY_ID']: row_number for (row_number, row) in enumerate(the_map)}
    # a simple dictionary used for output, when the map file has no such LEGACY_ID key
    missing_map_line = {key: '-' for key in the_map[0]}
    source = csv.DictReader(in_leg, delimiter='\t')
    with open('output.csv', 'wb') as out_res:
        # the output's columns are the combination of the source file's and the map file's
        writer = csv.DictWriter(out_res, delimiter='\t', fieldnames=source.fieldnames + the_map_reader.fieldnames)
        # create the header row
        writer.writeheader()
        for row in source:
            # get the number of the row in the map file, where ID == LEGACY_ID
            mapped_row_number = legacy_ids.get(row['ID'], -1)
            # if that row is present - use it, if not - the dummy line created above
            # at this step, if you don't want to output lines where the map file has no entry for this ID,
            # you could just call continue:
            # if mapped_row_number == -1: continue
            mapped_row = the_map[mapped_row_number] if mapped_row_number != -1 else missing_map_line
            # generate the resulting row
            result_line = row.copy()
            result_line.update(mapped_row)
            # and write it to the output file
            writer.writerow(result_line)
The result is:
ID col1 col2 LEGACY_ID NEW_ID other_new_column
1 a b 1 100 12.34
2 c d 2 200 56.78
3 e f - - -
4 g h 4 400 90.12
This solution is unlikely to be faster than pandas, especially on huge datasets.
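The same lookup-dict technique can be exercised without touching the filesystem; this is a minimal Python 3 sketch of the approach over in-memory strings (note that in Python 3 an output file would be opened with newline='' rather than 'wb', and the sample data here is made up to mirror the tables above):

```python
import csv
import io

# in-memory stand-ins for legacyFile.csv and NewMapping.csv
legacy_csv = "ID\tcol1\tcol2\n1\ta\tb\n2\tc\td\n3\te\tf\n4\tg\th\n"
map_csv = "LEGACY_ID\tNEW_ID\tother_new_column\n1\t100\t12.34\n2\t200\t56.78\n4\t400\t90.12\n"

source = csv.DictReader(io.StringIO(legacy_csv), delimiter='\t')
the_map = list(csv.DictReader(io.StringIO(map_csv), delimiter='\t'))

# map LEGACY_ID straight to its full row, so each lookup is O(1)
# instead of scanning the map file per source row
by_legacy_id = {row['LEGACY_ID']: row for row in the_map}
missing = {key: '-' for key in the_map[0]}

out = io.StringIO()
writer = csv.DictWriter(out, delimiter='\t',
                        fieldnames=source.fieldnames + list(the_map[0]))
writer.writeheader()
for row in source:
    merged = dict(row)                              # the entire source row...
    merged.update(by_legacy_id.get(row['ID'], missing))  # ...plus the mapped columns
    writer.writerow(merged)

print(out.getvalue())
```

Mapping LEGACY_ID directly to the row (instead of to a row number) skips the extra indexing step, but the output is the same either way.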