
Asked: 2015-03-17 17:37:07

Tags: python csv


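  1. Read in a base CSV file. Its field names and number of fields can vary.
  2. Read in a secondary CSV file containing an ID that should match the ID column in the first file, plus a new ID.
  3. Create an output CSV file whose header contains the column headers from file 1 plus the column headers from file 2.
  4. Write rows to the output file that consist of the entire row from the first file plus the matching ID from the second file.

I am struggling with how to correctly create the header row of the output file, and how to include the entire row from the first file together with the matched key from the second. I was able to get this working with csv.reader, but I need to switch to DictReader to avoid hard-coding column numbers, since they can change. Here is my attempt. Any help is much appreciated!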


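    Sample file 1: [{'LEGACY_ID':'123','Random Column':'Ignore me but print me','Another Column':'Ignore me'}, {'LEGACY_ID':'1234','Random Column':'Ignore me but print me','Another Column':'Ignore me too'} ...]

    Sample file 2: [{'NEW_ID':'abc','LEGACY_ID':'123'},{'NEW_ID':'abcd','LEGACY_ID':'1234'} ...]

    Sample output: [{'LEGACY_ID':'123','Random Column':'Ignore me but print me','Another Column':'Ignore me','NEW_ID':'abc'}, {'LEGACY_ID':'1234','Random Column':'Ignore me but print me','Another Column':'Ignore me too','NEW_ID':'abcd'} ...]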

    import csv
    import string
    with open('legacyFile.csv', 'r') as in_leg, open('NewMapping.csv', 'r') as in_map, open('results.csv', 'wb') as out_res:
        c1 = csv.DictReader(in_leg, delimiter=',')
        c2 = csv.DictReader(in_map, delimiter=',') 
        print c1.fieldnames
        print c2.fieldnames
        #set headers and write header row to output file
        File1List = list(c1)
        File2List = list(c2)
        fieldnames = (str(c1.fieldnames) + str(c2.fieldnames)) 
        fieldnames = string.replace(fieldnames, '][', ', ')
        print (fieldnames)
        c3 = csv.DictWriter(out_res, fieldnames=fieldnames)
        print ' c3 ' + c3.fieldnames
        for File1Row in File1List:
            row = 1
            found = False
            print ('ID IS ' + File1Row['ID'])
            for File2Row in File2List:
                if File1Row['ID'] == File2Row['LEGACY_ID']:
                    #need to write the entire File1Row to c3, PLUS the matched ID that is found
                    #c3.writerow(File1Row + File2Row['NEW_ID'])
                    print ('Found New ID of ' +  File2Row['NEW_ID'] + ' at row ' + str(row))
                    found = True
                row += 1
            if not found:
                #need to write the entire File1Row to c3, with null value for non-matching values
                print ('not found')

3 Answers:

Answer 0 (score: 2)


import pandas as pd
df_old = pd.read_csv("legacyFile.csv")
df_new = pd.read_csv("NewMapping.csv")
df_merged = df_old.merge(df_new, left_on="ID", right_on="LEGACY_ID", how="outer")
df_merged.to_csv("combined.csv", index=False)


>>> df_old
   ID col1 col2
0   1    a    b
1   2    c    d
2   3    e    f
3   4    g    h


>>> df_new
   LEGACY_ID  NEW_ID  other_new_column
0          1     100             12.34
1          2     200             56.78
2          4     400             90.12


>>> df_merged
   ID col1 col2  LEGACY_ID  NEW_ID  other_new_column
0   1    a    b          1     100             12.34
1   2    c    d          2     200             56.78
2   3    e    f        NaN     NaN               NaN
3   4    g    h          4     400             90.12
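
The merged frame above carries both `ID` and `LEGACY_ID`, while the question asks for every row of the first file plus the new ID. A minimal sketch of that variant follows, assuming the same column names as in this answer. The inline CSV text is an illustrative stand-in for `legacyFile.csv` and `NewMapping.csv`, and the choice of a left join with the duplicate key dropped is an assumption about what the asker wants, not part of the original answer.

```python
import io

import pandas as pd

# Inline stand-ins for legacyFile.csv and NewMapping.csv (illustrative data).
legacy_csv = io.StringIO("ID,col1,col2\n1,a,b\n2,c,d\n3,e,f\n4,g,h\n")
mapping_csv = io.StringIO("LEGACY_ID,NEW_ID\n1,100\n2,200\n4,400\n")

df_old = pd.read_csv(legacy_csv)
df_new = pd.read_csv(mapping_csv)

# how="left" keeps every row of df_old, whether or not a mapping exists.
df_merged = df_old.merge(df_new, left_on="ID", right_on="LEGACY_ID", how="left")

# LEGACY_ID only repeats ID for matched rows, so it can be dropped.
df_merged = df_merged.drop(columns="LEGACY_ID")

# Unmatched rows hold NaN in NEW_ID; to_csv writes NaN as an empty field.
csv_text = df_merged.to_csv(index=False)
print(csv_text)
```

Because of the NaN, pandas stores `NEW_ID` as floats here. If the IDs should stay as text, passing `dtype=str` to both `read_csv` calls is one way to avoid that conversion.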


Answer 1 (score: 0)

May I suggest csvkit's csvjoin command?


csvjoin --columns LEGACY_ID file1.csv file2.csv > new.csv


Answer 2 (score: 0)



ID  col1    col2
1   a   b
2   c   d
3   e   f
4   g   h


LEGACY_ID   NEW_ID  other_new_column
1   100 12.34
2   200 56.78
4   400 90.12


import csv

with open('legacyFile.csv', 'r') as in_leg, open('NewMapping.csv', 'r') as in_map:
    the_map_reader = csv.DictReader(in_map, delimiter='\t')
    the_map = list(the_map_reader)      # read the whole map file in-memory, to execute searches

    # construct a dict, where LEGACY_ID is the key, and the value is the number of the row, in the map file
    legacy_ids = {row['LEGACY_ID']: row_number for (row_number, row) in enumerate(the_map)}

    # a simple dictionary used for output, when the map file has no such LEGACY_ID key
    missing_map_line = {key: '-' for key in the_map[0]}

    source = csv.DictReader(in_leg, delimiter='\t')

    # 'wb' is the Python 2 mode; on Python 3 use open('output.csv', 'w', newline='')
    with open('output.csv', 'wb') as out_res:
        # the output's columns are the combination of the source's and the map's columns
        writer = csv.DictWriter(out_res, delimiter='\t', fieldnames=source.fieldnames + the_map_reader.fieldnames)
        # to create the header row
        writer.writeheader()
        for row in source:
            # get the number of the row in the map file, where ID == LEGACY_ID
            mapped_row_number = legacy_ids.get(row['ID'], -1)
            # if that row is present - use it, if not - the dummy line created above
            # at this step, if you don't want to output lines where the map file has no entry for this ID,
            # you could just call continue
            # if mapped_row_number == -1 : continue
            mapped_row = the_map[mapped_row_number] if mapped_row_number != -1 else missing_map_line

            # generate the resulting row: the source columns plus the mapped columns
            result_line = row.copy()
            result_line.update(mapped_row)
            # and write it in the output file
            writer.writerow(result_line)


ID  col1    col2    LEGACY_ID   NEW_ID  other_new_column
1   a   b   1   100 12.34
2   c   d   2   200 56.78
3   e   f   -   -   -
4   g   h   4   400 90.12
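
For Python 3, the same lookup-based join can be sketched as follows. This is a minimal sketch under stated assumptions, not the answer author's code: the helper name `join_csv` and its parameters are hypothetical, the lookup stores the mapping rows themselves instead of row numbers, `restval` replaces the hand-built placeholder row, and in-memory text streams stand in for the real files.

```python
import csv
import io


def join_csv(legacy_file, map_file, out_file, delimiter="\t", missing="-"):
    """Write every legacy row plus its matching mapping row (hypothetical helper)."""
    map_reader = csv.DictReader(map_file, delimiter=delimiter)
    # Index the mapping rows by LEGACY_ID so each lookup is a dict access.
    by_legacy_id = {row["LEGACY_ID"]: row for row in map_reader}

    source = csv.DictReader(legacy_file, delimiter=delimiter)
    writer = csv.DictWriter(
        out_file,
        delimiter=delimiter,
        # Header row: columns of file 1 followed by columns of file 2.
        fieldnames=source.fieldnames + map_reader.fieldnames,
        restval=missing,  # written for mapping columns when no match exists
        lineterminator="\n",
    )
    writer.writeheader()
    for row in source:
        merged = dict(row)
        # Add the mapping columns if this ID has an entry, otherwise nothing.
        merged.update(by_legacy_id.get(row["ID"], {}))
        writer.writerow(merged)


# In-memory stand-ins for legacyFile.csv and NewMapping.csv (same sample data).
legacy = io.StringIO("ID\tcol1\tcol2\n1\ta\tb\n2\tc\td\n3\te\tf\n4\tg\th\n")
mapping = io.StringIO(
    "LEGACY_ID\tNEW_ID\tother_new_column\n"
    "1\t100\t12.34\n2\t200\t56.78\n4\t400\t90.12\n"
)
output = io.StringIO()
join_csv(legacy, mapping, output)
result = output.getvalue()
print(result)
```

With real files, the csv module documentation recommends opening them with `newline=''`, for example `open('output.csv', 'w', newline='')`. If the mapping file repeats a `LEGACY_ID`, this sketch keeps the last row seen for it.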
