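Merge multiple CSV files and remove duplicates by a field

Time: 2018-05-01 23:12:19

Tags: python python-3.x pandas csv

I need to match data from multiple CSV files. For example, suppose I have three CSV files.

Input 1 csv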

PANYNJ LGA WEST 1,available, LGA West GarageFlushing
PANYNJ LGA WEST 4,unavailable,LGA West Garage
iPark - Tesla,unavailable,530 E 80th St

Input 2 csv

PANYNJ LGA WEST 4,unavailable,LGA West Garage
PANYNJ LGA WEST 5,available,LGA West Garage

Input 3 csv

PANYNJ LGA WEST 5,available,LGA West Garage
imPark - Tesla,unavailable,611 E 83rd St

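The first column is name, the second is status, and the last is address. I want to merge these three files into one CSV file, keeping a single row for entries that share the same name. My desired output file looks like this:

Output csv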

PANYNJ LGA WEST 1,available, LGA West GarageFlushing
PANYNJ LGA WEST 4,unavailable,LGA West Garage
iPark - Tesla,unavailable,530 E 80th St
PANYNJ LGA WEST 5,available,LGA West Garage
imPark - Tesla,unavailable,611 E 83rd St

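I am trying to use pandas to process the CSV files, but I am not sure how to approach this.

Any help is much appreciated!

2 answers:

Answer 0: (score: 1)

With pandas, you can use pd.concat followed by DataFrame.drop_duplicates: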

import pandas as pd
from io import StringIO

str1 = StringIO("""PANYNJ LGA WEST 1,available, LGA West GarageFlushing
PANYNJ LGA WEST 4,unavailable,LGA West Garage
iPark - Tesla,unavailable,530 E 80th St""")

str2 = StringIO("""PANYNJ LGA WEST 4,unavailable,LGA West Garage
PANYNJ LGA WEST 5,available,LGA West Garage""")

str3 = StringIO("""PANYNJ LGA WEST 5,available,LGA West Garage
imPark - Tesla,unavailable,611 E 83rd St""")

# replace str1, str2, str3 with 'file1.csv', 'file2.csv', 'file3.csv'
df1 = pd.read_csv(str1, header=None)
df2 = pd.read_csv(str2, header=None)
df3 = pd.read_csv(str3, header=None)

# keep the first row seen for each name (column 0)
res = pd.concat([df1, df2, df3], ignore_index=True).drop_duplicates(subset=0)

print(res)

                   0            1                         2
0  PANYNJ LGA WEST 1    available   LGA West GarageFlushing
1  PANYNJ LGA WEST 4  unavailable           LGA West Garage
2      iPark - Tesla  unavailable             530 E 80th St
4  PANYNJ LGA WEST 5    available           LGA West Garage
6     imPark - Tesla  unavailable             611 E 83rd St
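
To work from file paths instead of in-memory strings and write the result back to disk, the same approach can be wrapped in a small helper. This is only a minimal sketch: the function name `merge_csvs` and the column labels `name`, `status`, `address` are assumptions chosen for illustration, and the input files are assumed to have no header row, as in the question.

```python
import pandas as pd


def merge_csvs(paths, output_path, columns=("name", "status", "address")):
    """Concatenate headerless CSV files and keep the first row seen for each name."""
    # header=None because the input files have no header row;
    # names=... assigns the (assumed) column labels.
    frames = [pd.read_csv(p, header=None, names=list(columns)) for p in paths]
    merged = pd.concat(frames, ignore_index=True)
    # keep="first" is the default; it is spelled out here for clarity.
    merged = merged.drop_duplicates(subset=columns[0], keep="first")
    # Write without the index and without a header, matching the input layout.
    merged.to_csv(output_path, header=False, index=False)
    return merged
```

It could be called as `merge_csvs(['file1.csv', 'file2.csv', 'file3.csv'], 'output.csv')`, where the file names are placeholders. Passing `keep="last"` to `drop_duplicates` instead would retain the most recently read row for each name.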

Answer 1: (score: 0)

import csv


def combine_and_dedupe(files_to_combine, output_file, filter_column, fieldnames):
    '''
    Combine multiple CSV files into one final CSV file, removing duplicates
    based on one column that uniquely identifies the entry (ex: name, ID, email, etc.)
    '''
    added = set()  # values of filter_column that have already been written
    with open(output_file, 'w', newline='', encoding='utf-8-sig') as outfile:
        writer = csv.DictWriter(outfile, fieldnames=fieldnames, lineterminator='\n')
        writer.writeheader()
        for file in files_to_combine:
            # DictReader takes the column names from the first row of each file
            with open(file, newline='', encoding='utf-8-sig') as infile:
                reader = csv.DictReader(infile)
                for row in reader:
                    if row[filter_column] not in added:
                        added.add(row[filter_column])
                        writer.writerow(row)
                    else:
                        print('Duplicate')

I created the function above to do exactly what you want.

files_to_combine is a list of the CSV files to merge, e.g. ['miami_clients.csv', 'los_angeles_clients.csv']

output_file is the name of the output file

filter_column is the column that uniquely identifies an entry and is used to detect duplicates

fieldnames is the list of field names (the header row) shared by the CSV files
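
Note that `csv.DictReader` treats the first row of each file as a header, so the function above expects input files that start with a header row. For headerless files like those in the question, a variant based on `csv.reader` and a column index might look like the following. This is a sketch under that assumption; `combine_headerless` and `key_index` are hypothetical names, not part of the original answer.

```python
import csv


def combine_headerless(files_to_combine, output_file, key_index=0):
    """Merge headerless CSV files, keeping the first row seen for each key value."""
    seen = set()  # key values that have already been written
    with open(output_file, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out, lineterminator="\n")
        for path in files_to_combine:
            with open(path, newline="", encoding="utf-8") as src:
                for row in csv.reader(src):
                    if not row:
                        continue  # skip blank lines
                    key = row[key_index]
                    if key not in seen:
                        seen.add(key)
                        writer.writerow(row)
```

Using a set for `seen` keeps each membership check at roughly constant time, which matters once the files hold many rows.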