I need to match data from multiple CSV files. For example, suppose I have three CSV files.
Input 1 csv
PANYNJ LGA WEST 1,available, LGA West GarageFlushing
PANYNJ LGA WEST 4,unavailable,LGA West Garage
iPark - Tesla,unavailable,530 E 80th St
Input 2 csv
PANYNJ LGA WEST 4,unavailable,LGA West Garage
PANYNJ LGA WEST 5,available,LGA West Garage
Input 3 csv
PANYNJ LGA WEST 5,available,LGA West Garage
imPark - Tesla,unavailable,611 E 83rd St
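The first column is name, the second column is status, and the last column is address. I want to merge these three files into a single CSV file, so that rows with the same name appear only once. My desired output file looks like this:
Output csv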
PANYNJ LGA WEST 1,available, LGA West GarageFlushing
PANYNJ LGA WEST 4,unavailable,LGA West Garage
iPark - Tesla,unavailable,530 E 80th St
PANYNJ LGA WEST 5,available,LGA West Garage
imPark - Tesla,unavailable,611 E 83rd St
I am trying to use pandas or the csv module to solve this, but I am not sure how to approach it.
Any help is much appreciated!
Answer 0 (score: 1)
With pandas, you can use pd.concat followed by DataFrame.drop_duplicates:
import pandas as pd
from io import StringIO
str1 = StringIO("""PANYNJ LGA WEST 1,available, LGA West GarageFlushing
PANYNJ LGA WEST 4,unavailable,LGA West Garage
iPark - Tesla,unavailable,530 E 80th St""")
str2 = StringIO("""PANYNJ LGA WEST 4,unavailable,LGA West Garage
PANYNJ LGA WEST 5,available,LGA West Garage""")
str3 = StringIO("""PANYNJ LGA WEST 5,available,LGA West Garage
imPark - Tesla,unavailable,611 E 83rd St""")
# replace str1, str2, str3 with 'file1.csv', 'file2.csv', 'file3.csv'
df1 = pd.read_csv(str1, header=None)
df2 = pd.read_csv(str2, header=None)
df3 = pd.read_csv(str3, header=None)
res = pd.concat([df1, df2, df3], ignore_index=True)\
        .drop_duplicates(subset=0)  # keep the first row for each name (column 0)
print(res)
                   0            1                         2
0  PANYNJ LGA WEST 1    available   LGA West GarageFlushing
1  PANYNJ LGA WEST 4  unavailable           LGA West Garage
2      iPark - Tesla  unavailable             530 E 80th St
4  PANYNJ LGA WEST 5    available           LGA West Garage
6     imPark - Tesla  unavailable             611 E 83rd St
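To get from the printed frame to an actual output file, the same idea can be wrapped in a small helper. This is only a sketch: the function name `merge_csvs`, the `key` parameter, and the file paths are assumptions for illustration, and the column labels are taken from the question's description of the data.

```python
import pandas as pd


def merge_csvs(paths, output_path, key="name"):
    """Concatenate headerless CSV files and keep the first row seen for each key."""
    columns = ["name", "status", "address"]  # labels taken from the question
    frames = [pd.read_csv(p, header=None, names=columns) for p in paths]
    merged = pd.concat(frames, ignore_index=True).drop_duplicates(subset=key)
    # Write without header or index so the result has the same shape as the inputs.
    merged.to_csv(output_path, header=False, index=False)
    return merged
```

For the question's data this might be called as `merge_csvs(['file1.csv', 'file2.csv', 'file3.csv'], 'output.csv')`, where the file names are placeholders.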
Answer 1 (score: 0)
import csv

def combine_and_dedupe(files_to_combine, output_file, filter_column, fieldnames):
    '''
    Combine multiple CSV files into one final CSV file, removing duplicates
    based on one column that uniquely identifies the entry (ex: name, ID, email, etc.)
    '''
    added = set()  # values of filter_column that have already been written
    with open(output_file, 'w', newline='', encoding='utf-8-sig') as outfile:
        writer = csv.DictWriter(outfile, fieldnames=fieldnames, lineterminator='\n')
        writer.writeheader()
        for file in files_to_combine:
            # DictReader takes the first row of each input file as its header
            with open(file, newline='', encoding='utf-8-sig') as csvfile:
                reader = csv.DictReader(csvfile)
                for row in reader:
                    if row[filter_column] not in added:
                        added.add(row[filter_column])
                        writer.writerow(row)
                    else:
                        print('Duplicate')
Here is a function I created to do exactly what you want.
files_to_combine is a list of the CSV files, e.g. ['miami_clients.csv', 'los_angeles_clients.csv']
output_file is the name of the output file
filter_column is the column that uniquely identifies entries and is used to check for duplicates
fieldnames is the list of field names for the CSV files
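Note that the sample files in the question have no header row, whereas csv.DictReader treats the first line of each file as the header. A possible variant for headerless files is sketched below; the function name `combine_headerless` and the `key_index` parameter are hypothetical, and it assumes the name sits in the first column as in the question.

```python
import csv


def combine_headerless(files_to_combine, output_file, key_index=0):
    """Merge CSV files that have no header row, keeping the first row per key."""
    seen = set()  # key values that have already been written
    with open(output_file, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out, lineterminator="\n")
        for path in files_to_combine:
            with open(path, newline="", encoding="utf-8") as src:
                for row in csv.reader(src):
                    if not row:
                        continue  # skip blank lines
                    if row[key_index] not in seen:
                        seen.add(row[key_index])
                        writer.writerow(row)
```

Using a set for the seen keys keeps each membership check cheap even when the inputs are large.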