I have two CSV files that I want to read in as DataFrames using pandas. I want to merge them, but ShowingDateTime must not be duplicated.
If a ShowingDateTime is duplicated, I want to keep the row from the first DataFrame rather than the second. I'm not sure of the best way to do this with pandas.
CSV1:
Address,City,State,ShowingDateTime
1234 Hodge Street,Brown,CA,1/4/17 12:00
9613 Llama Street,Downtown,CA,1/5/17 12:15
7836 Bob Street,Swamp,CA,1/5/17 12:30
2134 Cardinal Street,Ruler,CA,1/6/17 11:30
CSV2:
Address,City,State,ShowingDateTime
10234 Peek Street,Brown,CA,1/4/17 12:00
1122 Kara Street,Downtown,CA,1/5/17 12:30
1023 Solr Street,Swamp,CA,1/6/17 11:30
2234 Tempura Street,Ruler,CA,1/6/17 12:00
Expected merged result (written to CSV after merging the DataFrames):
1234 Hodge Street,Brown,CA,1/4/17 12:00
9613 Llama Street,Downtown,CA,1/5/17 12:15
7836 Bob Street,Swamp,CA,1/5/17 12:30
2134 Cardinal Street,Ruler,CA,1/6/17 11:30
2234 Tempura Street,Ruler,CA,1/6/17 12:00
Answer 0 (score: 3)
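import pandas as pd
df1 = pd.read_csv('path_of_first_csv_file')
df2 = pd.read_csv('path_of_second_csv_file')
# Stack df1 on top of df2 so that df1's rows come first
df3 = pd.concat([df1, df2], ignore_index=True)
# keep="first" retains the df1 row whenever ShowingDateTime repeats
df3 = df3.drop_duplicates(subset='ShowingDateTime', keep="first")
print(df3)
df3.to_csv('output.csv', index=False)  # index=False omits the row-index column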
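As a quick check of the approach above, here is a minimal sketch that runs it on the sample rows from the question. The CSV text is inlined with io.StringIO instead of being read from disk, and the names csv1_text, csv2_text, and merged are introduced only for this illustration.

```python
import io
import pandas as pd

# Sample data copied from the question; inlined so the sketch is self-contained
csv1_text = """Address,City,State,ShowingDateTime
1234 Hodge Street,Brown,CA,1/4/17 12:00
9613 Llama Street,Downtown,CA,1/5/17 12:15
7836 Bob Street,Swamp,CA,1/5/17 12:30
2134 Cardinal Street,Ruler,CA,1/6/17 11:30
"""
csv2_text = """Address,City,State,ShowingDateTime
10234 Peek Street,Brown,CA,1/4/17 12:00
1122 Kara Street,Downtown,CA,1/5/17 12:30
1023 Solr Street,Swamp,CA,1/6/17 11:30
2234 Tempura Street,Ruler,CA,1/6/17 12:00
"""

df1 = pd.read_csv(io.StringIO(csv1_text))
df2 = pd.read_csv(io.StringIO(csv2_text))

# df1 is listed first, so keep='first' prefers its rows on a duplicate
merged = pd.concat([df1, df2], ignore_index=True)
merged = merged.drop_duplicates(subset='ShowingDateTime', keep='first')
merged = merged.reset_index(drop=True)  # renumber rows after dropping duplicates

print(merged['Address'].tolist())
```

Only the 2234 Tempura Street row survives from the second file, because its ShowingDateTime is the only one that does not already appear in the first file.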
Answer 1 (score: 1)
You want concat() rather than merge.
First, load each CSV.
import pandas as pd
df1 = pd.read_csv('csv1.csv')
df2 = pd.read_csv('csv2.csv')
Then concatenate the two DataFrames.
final_df = pd.concat([df1, df2], join='outer', ignore_index=True)  # join='outer' keeps the union of columns
Then drop rows with a duplicate ShowingDateTime, keeping the df1 row in those cases.
final_df = final_df.drop_duplicates(subset=['ShowingDateTime'], keep='first')  # returns a new DataFrame, so assign it
Then save the result as a CSV.
final_df.to_csv('final.csv', index=False)  # index=False omits the row-index column
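One detail in these steps is easy to miss: drop_duplicates() returns a new DataFrame rather than modifying the one it is called on, so the result has to be assigned back (or inplace=True passed). A small sketch with made-up rows, where the frame name demo and its values are illustrative only:

```python
import pandas as pd

# Illustrative data: the first two rows share a ShowingDateTime
demo = pd.DataFrame({
    'Address': ['A Street', 'B Street', 'C Street'],
    'ShowingDateTime': ['1/4/17 12:00', '1/4/17 12:00', '1/5/17 12:15'],
})

# Calling without assignment discards the result; demo is unchanged
demo.drop_duplicates(subset=['ShowingDateTime'], keep='first')
rows_without_assignment = len(demo)

# Assigning the return value keeps the deduplicated frame
deduped = demo.drop_duplicates(subset=['ShowingDateTime'], keep='first')
rows_with_assignment = len(deduped)

print(rows_without_assignment, rows_with_assignment)
```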
Answer 2 (score: 0)
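I suggest searching other questions a little; you will find more efficient approaches, such as the one here. It outlines how to use df.index.duplicated(keep='first') to improve performance when working with large datasets. In your case, it can be done as follows: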
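import os
import pandas as pd
directory = './records/'
# sorted() makes the file order deterministic, so rows from the first file win
all_files = sorted(f for f in os.listdir(directory) if f.endswith('.csv'))
df = pd.concat(pd.read_csv(os.path.join(directory, f), index_col=3) for f in all_files)  # ShowingDateTime as index
df = df[~df.index.duplicated(keep='first')]  # keep only the first row per index value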
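With this approach ShowingDateTime ends up as the index rather than a regular column, so writing the frame out directly would put it first. The sketch below shows one way to restore the original column order before saving. It uses a few rows from the question inlined via io.StringIO, and the names first_text, second_text, and out are assumptions made for the illustration.

```python
import io
import pandas as pd

# A few rows from the question, inlined instead of read from ./records/
first_text = """Address,City,State,ShowingDateTime
1234 Hodge Street,Brown,CA,1/4/17 12:00
7836 Bob Street,Swamp,CA,1/5/17 12:30
"""
second_text = """Address,City,State,ShowingDateTime
10234 Peek Street,Brown,CA,1/4/17 12:00
2234 Tempura Street,Ruler,CA,1/6/17 12:00
"""

# index_col=3 makes ShowingDateTime the index of each frame
frames = [pd.read_csv(io.StringIO(text), index_col=3)
          for text in (first_text, second_text)]

df = pd.concat(frames)
df = df[~df.index.duplicated(keep='first')]  # first occurrence of each index wins

# Move ShowingDateTime back to a column and restore the original column order
out = df.reset_index()[['Address', 'City', 'State', 'ShowingDateTime']]
print(out)
```

From here, out.to_csv(..., index=False) would write the rows in the same layout as the input files.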