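Thank you for taking the time to read my post.
I am using Python pandas to merge information from a number of CSV and TSV files. When I perform the second merge, rows are duplicated in the resulting DataFrame. I assume I am missing something basic in the merge call, but I have not figured it out yet.
Code: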
from pandas import DataFrame, read_csv
import matplotlib.pyplot as plt
import pandas as pd
import sys
import matplotlib
# Enable inline plotting
%matplotlib inline
# read data into dataframes
ticketdata = r'/pathto.csv'
userdata = r'/pathto.csv'
shipmentdata = r'/pathto.tsv'
tickets_df = pd.read_csv(ticketdata, usecols=['Id', 'Requester', 'Created at', 'Requester email',
                                              'Requester external id'])
users_df = pd.read_csv(userdata, usecols=['External ID', 'Printers', 'Organization Title'])
shipment_df = pd.read_csv(shipmentdata, delimiter='\t', usecols=['Cust', 'Printer ID'])
# Clean up tickets_df & shipment_df
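# Change "Requester external id" to "External ID" to support the merge.
# Rename by name rather than by position: read_csv returns columns in file order,
# not in usecols order, so assigning a list to .columns can mislabel them.
tickets_df = tickets_df.rename(columns={'Id': 'Ticket Id', 'Requester external id': 'External ID'})
shipment_df = shipment_df.rename(columns={'Cust': 'VAR', 'Printer ID': 'Printers'})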
# Change column order for the sake of readability
tickets_df = tickets_df[['Ticket Id','Requester','Created at',"Requester email","External ID"]]
# Replace NaN with 0 (this fills every column of tickets_df, including External ID) and merge data
tickets_df.fillna(0, inplace=True)
merge1_df = pd.merge(tickets_df, users_df, on=['External ID'], how='left')
merge1_df = merge1_df[['Ticket Id','Created at',"Organization Title",'Requester',"Requester email","External ID",'Printers']]
merge2_df = pd.merge(merge1_df, shipment_df, on=['Printers'], how='left')
merge1_df looks as expected (some values are supposed to be NaN):
Ticket Id Created at Organization Title Requester Requester email External ID Printers
0 1 2014-08-21 18:19 NaN dude dude@dude.com 0 NaN
1 2 2014-09-09 12:04 NaN dude1 duke1@dude.com 0 NaN
2 3 2014-09-09 12:04 NaN dude2 duke2@dude.com 0 NaN
3 4 2014-09-09 12:04 NaN dude3 duke3@dude.com 0 NaN
merge2_df contains thousands of duplicate rows:
Ticket Id Created at Organization Title Requester Requester email External ID Printers
0 1 2014-08-21 18:19 NaN dude dude@dude.com 0 NaN
1 1 2014-08-21 18:19 NaN dude dude@dude.com 0 NaN
2 1 2014-08-21 18:19 NaN dude dude@dude.com 0 NaN
3 1 2014-08-21 18:19 NaN dude dude@dude.com 0 NaN
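Any ideas what is messing up merge2_df?
Answer 0 (score: 0)
The problem was the NaN values in the shipment_df DataFrame. pandas merge treats NaN keys as equal to each other, so every merge1_df row with NaN in Printers was matched against every shipment_df row whose Printers value was also NaN. I added the following to replace NaN with 0, and the duplicate entries in merge2_df were resolved: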
shipment_df.fillna(0, inplace=True)
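To see why the NaN keys multiply rows, here is a minimal sketch that reproduces the effect on toy data. The frames `left` and `right` and all of their values are made up for illustration; they only stand in for merge1_df and shipment_df. It also shows one possible alternative to `fillna(0)`, assuming shipment rows without a printer ID are not needed: drop them from the right-hand frame before merging.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for merge1_df and shipment_df (values are invented).
left = pd.DataFrame({"Ticket Id": [1, 2], "Printers": [np.nan, "P-100"]})
right = pd.DataFrame({
    "VAR": ["a", "b", "c", "d"],
    "Printers": [np.nan, np.nan, np.nan, "P-100"],
})

# NaN keys match each other, so ticket 1 is repeated once per NaN row in `right`.
dup = pd.merge(left, right, on="Printers", how="left")
print(len(dup))  # 4 rows: three for ticket 1, one for ticket 2

# Alternative to fillna(0): drop right-hand rows that have no key at all.
clean = pd.merge(left, right.dropna(subset=["Printers"]), on="Printers", how="left")
print(len(clean))  # 2 rows: one per ticket
```

Comparing `len()` of the merged frame with `len()` of the left frame, as above, is a quick way to check whether a left merge has duplicated rows.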