Merging pandas DataFrames duplicates some data

Date: 2016-03-23 01:49:50

Tags: python pandas merge

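Thank you for taking the time to read my post.

I'm using Python pandas to merge information from a number of CSV and TSV files. When I perform the second merge, rows are duplicated in the resulting DataFrame. I assume I'm missing something basic in the merge call, but I haven't been able to figure it out.

Code: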

from pandas import DataFrame, read_csv
import matplotlib.pyplot as plt
import pandas as pd 
import sys
import matplotlib

# Enable inline plotting
%matplotlib inline

# read data into dataframes
ticketdata = r'/pathto.csv'
userdata = r'/pathto.csv'
shipmentdata = r'/pathto.tsv'

tickets_df = pd.read_csv((ticketdata), usecols=['Id',"Requester",'Created at',"Requester email",
                                                "Requester external id"])
users_df = pd.read_csv((userdata), usecols=['External ID','Printers',"Organization Title"])
shipment_df = pd.read_csv((shipmentdata), delimiter='\t', usecols=['Cust','Printer ID'])

# Clean up tickets_df & shipment_df

# Change "Requester external id" to "External ID" to support the merge
tickets_df.columns = ['Ticket Id',"Requester","External ID","Requester email",'Created at']
shipment_df.columns = ['VAR','Printers']
# Change column order for the sake of readability
tickets_df = tickets_df[['Ticket Id','Requester','Created at',"Requester email","External ID"]]

# Replace NaN in External ID with 0 and merge data
tickets_df.fillna(0, inplace=True)
merge1_df = pd.merge(tickets_df, users_df, on=['External ID'], how='left')
merge1_df = merge1_df[['Ticket Id','Created at',"Organization Title",'Requester',"Requester email","External ID",'Printers']]
merge2_df = pd.merge(merge1_df, shipment_df, on=['Printers'], how='left')

merge1_df displays as expected (some values are expected to be NaN):

    Ticket Id   Created at  Organization Title  Requester   Requester email     External ID     Printers
0   1   2014-08-21 18:19    NaN     dude    dude@dude.com   0   NaN
1   2   2014-09-09 12:04    NaN     dude1   duke1@dude.com  0   NaN
2   3   2014-09-09 12:04    NaN     dude2   duke2@dude.com  0   NaN
3   4   2014-09-09 12:04    NaN     dude3   duke3@dude.com  0   NaN

merge2_df contains thousands of duplicate rows:

    Ticket Id   Created at  Organization Title  Requester   Requester email     External ID     Printers
0   1   2014-08-21 18:19    NaN     dude    dude@dude.com   0   NaN
1   1   2014-08-21 18:19    NaN     dude    dude@dude.com   0   NaN
2   1   2014-08-21 18:19    NaN     dude    dude@dude.com   0   NaN
3   1   2014-08-21 18:19    NaN     dude    dude@dude.com   0   NaN

Any ideas what I'm doing wrong with merge2_df?

1 Answer:

Answer 0: (score: 0)

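The problem was the NaN values in the shipment_df DataFrame. pandas treats NaN keys as equal when merging, so every row of merge1_df with NaN in Printers was matched against every row of shipment_df with NaN in Printers. I added the following before the second merge to replace NaN with 0, and the duplicate entries in merge2_df were resolved: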

shipment_df.fillna(0, inplace=True)
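
For anyone who wants to see the effect in isolation, here is a minimal sketch. The two small frames are made-up stand-ins for merge1_df and shipment_df (the values and the numeric printer IDs are assumptions, not the asker's data). It shows NaN keys matching each other, the fill approach from above applied to the key column only, and dropping the NaN-key rows from the right-hand frame as a possible alternative.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-ins for merge1_df and shipment_df.
left = pd.DataFrame({"Ticket Id": [1, 2],
                     "Printers": [np.nan, 100.0]})
right = pd.DataFrame({"VAR": ["a", "b", "c", "d"],
                      "Printers": [np.nan, np.nan, np.nan, 100.0]})

# NaN keys are matched against each other, so ticket 1 is paired
# with all three NaN rows on the right and appears three times.
dupes = pd.merge(left, right, on="Printers", how="left")

# Fix from the answer, limited to the key column: fill NaN on the
# right side so the NaN key on the left no longer finds a match.
right_filled = right.assign(Printers=right["Printers"].fillna(0))
fixed = pd.merge(left, right_filled, on="Printers", how="left")

# Alternative sketch: drop right-hand rows that have no key at all.
right_keyed = right.dropna(subset=["Printers"])
dropped = pd.merge(left, right_keyed, on="Printers", how="left")

print(len(dupes), len(fixed), len(dropped))
```

Filling with 0 only works as long as 0 is not a real printer ID and the left-hand frame keeps its NaN; dropping the keyless rows avoids inventing a placeholder value, at the cost of discarding those shipment rows.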