pandas: how to merge two dataframes (looks simple, well ;)

Time: 2015-05-10 16:21:31

Tags: python pandas dataframe

I am trying to do a left join on two dataframes in python / pandas. I can't get it to work :-( Here is the test I wrote to try to reach my goal:

# Sanity checks before merging.
# Note: union() counts labels present in either index; intersection() would count only the shared ones.
print "nb of common indexes=%s"%(len(set(df1.index).union(set(df2.index))))
print "nb of distinct value on specific col to merge : df1 = ", df1[col_df1].value_counts().size
print "nb of distinct value on specific col to merge : df2 = ", df2[col_df2].value_counts().size
print "Expected size = df1 = ", df1[col_df1].value_counts().size
print "df1= ", df1.shape
print "df2= ", df2.shape
# Left and right joins on the key columns.
new_df = pd.merge(df1, df2, left_on=col_df1, right_on=col_df2, how='left')
print "new_df / left = ", new_df.shape
new_df = pd.merge(df1, df2, left_on=col_df1, right_on=col_df2, how='right')
print "new_df / right = ", new_df.shape
# Right joins again, additionally passing right_index / left_index.
new_df = pd.merge(df1, df2, left_on=col_df1, right_on=col_df2, how='right', right_index=True)
print "new_df / right index = ", new_df.shape
new_df = pd.merge(df1, df2, left_on=col_df1, right_on=col_df2, how='right', left_index=True)
print "new_df / left index = ", new_df.shape
new_df = pd.merge(df1, df2, left_on=col_df1, right_on=col_df2, how='right', left_index=True, right_index=True)
print "new_df / right @ left index = ", new_df.shape

Result:

nb of common indexes=1147
nb of distinct value on specific col to merge : df1 =  848
nb of distinct value on specific col to merge : df2 =  1147
Expected size = df1 =  848
df1=  (9999, 53)
df2=  (1867, 19)
new_df / left =  (18582, 72)
new_df / right =  (18913, 72)
new_df / right index =  (18913, 72)
new_df / left index =  (18913, 72)

I can't find the right combination to get back only the 848 rows of my left df... Does anyone see the mistake?

Edit:

new_df = pd.merge(df1, df2, left_on=col_df1, right_on=col_df2, how='left', right_index=True)
print "new_df / right index = ", new_df.shape
new_df = pd.merge(df1, df2, left_on=col_df1, right_on=col_df2, how='left', left_index=True)
print "new_df / left index = ", new_df.shape
new_df = pd.merge(df1, df2, left_on=col_df1, right_on=col_df2, how='left', left_index=True, right_index=True)
print "new_df / right @ left index = ", new_df.shape

gives:

new_df / right index =  (18582, 72)
new_df / left index =  (18582, 72)
new_df / right @ left index =  (18582, 72)

Still not the desired value :-(

1 answer:

Answer 0 (score: 0)

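There is no error in this code: according to the documentation (actually O'Reilly's "Python for Data Analysis", p. 179/180), a many-to-many left join produces, for each key, the Cartesian product of the matching rows. That is why the merged frame has more rows than df1.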
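To make the row multiplication concrete, here is a minimal sketch on toy data. The frames `left` and `right` and their column names are made up for illustration and are not the asker's df1 / df2; the code uses Python 3 syntax.

```python
import pandas as pd

# Hypothetical toy frames; names and values are illustrative only.
left = pd.DataFrame({"key": ["a", "a", "b", "c"], "lval": [1, 2, 3, 4]})
right = pd.DataFrame({"key": ["a", "a", "a", "b"], "rval": [10, 20, 30, 40]})

merged = pd.merge(left, right, on="key", how="left")

# "a": 2 left rows x 3 right rows = 6 rows
# "b": 1 left row  x 1 right row  = 1 row
# "c": no match on the right, kept once with NaN in rval
print(merged.shape)  # (8, 3)

# The row count of a left join can be predicted from the key counts:
# each left row is repeated once per matching right row (at least once).
right_counts = right["key"].value_counts()
expected_rows = int(left["key"].map(right_counts).fillna(1).sum())
print(expected_rows)  # 8
```

The same counting applied to the real frames should explain why the left join returns far more rows than df1 has, since duplicated keys on both sides multiply.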

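(I have been programming in Pig a lot lately, and the behavior there is not the same: only the rows of the left side are kept. Or maybe I am mixing things up :-()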
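If the aim is to keep exactly one output row per left row (or per distinct key), one possible approach is to drop duplicate keys before merging. This is only a sketch under assumptions: the toy frames below are hypothetical, and keeping the first match per key is an arbitrary choice that may not suit the real data.

```python
import pandas as pd

# Hypothetical toy frames; names and values are illustrative only.
left = pd.DataFrame({"key": ["a", "a", "b", "c"], "lval": [1, 2, 3, 4]})
right = pd.DataFrame({"key": ["a", "a", "a", "b"], "rval": [10, 20, 30, 40]})

# Assumption: any single match per key is acceptable, so keep the first one.
right_unique = right.drop_duplicates(subset="key", keep="first")

# validate="many_to_one" makes pandas raise a MergeError if the right keys
# are not unique, so the row count cannot silently grow.
per_left_row = pd.merge(left, right_unique, on="key", how="left",
                        validate="many_to_one")
print(len(per_left_row) == len(left))  # True

# One row per distinct key: deduplicate the left side as well.
per_key = pd.merge(left.drop_duplicates(subset="key"), right_unique,
                   on="key", how="left", validate="one_to_one")
print(len(per_key) == left["key"].nunique())  # True
```

The second merge is the one that corresponds to an expected size equal to the number of distinct key values, such as the 848 mentioned in the question.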