I am trying to do a left join on two DataFrames in Python / pandas, and I can't get it to work :-( Here is the test I wrote to reach my goal:
import pandas as pd

# Note: union() counts every label found in either index; intersection() would count only the shared ones.
print("nb of common indexes=%s" % len(set(df1.index).union(set(df2.index))))
print("nb of distinct value on specific col to merge : df1 = ", df1[col_df1].value_counts().size)
print("nb of distinct value on specific col to merge : df2 = ", df2[col_df2].value_counts().size)
print("Expected size = df1 = ", df1[col_df1].value_counts().size)
print("df1= ", df1.shape)
print("df2= ", df2.shape)
new_df = pd.merge(df1, df2, left_on=col_df1, right_on=col_df2, how='left')
print("new_df / left = ", new_df.shape)
new_df = pd.merge(df1, df2, left_on=col_df1, right_on=col_df2, how='right')
print("new_df / right = ", new_df.shape)
new_df = pd.merge(df1, df2, left_on=col_df1, right_on=col_df2, how='right', right_index=True)
print("new_df / right index = ", new_df.shape)
new_df = pd.merge(df1, df2, left_on=col_df1, right_on=col_df2, how='right', left_index=True)
print("new_df / left index = ", new_df.shape)
new_df = pd.merge(df1, df2, left_on=col_df1, right_on=col_df2, how='right', left_index=True, right_index=True)
print("new_df / right @ left index = ", new_df.shape)
Result:
nb of common indexes=1147
nb of distinct value on specific col to merge : df1 = 848
nb of distinct value on specific col to merge : df2 = 1147
Expected size = df1 = 848
df1= (9999, 53)
df2= (1867, 19)
new_df / left = (18582, 72)
new_df / right = (18913, 72)
new_df / right index = (18913, 72)
new_df / left index = (18913, 72)
I can't find the right combination to get back just the 848 rows I expect from my left df... Does anyone see the mistake?
Edit:
new_df = pd.merge(df1, df2, left_on=col_df1, right_on=col_df2, how='left', right_index=True)
print("new_df / right index = ", new_df.shape)
new_df = pd.merge(df1, df2, left_on=col_df1, right_on=col_df2, how='left', left_index=True)
print("new_df / left index = ", new_df.shape)
new_df = pd.merge(df1, df2, left_on=col_df1, right_on=col_df2, how='left', left_index=True, right_index=True)
print("new_df / right @ left index = ", new_df.shape)
This gives:
new_df / right index = (18582, 72)
new_df / left index = (18582, 72)
new_df / right @ left index = (18582, 72)
Still not the value I want :-(
Answer 0 (score: 0)
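There is no error in this code: according to the documentation (actually O'Reilly's "Python for Data Analysis", pp. 179-180), a many-to-many left join produces the Cartesian product of the matching rows. Each left row is repeated once for every right row that shares its key, so the result can have far more rows than the left DataFrame.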
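The row multiplication described above can be reproduced on a small scale. This is only a minimal sketch: the frames `left` and `right` and the column names `k`, `a` and `b` are made up for illustration and are not the asker's data.

```python
import pandas as pd

# Hypothetical toy frames; "k" is the join key and appears several times on both sides.
left = pd.DataFrame({"k": ["x", "x", "y", "z"], "a": [1, 2, 3, 4]})
right = pd.DataFrame({"k": ["x", "x", "x", "y"], "b": [10, 20, 30, 40]})

merged = pd.merge(left, right, on="k", how="left")

# "x": 2 left rows * 3 right rows = 6 rows
# "y": 1 left row  * 1 right row  = 1 row
# "z": no match on the right, kept once with NaN in "b"
print(merged.shape)  # (8, 3)
```

The left frame has 4 rows but the left join returns 8, which is the same effect as going from 9999 rows to 18582 in the question.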
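(I have been programming a lot in Pig lately, where the behavior is different: only the rows from the left side are kept. Or maybe I am mixing things up :-()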
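If the goal is to end up with exactly one output row per left-hand row, one possible approach is to make the right-hand keys unique before merging. The sketch below assumes that keeping the first right-hand row per key is acceptable, which may not hold for the real data; the frames and the column names `k`, `a` and `b` are hypothetical.

```python
import pandas as pd

# Hypothetical toy frames with duplicated keys on both sides.
left = pd.DataFrame({"k": ["x", "x", "y", "z"], "a": [1, 2, 3, 4]})
right = pd.DataFrame({"k": ["x", "x", "x", "y"], "b": [10, 20, 30, 40]})

# Diagnose: any key that occurs more than once on the right multiplies rows.
dup_counts = right["k"].value_counts()
print(dup_counts[dup_counts > 1])

# Assumption: the first right-hand row per key is the one to keep.
right_unique = right.drop_duplicates(subset="k", keep="first")

# validate="many_to_one" makes pandas raise a MergeError if the right keys are not unique.
result = pd.merge(left, right_unique, on="k", how="left", validate="many_to_one")

print(len(result) == len(left))
```

With unique keys on the right, a left join keeps the row count of the left frame; aggregating the right side with `groupby` instead of dropping duplicates would be another option when no single row is the obvious one to keep.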