所以我想在两个表上包含一个包含两个表的键字段的连接,所以我可以进行并排比较。
表A有1164行,表B有74行。表A中的公共字段称为EmployeeID,表B中的“相同”字段称为UserID。
我希望输出有3种形式:
解决此问题的最佳方法是什么?
当我使用此代码时:
data_left_join = pd.merge(table_a, table_b, how='left')
我得到了48268行。
我能找到的所有文章都是SQL或R.
我设法导入表并对表进行一些修改。但我被困在这里。
提前感谢。
答案 0 :(得分:0)
您可以将merge()
与left_on
和right_on
参数一起使用来指定您的密钥字段,然后检查NaN值以查找哪些行不在表中:
import pandas as pd
# Create dataframes to test with
table_a = pd.DataFrame({
"value": [1, 2, 3, 4, 5],
"employee_id": [100, 200, 300, 400, 500]
})
print "Table A:\n", table_a
table_b = pd.DataFrame({
"value": [1, 2, 3, 4, 5],
"user_id": [100, 200, 300, 1000, 2000],
"age": [40, 50, 60, 70, 80]
})
print "\nTable B:\n", table_b
# Merge table A (left) on employee_id, and table B (right) on user_id
merged = table_a.merge(table_b, left_on="employee_id", right_on="user_id", how="outer", suffixes=("_tableA", "_tableB"))
print "\nMerged:\n", merged
# Table A-columns with NaNs are not present in table B
only_in_table_a = merged.loc[merged.value_tableB.isnull()]
print "\nOnly in table A:\n", only_in_table_a
# Table B-columns with NaNs are not present in table A
only_in_table_b = merged.loc[merged.value_tableA.isnull()]
print "\nOnly in table B:\n", only_in_table_b
# Rows with no NaNs are in both tables
in_both = merged.dropna(subset=["employee_id", "user_id"])
print "\nIn both:\n", in_both
哪个收益率:
Table A:
employee_id value
0 100 1
1 200 2
2 300 3
3 400 4
4 500 5
Table B:
age user_id value
0 40 100 1
1 50 200 2
2 60 300 3
3 70 1000 4
4 80 2000 5
Merged:
employee_id value_tableA age user_id value_tableB
0 100.0 1.0 40.0 100.0 1.0
1 200.0 2.0 50.0 200.0 2.0
2 300.0 3.0 60.0 300.0 3.0
3 400.0 4.0 NaN NaN NaN
4 500.0 5.0 NaN NaN NaN
5 NaN NaN 70.0 1000.0 4.0
6 NaN NaN 80.0 2000.0 5.0
Only in table A:
employee_id value_tableA age user_id value_tableB
3 400.0 4.0 NaN NaN NaN
4 500.0 5.0 NaN NaN NaN
Only in table B:
employee_id value_tableA age user_id value_tableB
5 NaN NaN 70.0 1000.0 4.0
6 NaN NaN 80.0 2000.0 5.0
In both:
employee_id value_tableA age user_id value_tableB
0 100.0 1.0 40.0 100.0 1.0
1 200.0 2.0 50.0 200.0 2.0
2 300.0 3.0 60.0 300.0 3.0