Joining tables of different lengths in Python based on a key field

Time: 2018-02-19 13:04:59

Tags: python pandas join inner-join

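I want to join two tables on a key field that both tables contain, so that I can do a side-by-side comparison.

Table A has 1164 rows and Table B has 74 rows. The common field is called EmployeeID in Table A, and the "same" field is called UserID in Table B.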

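I would like the output in 3 forms:

  1. Table 1, containing the records whose key field value is found only in Table A. (UNMATCHED LEFT)
  2. Table 2, listing the matched records, i.e. the key field value is found in both Table A and Table B. (INNER JOIN)
  3. Table 3, containing the records found only in Table B. (UNMATCHED RIGHT)

What is the best way to approach this?

When I use this code:

    data_left_join = pd.merge(table_a, table_b, how='left')

I get 48268 rows.

All the articles I can find are for SQL or R.

I have managed to import the tables and make some modifications to them, but I am stuck here.

Thanks in advance.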

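1 answer:

Answer 0 (score: 0)

You can use merge() with the left_on and right_on parameters to specify your key fields, then check for NaN values to find which rows are missing from each table: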

import pandas as pd

# Create dataframes to test with
table_a = pd.DataFrame({
    "value": [1, 2, 3, 4, 5],
    "employee_id": [100, 200, 300, 400, 500]
})
print("Table A:", table_a, sep="\n")

table_b = pd.DataFrame({
    "value": [1, 2, 3, 4, 5],
    "user_id": [100, 200, 300, 1000, 2000],
    "age": [40, 50, 60, 70, 80]
})
print("\nTable B:", table_b, sep="\n")

# Outer-merge table A (left, keyed on employee_id) with table B (right, keyed on user_id)
merged = table_a.merge(table_b, left_on="employee_id", right_on="user_id", how="outer", suffixes=("_tableA", "_tableB"))
print("\nMerged:", merged, sep="\n")

# Rows whose table B columns are NaN exist only in table A
only_in_table_a = merged.loc[merged.value_tableB.isnull()]
print("\nOnly in table A:", only_in_table_a, sep="\n")

# Rows whose table A columns are NaN exist only in table B
only_in_table_b = merged.loc[merged.value_tableA.isnull()]
print("\nOnly in table B:", only_in_table_b, sep="\n")

# Rows where both key columns are non-NaN are in both tables
in_both = merged.dropna(subset=["employee_id", "user_id"])
print("\nIn both:", in_both, sep="\n")

Which yields (the column order may differ depending on your pandas version):

Table A:
   employee_id  value
0          100      1
1          200      2
2          300      3
3          400      4
4          500      5

Table B:
   age  user_id  value
0   40      100      1
1   50      200      2
2   60      300      3
3   70     1000      4
4   80     2000      5

Merged:
   employee_id  value_tableA   age  user_id  value_tableB
0        100.0           1.0  40.0    100.0           1.0
1        200.0           2.0  50.0    200.0           2.0
2        300.0           3.0  60.0    300.0           3.0
3        400.0           4.0   NaN      NaN           NaN
4        500.0           5.0   NaN      NaN           NaN
5          NaN           NaN  70.0   1000.0           4.0
6          NaN           NaN  80.0   2000.0           5.0

Only in table A:
   employee_id  value_tableA  age  user_id  value_tableB
3        400.0           4.0  NaN      NaN           NaN
4        500.0           5.0  NaN      NaN           NaN

Only in table B:
   employee_id  value_tableA   age  user_id  value_tableB
5          NaN           NaN  70.0   1000.0           4.0
6          NaN           NaN  80.0   2000.0           5.0

In both:
   employee_id  value_tableA   age  user_id  value_tableB
0        100.0           1.0  40.0    100.0           1.0
1        200.0           2.0  50.0    200.0           2.0
2        300.0           3.0  60.0    300.0           3.0
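
As a possible variation on the NaN checks above, here is a minimal sketch that uses merge()'s indicator and validate parameters instead. It reuses the toy data from the answer; the variable names are the same illustrative ones, and the choice of validate="one_to_one" assumes each key should appear at most once per table, which may not hold for the real EmployeeID and UserID columns.

```python
import pandas as pd

# Same toy data as in the answer above
table_a = pd.DataFrame({
    "value": [1, 2, 3, 4, 5],
    "employee_id": [100, 200, 300, 400, 500],
})
table_b = pd.DataFrame({
    "value": [1, 2, 3, 4, 5],
    "user_id": [100, 200, 300, 1000, 2000],
    "age": [40, 50, 60, 70, 80],
})

# indicator=True adds a "_merge" column holding "left_only", "right_only" or "both".
# validate="one_to_one" (an assumption about the data) makes pandas raise a
# MergeError if a key value is duplicated on either side.
merged = table_a.merge(
    table_b,
    left_on="employee_id",
    right_on="user_id",
    how="outer",
    suffixes=("_tableA", "_tableB"),
    indicator=True,
    validate="one_to_one",
)

# Split the merged frame into the three requested tables
only_in_table_a = merged[merged["_merge"] == "left_only"]   # UNMATCHED LEFT
in_both = merged[merged["_merge"] == "both"]                # INNER JOIN
only_in_table_b = merged[merged["_merge"] == "right_only"]  # UNMATCHED RIGHT

print(len(only_in_table_a), len(in_both), len(only_in_table_b))
```

Unlike the NaN checks, the "_merge" column still works when the data columns themselves contain missing values. The validate check may also help with the row count in the question: a merge that returns far more rows than either input usually means the join columns contain repeated values (when no keys are given, merge() joins on all column names the two tables share), and with validate set the merge would raise an error rather than silently multiplying rows.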