Pandas: compare two DataFrames and identify matching values

Time: 2017-01-26 22:22:51

Tags: python pandas dataframe string-matching

I have the following DataFrames:

print(dfa)

ID                             Value
AA12 101 BB101 CC01 DE06       1
AA11 102 BB101 CC01 234 EE07   2
AA10 202 BB101 CC01 345 EE09   3
AA13 103 BB101 CC02 123        4
AA14 203 BB101 CC02 456        5
AA15 204 BB102 CC03 567        6


print(dfb)

ID                             Value
AA10 202 BB101 CC01 EE09 345   3
AA11 102 BB101 CC01 EE07 234   2
AA12 101 BB101 CC01 DE06       1
AA13 103 BB101 CC02 123        4
AA18 203 BB103 CC01 456        5
AA15 204 BB201 CC11 678        7

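I want to compare the strings in (dfa.ID, dfa.Value) with the strings in (dfb.ID, dfb.Value). If they match exactly (even if the substrings are in a different order), I want to put 'Yes' in new 'ID Matched?' and 'Value Matched?' columns in the DataFrame 'dfa'.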

The desired output would be:

ID                             Value   ID Matched?   Value Matched?
AA12 101 BB101 CC01 DE06       1       Yes           Yes 
AA11 102 BB101 CC01 234 EE07   2       Yes           Yes
AA10 202 BB101 CC01 345 EE09   3       Yes           Yes
AA13 103 BB101 CC02 123        4       Yes           Yes
AA14 203 BB101 CC02 456        5       No            Yes
AA15 204 BB102 CC03 567        6       No            No

1 Answer:

Answer 0 (score: 1)

You can do something like this (here `a` and `b` stand for `dfa` and `dfb`):

In [40]: pd.merge(a.assign(x=a.ID.str.split().apply(sorted).str.join(' ')),
    ...:          b.assign(x=b.ID.str.split().apply(sorted).str.join(' ')),
    ...:          on=['x','Value'],
    ...:          how='outer',
    ...:          indicator=True)
    ...:
Out[40]:
                           ID_x  Value                             x  \
0      AA12 101 BB101 CC01 DE06      1      101 AA12 BB101 CC01 DE06
1  AA11 102 BB101 CC01 234 EE07      2  102 234 AA11 BB101 CC01 EE07
2  AA10 202 BB101 CC01 345 EE09      3  202 345 AA10 BB101 CC01 EE09
3       AA13 103 BB101 CC02 123      4       103 123 AA13 BB101 CC02
4       AA14 203 BB101 CC02 456      5       203 456 AA14 BB101 CC02
5       AA15 204 BB102 CC03 567      6       204 567 AA15 BB102 CC03
6                           NaN      5       203 456 AA18 BB103 CC01
7                           NaN      7       204 678 AA15 BB201 CC11

                           ID_y      _merge
0      AA12 101 BB101 CC01 DE06        both
1  AA11 102 BB101 CC01 EE07 234        both
2  AA10 202 BB101 CC01 EE09 345        both
3       AA13 103 BB101 CC02 123        both
4                           NaN   left_only
5                           NaN   left_only
6       AA18 203 BB103 CC01 456  right_only
7       AA15 204 BB201 CC11 678  right_only
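
The `_merge` column above can be turned into a per-row flag for `a`. The following is only a sketch under some assumptions: it switches to `how='left'` so that only the rows of `a` are kept, it drops duplicate keys from `b` so the left merge cannot multiply rows, and the helper `sorted_tokens` and the column name `Row Matched?` are made up for illustration. Note that this flags rows where the ID and the Value match together, which is stricter than the two separate columns asked for in the question.

```python
import pandas as pd

a = pd.DataFrame({
    "ID": ["AA12 101 BB101 CC01 DE06", "AA11 102 BB101 CC01 234 EE07",
           "AA10 202 BB101 CC01 345 EE09", "AA13 103 BB101 CC02 123",
           "AA14 203 BB101 CC02 456", "AA15 204 BB102 CC03 567"],
    "Value": [1, 2, 3, 4, 5, 6],
})
b = pd.DataFrame({
    "ID": ["AA10 202 BB101 CC01 EE09 345", "AA11 102 BB101 CC01 EE07 234",
           "AA12 101 BB101 CC01 DE06", "AA13 103 BB101 CC02 123",
           "AA18 203 BB103 CC01 456", "AA15 204 BB201 CC11 678"],
    "Value": [3, 2, 1, 4, 5, 7],
})

def sorted_tokens(ids: pd.Series) -> pd.Series:
    """Hypothetical helper: split on whitespace, sort the tokens, re-join."""
    return ids.str.split().apply(sorted).str.join(" ")

# Keys from b only; duplicates removed so each row of a appears exactly once.
b_keys = b.assign(x=sorted_tokens(b["ID"]))[["x", "Value"]].drop_duplicates()

merged = pd.merge(
    a.assign(x=sorted_tokens(a["ID"])),
    b_keys,
    on=["x", "Value"],
    how="left",        # keep every row of a, in a's order
    indicator=True,    # adds the _merge column ('both' or 'left_only')
)

# 'both' means the normalized ID and the Value were found together in b.
merged["Row Matched?"] = (merged["_merge"] == "both").map({True: "Yes", False: "No"})
print(merged[["ID", "Value", "Row Matched?"]])
```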

Explanation:

In [43]: a.ID.str.split()
Out[43]:
0         [AA12, 101, BB101, CC01, DE06]
1    [AA11, 102, BB101, CC01, 234, EE07]
2    [AA10, 202, BB101, CC01, 345, EE09]
3          [AA13, 103, BB101, CC02, 123]
4          [AA14, 203, BB101, CC02, 456]
5          [AA15, 204, BB102, CC03, 567]
Name: ID, dtype: object

In [44]: a.ID.str.split().apply(sorted)
Out[44]:
0         [101, AA12, BB101, CC01, DE06]
1    [102, 234, AA11, BB101, CC01, EE07]
2    [202, 345, AA10, BB101, CC01, EE09]
3          [103, 123, AA13, BB101, CC02]
4          [203, 456, AA14, BB101, CC02]
5          [204, 567, AA15, BB102, CC03]
Name: ID, dtype: object

In [45]: a.assign(x=a.ID.str.split().apply(sorted).str.join(' '))
Out[45]:
                             ID  Value                             x
0      AA12 101 BB101 CC01 DE06      1      101 AA12 BB101 CC01 DE06
1  AA11 102 BB101 CC01 234 EE07      2  102 234 AA11 BB101 CC01 EE07
2  AA10 202 BB101 CC01 345 EE09      3  202 345 AA10 BB101 CC01 EE09
3       AA13 103 BB101 CC02 123      4       103 123 AA13 BB101 CC02
4       AA14 203 BB101 CC02 456      5       203 456 AA14 BB101 CC02
5       AA15 204 BB102 CC03 567      6       204 567 AA15 BB102 CC03
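
To get the two separate columns from the question, one possible approach reuses the same sorted-token normalization but checks membership with `isin` instead of merging. This is a sketch, not part of the original answer: it assumes that "matched" means "the value occurs anywhere in `dfb`" (which is what the desired output suggests, since the row with Value 5 has a non-matching ID but a matching Value), and the helper name `sorted_tokens` is hypothetical.

```python
import pandas as pd

dfa = pd.DataFrame({
    "ID": ["AA12 101 BB101 CC01 DE06", "AA11 102 BB101 CC01 234 EE07",
           "AA10 202 BB101 CC01 345 EE09", "AA13 103 BB101 CC02 123",
           "AA14 203 BB101 CC02 456", "AA15 204 BB102 CC03 567"],
    "Value": [1, 2, 3, 4, 5, 6],
})
dfb = pd.DataFrame({
    "ID": ["AA10 202 BB101 CC01 EE09 345", "AA11 102 BB101 CC01 EE07 234",
           "AA12 101 BB101 CC01 DE06", "AA13 103 BB101 CC02 123",
           "AA18 203 BB103 CC01 456", "AA15 204 BB201 CC11 678"],
    "Value": [3, 2, 1, 4, 5, 7],
})

def sorted_tokens(ids: pd.Series) -> pd.Series:
    """Hypothetical helper: order-insensitive key built from the ID tokens."""
    return ids.str.split().apply(sorted).str.join(" ")

yes_no = {True: "Yes", False: "No"}

result = dfa.copy()
# An ID matches if its normalized form occurs among the normalized IDs of dfb.
result["ID Matched?"] = sorted_tokens(dfa["ID"]).isin(sorted_tokens(dfb["ID"])).map(yes_no)
# A Value matches if it occurs anywhere in dfb.Value (checked independently of the ID).
result["Value Matched?"] = dfa["Value"].isin(dfb["Value"]).map(yes_no)
print(result)
```

With the sample data this yields the Yes/No pattern shown in the question's desired output. If the Value should only count as matched on the row where the ID also matches, a merge on both keys would be the better fit.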