I have the following DataFrames:
print(dfa)
ID Value
AA12 101 BB101 CC01 DE06 1
AA11 102 BB101 CC01 234 EE07 2
AA10 202 BB101 CC01 345 EE09 3
AA13 103 BB101 CC02 123 4
AA14 203 BB101 CC02 456 5
AA15 204 BB102 CC03 567 6
print(dfb)
ID Value
AA10 202 BB101 CC01 EE09 345 3
AA11 102 BB101 CC01 EE07 234 2
AA12 101 BB101 CC01 DE06 1
AA13 103 BB101 CC02 123 4
AA18 203 BB103 CC01 456 5
AA15 204 BB201 CC11 678 7
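I want to compare the strings in (dfa.ID, dfa.Value) with the strings in (dfb.ID, dfb.Value). If they match exactly (even when the substrings appear in a different order), I want to print "Yes" in new 'ID Matched?' and 'Value Matched?' columns in the DataFrame 'dfa', and "No" otherwise.
The desired output would be: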
ID Value ID Matched? Value Matched?
AA12 101 BB101 CC01 DE06 1 Yes Yes
AA11 102 BB101 CC01 234 EE07 2 Yes Yes
AA10 202 BB101 CC01 345 EE09 3 Yes Yes
AA13 103 BB101 CC02 123 4 Yes Yes
AA14 203 BB101 CC02 456 5 No Yes
AA15 204 BB102 CC03 567 6 No No
Answer 0 (score: 1)
You can do something like this (here a and b stand for dfa and dfb):
In [40]: pd.merge(a.assign(x=a.ID.str.split().apply(sorted).str.join(' ')),
...: b.assign(x=b.ID.str.split().apply(sorted).str.join(' ')),
...: on=['x','Value'],
...: how='outer',
...: indicator=True)
...:
Out[40]:
ID_x Value x \
0 AA12 101 BB101 CC01 DE06 1 101 AA12 BB101 CC01 DE06
1 AA11 102 BB101 CC01 234 EE07 2 102 234 AA11 BB101 CC01 EE07
2 AA10 202 BB101 CC01 345 EE09 3 202 345 AA10 BB101 CC01 EE09
3 AA13 103 BB101 CC02 123 4 103 123 AA13 BB101 CC02
4 AA14 203 BB101 CC02 456 5 203 456 AA14 BB101 CC02
5 AA15 204 BB102 CC03 567 6 204 567 AA15 BB102 CC03
6 NaN 5 203 456 AA18 BB103 CC01
7 NaN 7 204 678 AA15 BB201 CC11
ID_y _merge
0 AA12 101 BB101 CC01 DE06 both
1 AA11 102 BB101 CC01 EE07 234 both
2 AA10 202 BB101 CC01 EE09 345 both
3 AA13 103 BB101 CC02 123 both
4 NaN left_only
5 NaN left_only
6 AA18 203 BB103 CC01 456 right_only
7 AA15 204 BB201 CC11 678 right_only
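For readers who want to run the merge themselves, here is a minimal self-contained sketch. It assumes the frames are rebuilt by hand from the printed data in the question, and the helper name `normalize` is hypothetical (it is not part of pandas). It uses `how='left'` instead of `how='outer'` so that only the rows of `dfa` are kept; the `_merge` indicator then tells whether the (normalized ID, Value) pair was found in `dfb`.

```python
import pandas as pd

# Data rebuilt by hand from the question's printed frames (an assumption).
dfa = pd.DataFrame({
    "ID": [
        "AA12 101 BB101 CC01 DE06",
        "AA11 102 BB101 CC01 234 EE07",
        "AA10 202 BB101 CC01 345 EE09",
        "AA13 103 BB101 CC02 123",
        "AA14 203 BB101 CC02 456",
        "AA15 204 BB102 CC03 567",
    ],
    "Value": [1, 2, 3, 4, 5, 6],
})
dfb = pd.DataFrame({
    "ID": [
        "AA10 202 BB101 CC01 EE09 345",
        "AA11 102 BB101 CC01 EE07 234",
        "AA12 101 BB101 CC01 DE06",
        "AA13 103 BB101 CC02 123",
        "AA18 203 BB103 CC01 456",
        "AA15 204 BB201 CC11 678",
    ],
    "Value": [3, 2, 1, 4, 5, 7],
})


def normalize(ids):
    """Hypothetical helper: sort the space-separated tokens of each ID."""
    return ids.str.split().apply(sorted).str.join(" ")


# A left merge keeps one row per dfa row (the keys in dfb are unique here).
merged = pd.merge(
    dfa.assign(x=normalize(dfa["ID"])),
    dfb.assign(x=normalize(dfb["ID"])),
    on=["x", "Value"],
    how="left",
    indicator=True,
)

# True where both the normalized ID and the Value were found together in dfb.
pair_found = merged["_merge"] == "both"
print(merged[["ID_x", "Value", "_merge"]])
```

Note that this merge matches on the ID and the Value jointly, so it cannot by itself report a row whose Value exists in `dfb` but whose ID does not.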
Explanation:
In [43]: a.ID.str.split()
Out[43]:
0 [AA12, 101, BB101, CC01, DE06]
1 [AA11, 102, BB101, CC01, 234, EE07]
2 [AA10, 202, BB101, CC01, 345, EE09]
3 [AA13, 103, BB101, CC02, 123]
4 [AA14, 203, BB101, CC02, 456]
5 [AA15, 204, BB102, CC03, 567]
Name: ID, dtype: object
In [44]: a.ID.str.split().apply(sorted)
Out[44]:
0 [101, AA12, BB101, CC01, DE06]
1 [102, 234, AA11, BB101, CC01, EE07]
2 [202, 345, AA10, BB101, CC01, EE09]
3 [103, 123, AA13, BB101, CC02]
4 [203, 456, AA14, BB101, CC02]
5 [204, 567, AA15, BB102, CC03]
Name: ID, dtype: object
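The same idea can be seen on a single pair of strings in plain Python. This is only an illustrative sketch; the function name `order_insensitive_key` is made up for the example. Splitting on whitespace and sorting the tokens gives a canonical form, so two IDs with the same tokens in a different order produce the same key.

```python
def order_insensitive_key(text):
    """Return the whitespace-separated tokens of `text`, sorted and re-joined."""
    return " ".join(sorted(text.split()))


# Row 1 of dfa and the corresponding row of dfb differ only in token order.
key_a = order_insensitive_key("AA11 102 BB101 CC01 234 EE07")
key_b = order_insensitive_key("AA11 102 BB101 CC01 EE07 234")

print(key_a)           # 102 234 AA11 BB101 CC01 EE07
print(key_a == key_b)  # True
```

Because the tokens are sorted rather than turned into a set, repeated tokens are preserved, so "A A B" and "A B" would still be treated as different IDs.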
In [45]: a.assign(x=a.ID.str.split().apply(sorted).str.join(' '))
Out[45]:
ID Value x
0 AA12 101 BB101 CC01 DE06 1 101 AA12 BB101 CC01 DE06
1 AA11 102 BB101 CC01 234 EE07 2 102 234 AA11 BB101 CC01 EE07
2 AA10 202 BB101 CC01 345 EE09 3 202 345 AA10 BB101 CC01 EE09
3 AA13 103 BB101 CC02 123 4 103 123 AA13 BB101 CC02
4 AA14 203 BB101 CC02 456 5 203 456 AA14 BB101 CC02
5 AA15 204 BB102 CC03 567 6 204 567 AA15 BB102 CC03
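The merge above stops short of the two Yes/No columns the question asks for. One possible way to get them, sketched here under the assumption that the ID and the Value should each be checked independently against `dfb` (which is what the desired output suggests), is to reuse the normalized key with `Series.isin`. The helper name `normalize` is hypothetical and the data is rebuilt by hand from the question.

```python
import numpy as np
import pandas as pd

# Data rebuilt by hand from the question's printed frames (an assumption).
dfa = pd.DataFrame({
    "ID": [
        "AA12 101 BB101 CC01 DE06",
        "AA11 102 BB101 CC01 234 EE07",
        "AA10 202 BB101 CC01 345 EE09",
        "AA13 103 BB101 CC02 123",
        "AA14 203 BB101 CC02 456",
        "AA15 204 BB102 CC03 567",
    ],
    "Value": [1, 2, 3, 4, 5, 6],
})
dfb = pd.DataFrame({
    "ID": [
        "AA10 202 BB101 CC01 EE09 345",
        "AA11 102 BB101 CC01 EE07 234",
        "AA12 101 BB101 CC01 DE06",
        "AA13 103 BB101 CC02 123",
        "AA18 203 BB103 CC01 456",
        "AA15 204 BB201 CC11 678",
    ],
    "Value": [3, 2, 1, 4, 5, 7],
})


def normalize(ids):
    """Hypothetical helper: sort the space-separated tokens of each ID."""
    return ids.str.split().apply(sorted).str.join(" ")


# ID check: does the order-insensitive key appear anywhere in dfb?
id_found = normalize(dfa["ID"]).isin(normalize(dfb["ID"]))
# Value check: does the Value appear anywhere in dfb, regardless of ID?
value_found = dfa["Value"].isin(dfb["Value"])

dfa["ID Matched?"] = np.where(id_found, "Yes", "No")
dfa["Value Matched?"] = np.where(value_found, "Yes", "No")
print(dfa)
```

On this data the result agrees with the desired output in the question: the first four rows are Yes/Yes, the AA14 row is No/Yes, and the AA15 row is No/No. If a match should instead require the ID and the Value to appear in the same row of `dfb`, the merge with `indicator=True` is the better fit.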