I want to merge the rows of the two dataframes below when a string in the Test1 column of DF2 contains a substring from the Test1 column of DF1.
import pandas as pd

DF1 = pd.DataFrame({'Test1':list('ABC'),
                    'Test2':[1,2,3]})
print(DF1)
  Test1  Test2
0     A      1
1     B      2
2     C      3
DF2 = pd.DataFrame({'Test1':['ee','bA','cCc','D'],
                    'Test2':[1,2,3,4]})
print(DF2)
  Test1  Test2
0    ee      1
1    bA      2
2   cCc      3
3     D      4
For this, I am able to use "str contains" to identify the substrings of DF1.Test1 within the strings of DF2.Test1.
INPUT:
for i in DF1.Test1:
    ok = DF2[DF2.Test1.str.contains(i)]
    print(ok)
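OUTPUT:
Now, I would like to add to the output the merge of the DF1.Test1 substrings with the DF2.Test1 strings they match.
OUTPUT:
For this, I tried "pd.merge" and "if", but I still cannot find the right code. Do you have any suggestions?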
for i in DF1.Test1:
    if DF2.Test1.str.contains(i) == 'True':
        ok = pd.merge(DF1, DF2, on= ['Test1'[i]], how='outer')
        print(ok)
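Thanks for your ideas :)
Answer 0: (score: 4)
I can't comment on jezrael's answer because of my reputation. But I have turned his answer into a function that also merges on text that differs in case.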
def str_merge(part_string_df, full_string_df, merge_column):
    # Helper column holding the lower-cased merge key
    merge_column_lower = 'merge_column_lower'
    part_string_df[merge_column_lower] = part_string_df[merge_column].str.lower()
    full_string_df[merge_column_lower] = full_string_df[merge_column].str.lower()
    # Regex alternation of all lower-cased substrings, e.g. 'a|b|c'
    pat = '|'.join(r"{}".format(x) for x in part_string_df[merge_column_lower])
    # First substring found in each full string (NaN when none matches)
    full_string_df['Test3'] = full_string_df[merge_column_lower].str.extract('('+ pat + ')', expand=True)
    DF = pd.merge(part_string_df, full_string_df, left_on= merge_column_lower, right_on='Test3').drop([merge_column_lower + '_x',merge_column_lower + '_y','Test3'],axis=1)
    return DF
Applied to the example:
DF1 = pd.DataFrame({'Test1':list('ABC'),
'Test2':[1,2,3]})
DF2 = pd.DataFrame({'Test1':['ee','bA','cCc','D'],
'Test2':[1,2,3,4]})
print(str_merge(DF1,DF2, 'Test1'))
  Test1_x  Test2_x Test1_y  Test2_y
0       B        2      bA        2
1       C        3     cCc        3
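Note that str_merge adds its helper columns to the dataframes passed in, and that it builds the regex from the raw values. The following is only a sketch of one possible variant, not part of the original answer: the name str_merge_ci and the helper column name key are hypothetical, and it assumes neither input already has a column called key. It escapes the substrings with re.escape and works on copies created by assign.

```python
import re
import pandas as pd

def str_merge_ci(part_df, full_df, column):
    """Hypothetical case-insensitive variant that leaves its inputs untouched."""
    # Escape each value so characters such as '.' or '+' are matched literally.
    pat = '|'.join(re.escape(str(x)) for x in part_df[column])
    # assign() returns new frames, so the helper column 'key' (an assumed name)
    # never appears in the caller's dataframes.
    left = part_df.assign(key=part_df[column].str.lower())
    found = full_df[column].str.extract('(' + pat + ')', flags=re.IGNORECASE, expand=False)
    right = full_df.assign(key=found.str.lower())
    return pd.merge(left, right, on='key').drop(columns='key')

DF1 = pd.DataFrame({'Test1': list('ABC'), 'Test2': [1, 2, 3]})
DF2 = pd.DataFrame({'Test1': ['ee', 'bA', 'cCc', 'D'], 'Test2': [1, 2, 3, 4]})
result = str_merge_ci(DF1, DF2, 'Test1')
print(result)
```

On the example data this should pair B with bA and C with cCc, the same rows as above, while DF1 and DF2 keep only their original two columns.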
Answer 1: (score: 1)
I think you need extract to get the values into a new column, then merge, and finally remove the helper column Test3:
pat = '|'.join(r"{}".format(x) for x in DF1.Test1)
DF2['Test3'] = DF2.Test1.str.extract('('+ pat + ')', expand=False)
DF = pd.merge(DF1, DF2, left_on= 'Test1', right_on='Test3').drop('Test3', axis=1)
print (DF)
  Test1_x  Test2_x Test1_y  Test2_y
0       A        1      bA        2
1       C        3     cCc        3
Detail:
print (DF2)
  Test1  Test2 Test3
0    ee      1   NaN
1    bA      2     A
2   cCc      3     C
3     D      4   NaN
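As the Test3 column shows, extract keeps only the first match per row, so a DF2 string that contained two DF1 values would be paired with just one of them. If that matters, one alternative (a sketch, not part of the answer above) is a cross join followed by a plain substring test. It assumes pandas 1.2 or later for how='cross', and it builds len(DF1) * len(DF2) intermediate rows, so it suits small frames only.

```python
import pandas as pd

DF1 = pd.DataFrame({'Test1': list('ABC'), 'Test2': [1, 2, 3]})
DF2 = pd.DataFrame({'Test1': ['ee', 'bA', 'cCc', 'D'], 'Test2': [1, 2, 3, 4]})

# Pair every DF1 row with every DF2 row (how='cross' needs pandas >= 1.2).
pairs = DF1.merge(DF2, how='cross', suffixes=('_x', '_y'))

# Keep only the pairs where the DF1 value occurs inside the DF2 string.
# Python's `in` is a literal, case-sensitive substring test, so no regex
# escaping is involved.
mask = [part in full for part, full in zip(pairs['Test1_x'], pairs['Test1_y'])]
matches = pairs.loc[mask].reset_index(drop=True)
print(matches)
```

On the example data this should give the same two rows as the extract and merge approach (A with bA, C with cCc).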