I have two different DataFrames, as shown below:
df1
==================================
KEYWORD                      TICKET
Burst of bit errors          89814
sync and stand-by reload     66246
Port sub-modules modelling   70946
wires stop passing traffic   60245
Ignore Net flow              59052

df2
==========================
TEXT_DATA
Burst of bit errors due to
stop passing traffic
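The two frames only match partially, and I need to pair them up on those partial matches. Please help me. This is the code snippet I have developed: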
import pandas as pd

Standard_Data = pd.read_excel('bOOK2.xlsx', usecols=[0, 1])
print(Standard_Data)
# Standard_Data
# ==================================
# KEYWORD                      TICKET
# Burst of bit errors          89814
# sync and stand-by reload     66246
# Port sub-modules modelling   70946
# wires stop passing traffic   60245
# Ignore Net flow              59052
keyword_data = Standard_Data['KEYWORD'].values.tolist()

input_data = pd.read_excel('book1.xlsx', usecols=[1])
print(input_data)
# input_data
# ==========================
# TEXT_DATA
# Burst of bit errors due to
# stop passing traffic

# In short: df1 = Standard_Data, df2 = input_data
sentenced_data = input_data['TEXT_DATA'].values.tolist()
df = pd.DataFrame({'sentenced_data': sentenced_data})
print(df)

# Collect every keyword word that occurs in the sentence
df['MATCHED_KEYWORD'] = df['sentenced_data'].apply(
    lambda x: [w for i in keyword_data for w in i.split(' ') if w in x]
)
df['KEYWORD'] = df['MATCHED_KEYWORD'].apply(','.join)
df['KEYWORD'] = df['KEYWORD'].str.replace(',', ' ')
Z = Standard_Data.merge(df, on='KEYWORD', how='right')
print(Z)
The result I get looks like this:
KEYWORD               TICKET  sentenced_data
Burst of bit errors   NaN     Burst of bit errors due to
stop passing traffic  NaN     stop passing traffic
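To see where the exact merge loses the tickets, it can help to print the key that the list comprehension rebuilds for each sentence. The sketch below is only an illustration: it retypes the sample data by hand (so it assumes clean strings with no stray whitespace), and rebuild_key is a hypothetical helper that mirrors the comprehension in the question.

```python
import pandas as pd

# Sample data retyped from the question (assumed to be clean strings).
standard = pd.DataFrame({
    "KEYWORD": [
        "Burst of bit errors",
        "sync and stand-by reload",
        "Port sub-modules modelling",
        "wires stop passing traffic",
        "Ignore Net flow",
    ],
    "TICKET": [89814, 66246, 70946, 60245, 59052],
})
texts = ["Burst of bit errors due to", "stop passing traffic"]


def rebuild_key(text, keywords):
    """Hypothetical helper: join every keyword word that is a substring of text."""
    return " ".join(w for kw in keywords for w in kw.split(" ") if w in text)


keys = [rebuild_key(t, standard["KEYWORD"]) for t in texts]
print(keys)

# Is each rebuilt key exactly equal to one of the KEYWORD values?
known = set(standard["KEYWORD"])
exact = [k in known for k in keys]
print(exact)
```

For the second sentence the rebuilt key is "stop passing traffic", which is not equal to "wires stop passing traffic", so an exact merge on KEYWORD cannot find its ticket. On this retyped data the first key does equal a keyword, so the NaN reported for that row presumably comes from something not visible in the post, such as extra whitespace in the spreadsheet cells; that part is a guess.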
But the result I want should look like this:
KEYWORD                     sentenced_data              TICKET
Burst of bit errors         Burst of bit errors due to  89814
wires stop passing traffic  stop passing traffic        60245
Could someone please help me solve this?
Answer 0 (score: 1)
Try the following code.
Here df is your first DataFrame and df1 is the second one:
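# DataFrame.append was removed in pandas 2.0, so collect the pieces and concatenate once
parts = []
for text in df1.TEXT_DATA:
    # keep the rows whose KEYWORD contains the text, or is contained in it
    parts.append(
        df[df.apply(lambda row: row.KEYWORD in text or
                    text in row.KEYWORD, axis=1)]
    )
res = pd.concat(parts)
print(res)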
Output:
                      KEYWORD  TICKET
0         Burst of bit errors   89814
3  wires stop passing traffic   60245

Here is another approach, which reproduces the expected output exactly:
columns = ['KEYWORD', 'sentenced_data', 'TICKET']
parts = []  # matched rows, one piece per text
for text in df1.TEXT_DATA:  # loop through the second DataFrame
    # boolean Series: True where KEYWORD contains the text or the text contains KEYWORD
    # (both checks use the "in" keyword in Python)
    bools = df.apply(lambda row: row.KEYWORD in text or text in row.KEYWORD, axis=1)
    if bools.any():
        # filter df by the boolean Series and attach the text as a new column
        parts.append(df[bools].assign(sentenced_data=text))
# DataFrame.append was removed in pandas 2.0, so concatenate the pieces once
res = pd.concat(parts)[columns] if parts else pd.DataFrame(columns=columns)
print(res)
Output:
                      KEYWORD              sentenced_data TICKET
0         Burst of bit errors  Burst of bit errors due to  89814
3  wires stop passing traffic        stop passing traffic  60245
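For comparison, the same pairing can be sketched without looping over the frames by cross-joining them and then filtering. This is a different technique from the answer above, not a rewrite of it. It assumes pandas 1.2 or later (for how="cross"), the sample data is retyped from the question, and the names keywords, texts, pairs, and result are made up for the illustration.

```python
import pandas as pd

# Sample data retyped from the question.
keywords = pd.DataFrame({
    "KEYWORD": [
        "Burst of bit errors",
        "sync and stand-by reload",
        "Port sub-modules modelling",
        "wires stop passing traffic",
        "Ignore Net flow",
    ],
    "TICKET": [89814, 66246, 70946, 60245, 59052],
})
texts = pd.DataFrame({
    "TEXT_DATA": ["Burst of bit errors due to", "stop passing traffic"],
})

# Pair every keyword row with every text row (cross join, pandas >= 1.2).
pairs = keywords.merge(texts, how="cross")

# Keep only the pairs where one string contains the other.
mask = [
    kw in txt or txt in kw
    for kw, txt in zip(pairs["KEYWORD"], pairs["TEXT_DATA"])
]
result = pairs.loc[mask].rename(columns={"TEXT_DATA": "sentenced_data"})
result = result[["KEYWORD", "sentenced_data", "TICKET"]].reset_index(drop=True)
print(result)
```

A cross join materialises every keyword/text pair, so memory grows with len(keywords) * len(texts); for large inputs the explicit loop may be the safer choice.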
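In Python, the "in" keyword applied to two strings, as in a in b, returns True when a occurs anywhere inside b (a partial match) or is equal to b (a 100% match); otherwise it returns False.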
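A few quick checks make the substring behaviour concrete. The strings come from the sample data; either_contains is a hypothetical helper name for the two-way test used in the loops above.

```python
keyword = "wires stop passing traffic"
text = "stop passing traffic"

# The operator is directional: the left string must occur inside the right one.
text_in_keyword = text in keyword      # the text is a substring of the keyword
keyword_in_text = keyword in text      # the keyword is longer, so it cannot be inside
print(text_in_keyword, keyword_in_text)

# The comparison is case-sensitive.
lower_in_title = "burst" in "Burst of bit errors"
print(lower_in_title)


def either_contains(a, b):
    """Hypothetical helper: True if either string contains the other."""
    return a in b or b in a


print(either_contains(keyword, text))
print(either_contains("Ignore Net flow", text))
```

Checking both directions is what lets "stop passing traffic" pair with the longer keyword while "Burst of bit errors due to" pairs with the shorter one.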