我有不同的数据框架。如何找到匹配并合并数据帧

时间:2018-05-14 05:39:06

标签: python python-3.x pandas machine-learning

我有两个不同的数据框,如图所示

     df1
     ==================================                              
     KEYWORD                     TICKET
     Burst of bit errors          89814
     sync and stand-by reload     66246
     Port sub-modules modelling   70946
     wires stop passing traffic   60245
     Ignore Net flow              59052


     df2
     ==========================         
     TEXT_DATA
     Burst of bit errors due to
     stop passing traffic

具有部分匹配。请帮帮我。这是我开发的代码片段

import pandas as pd

Standard_Data = pd.read_excel('bOOK2.xlsx',usecols=[0,1])

print(Standard_Data)

     #Standard_Data
     ==================================                              
     KEYWORD                     TICKET
     Burst of bit errors          89814
     sync and stand-by reload     66246
     Port sub-modules modelling   70946
     wires stop passing traffic   60245
     Ignore Net flow              59052

keyword_data = Standard_Data['KEYWORD'].values.tolist()

input_data = pd.read_excel('book1.xlsx',usecols=[1])

print(input_data)

     input_data
     ==========================         
     TEXT_DATA
     Burst of bit errors due to
     stop passing traffic

#simply df1 = Standard_Data , df2 = input_Data
sentenced_data = input_data['Text_Data'].values.tolist()

df = pd.DataFrame({'sentenced_data':sentenced_data})

print(df)

df['MATCHED_KEYWORD'] = (df['sentenced_data'].apply(lambda x: [w for i in 
                                      keyword_data
                                      for w in i.split(' ')
                                      if w in (x)]))

df['KEYWORD'] = df['MATCHED_KEYWORD'].apply(','.join)

df['KEYWORD'] = df['KEYWORD'].str.replace(',',' ')

Z = Standard_Data.merge(df,on='KEYWORD',how='right')
print(Z)

我得到的结果就像

KEYWORD                 TICKET              sentenced_data

Burst of bit errors       NaN        Burst of bit errors due to   
stop passing traffic      NaN        stop passing traffic   

但我想要的结果应该是这样的

KEYWORD                           sentenced_data               TICKET
Burst of bit errors               Burst of bit errors due to   89814
wires stop passing traffic        stop passing traffic         66246

请有人帮我解决这个问题

1 个答案:

答案 0 :(得分:1)

请尝试以下代码:

df是你的第一个数据帧,df1是第二个数据框

res = pd.DataFrame()
for text in df1.TEXT_DATA:
        res = res.append(
              df[df.apply(lambda row: row.KEYWORD in text or
                                    text in row.KEYWORD, axis=1)]
              )
print(res)

输出:

                      KEYWORD TICKET
0         Burst of bit errors  89814
3  wires stop passing traffic  60245

<小时/> 这是另一种方法,可以使完全与预期输出相同:

res = pd.DataFrame(columns=['KEYWORD', 'TICKET', 'sentenced_data']) # create an empty dataframe to store the answer
for text in df1.TEXT_DATA: # loop-through the second dataframe
    bools = df.apply(lambda row: row.KEYWORD in text or text in row.KEYWORD, axis=1) # return a boolean series if KEYWORD contains(by the "in" keyword in python) text or text contains KEYWORD
    if (bools.any()): # filter the df by the boolean series, append it to res, append the text to second column
        res = res.append(df[bools])
        res.iloc[-bools.sum():, 2] = text
res = res[['KEYWORD', 'sentenced_data', 'TICKET']]
print(res)

输出:

                      KEYWORD              sentenced_data TICKET
0         Burst of bit errors  Burst of bit errors due to  89814
3  wires stop passing traffic        stop passing traffic  60245

<小时/> 如果两个in部分匹配或100%匹配,则python中的True关键字将返回string;否则返回False