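Pandas map two dataframes using regex

Time: 2020-10-27 21:22:30

Tags: python regex pandas

I have two dataframes, one with text information and another with regexes and patterns. What I need to do is map a column from the second dataframe onto the first using the regexes.

Edit: What I need to do is apply each regex to every df['text'] row and, if there is a match, add the Pattern to a new column.

Sample data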

import pandas as pd

text_dict = {'text': ['customer and increased repair and remodel activity as well as from other sales',
                      'sales for the overseas customers',
                      'marketing approach is driving strong play from top tier customers',
                      'employees in India have been the continuance of remote work will impact productivity',
                      'sales due to higher customer']}

regex_dict = {'Pattern': ['Sales + customer', 'Marketing + customer', 'Employee * Productivity'],
              'regex': ['(?:sales\\w*)(?:[^,.?])*(?:customer\\w*)|(?:customer\\w*)(?:[^,.?])*(?:sales\\w*)',
                        '(?:marketing\\w*)(?:[^,.?])*(?:customer\\w*)|(?:customer\\w*)(?:[^,.?])*(?:marketing\\w*)',
                        '(?:employee\\w*)(?:[^\n])*(?:productivity\\w*)|(?:productivity\\w*)(?:[^\n])*(?:employee\\w*)']}

# Build the two dataframes used below
df = pd.DataFrame(text_dict)
regex = pd.DataFrame(regex_dict)

df

                                                text
0  customer and increased repair and remodel acti...
1                   sales for the overseas customers
2  marketing approach is driving strong play from...
3  employees in India have been the continuance o...
4                       sales due to higher customer

regex

                   Pattern                                              regex
0         Sales + customer  (?:sales\w*)(?:[^,.?])*(?:customer\w*)|(?:cust...
1     Marketing + customer  (?:marketing\w*)(?:[^,.?])*(?:customer\w*)|(?:...
2  Employee * Productivity  (?:employee\w*)(?:[^\n])*(?:productivity\w*)|(...

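Desired output

                                                text                  Pattern
0  customer and increased repair and remodel acti...         Sales + customer
1                   sales for the overseas customers         Sales + customer
2  marketing approach is driving strong play from...     Marketing + customer
3  employees in India have been the continuance o...  Employee * Productivity
4                       sales due to higher customer         Sales + customer

I tried the following: I wrote a function that returns the Pattern when there is a match, then iterated over all the rows of the regex dataframe.

import re

def finding_keywords(regex, match, keyword):
    # Return the keyword if the regex matches the text, otherwise None
    if re.search(regex, match):
        return keyword
    return None

for index, row in regex.iterrows():
    # Note: this assigns the whole column on every pass
    df['Pattern'] = df['text'].apply(lambda x: finding_keywords(regex['regex'][index], x, regex['Pattern'][index]))

The problem with this is that every iteration erases the previous mappings. Whichever regex is processed last (foo in my run) is the only one left with a pattern.

One solution would be to iterate over the regex dataframe and then over df inside that loop, which avoids losing information, but I am looking for the fastest solution.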
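
The overwrite described above can be reproduced in isolation. The following is only a sketch: the `toy` and `rules` frames and the `label_if_match` helper are hypothetical names with made-up data, not part of the question's dataset. It shows that assigning the whole column on each pass keeps only the result of the last regex.

```python
import re
import pandas as pd

# Hypothetical toy data, chosen only to illustrate the overwrite
toy = pd.DataFrame({"text": ["alpha one", "beta two"]})
rules = pd.DataFrame({"Pattern": ["A", "B"], "regex": ["alpha", "beta"]})

def label_if_match(rx, text, label):
    # Return the label when the regex matches, otherwise None
    return label if re.search(rx, text) else None

for _, rule in rules.iterrows():
    # Each pass assigns the entire column, discarding earlier matches
    toy["Pattern"] = toy["text"].apply(
        lambda t: label_if_match(rule["regex"], t, rule["Pattern"])
    )

# Row 0 matched "alpha" on the first pass, but the second pass replaced it
print(toy)
```

After the loop, only the row matching the last regex still carries a label; the first row's match has been replaced by a missing value.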

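1 answer:

Answer 0 (score: 1)

You can loop over the unique values of the regex column in the regex dataframe, apply each one to the text column of df, and store the matching regex in a new regex column. Then merge in the Pattern column and drop the regex column.

The key to my approach is to first create the column as NaN and then use fillna in each iteration, so that values already filled in are not overwritten.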

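import re
import numpy as np
import pandas as pd

# Unique regex strings from the regex dataframe
srs = regex['regex'].unique()
# Start with an all-NaN column so fillna only touches rows without a match yet
df['regex'] = np.nan

for reg in srs:
    # Fill still-empty rows with the regex string where it matches the text
    df['regex'] = df['regex'].fillna(df['text'].apply(lambda x: reg
                               if re.search(reg, x) else np.nan))

# Look up the Pattern for each matched regex, then drop the helper column
df = pd.merge(df, regex, how='left', on='regex').drop('regex', axis=1)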

df

Out[1]: 
                                                text                  Pattern
0  customer and increased repair and remodel acti...         Sales + customer
1                   sales for the overseas customers         Sales + customer
2  marketing approach is driving strong play from...     Marketing + customer
3  employees in India have been the continuance o...  Employee * Productivity
4                       sales due to higher customer         Sales + customer
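
As a possible variation on the same first-match-wins idea, the sketch below swaps the `apply`/`fillna`/`merge` steps for `Series.str.contains` plus a boolean mask. This is not the approach from the answer above, just an alternative under the assumption that each row should receive the Pattern of the first regex that matches it. It rebuilds the sample frames so that it runs on its own.

```python
import pandas as pd

text_dict = {'text': ['customer and increased repair and remodel activity as well as from other sales',
                      'sales for the overseas customers',
                      'marketing approach is driving strong play from top tier customers',
                      'employees in India have been the continuance of remote work will impact productivity',
                      'sales due to higher customer']}

regex_dict = {'Pattern': ['Sales + customer', 'Marketing + customer', 'Employee * Productivity'],
              'regex': ['(?:sales\\w*)(?:[^,.?])*(?:customer\\w*)|(?:customer\\w*)(?:[^,.?])*(?:sales\\w*)',
                        '(?:marketing\\w*)(?:[^,.?])*(?:customer\\w*)|(?:customer\\w*)(?:[^,.?])*(?:marketing\\w*)',
                        '(?:employee\\w*)(?:[^\n])*(?:productivity\\w*)|(?:productivity\\w*)(?:[^\n])*(?:employee\\w*)']}

df = pd.DataFrame(text_dict)
regex = pd.DataFrame(regex_dict)

# Start with an empty (all-missing) label column
df['Pattern'] = None

for pattern, rx in zip(regex['Pattern'], regex['regex']):
    # Rows that are still unlabeled and whose text matches this regex
    mask = df['Pattern'].isna() & df['text'].str.contains(rx, regex=True)
    # Only those rows are written, so earlier labels are never overwritten
    df.loc[mask, 'Pattern'] = pattern

print(df)
```

On the sample data this gives the same Pattern column as the output above. It avoids the helper column and the merge; whether it is actually faster on a large frame has not been measured here and would need to be timed.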