How to extract multiple keywords from a given description

Time: 2019-01-28 05:36:57

Tags: pandas dataframe nlp

This is my dataset

No   Description
1    Paying Google ads
2    Purchasing Facebook Ads
3    Purchasing Ads
4    AirBnB repayment

I have a text file named entity.txt

0, Google
1, Facebook
2, Ads

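I need to detect every keyword from entity.txt that appears in the dataframe, whether a description contains one keyword or several. If no keyword is detected, the row should be labelled Other. My expected output is: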

No   Description                 Keyword
1    Paying Google ads           Google
2    Purchasing Facebook Ads     Facebook Ads
3    Purchasing LinkedIn Ads     LinkedIn Ads
4    AirBnB repayment            Other

This is what I did

with open('entity.txt') as f: 
    content = f.readlines()
content = [x.strip() for x in content ]
df['keyword'] = df['description'].apply(lambda x: ' '.join([i for i in content if i in x]))
df['keyword'] = df['keyword'].replace('', 'Other')

However, the result is

No   Description                 Keyword
1    Paying Google ads           Other
2    Purchasing Facebook Ads     Other
3    Purchasing LinkedIn Ads     Other
4    AirBnB repayment            Other

3 answers:

Answer 0: (score: 3)

Use str.findall to extract all values from df1 into lists, then convert empty lists to Other and join the values of every non-empty list with spaces using str.join:

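import numpy as np
import pandas as pd

# Keywords from entity.txt, held in a one-column DataFrame
df1 = pd.DataFrame({'entity': ['Google', 'Facebook', 'Ads']})
# Collect every match of the pattern (Google|Facebook|Ads) in each row
s = df['Description'].str.findall(r'({})'.format('|'.join(df1['entity'])))
# Empty lists become 'Other'; non-empty lists are joined with spaces
df['Keyword'] = np.where(s.astype(bool), s.str.join(' '), 'Other')
print(df)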

   No              Description       Keyword
0   1        Paying Google ads        Google
1   2  Purchasing Facebook Ads  Facebook Ads
2   3  Purchasing LinkedIn Ads           Ads
3   4         AirBnB repayment         Other
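
Note that row 0 yields only Google: the match is case-sensitive, so the lowercase "ads" is not picked up by the keyword "Ads". If case should be ignored, a minimal sketch could look like the following. The names `entities`, `pattern` and `matches` are illustrative, the sample frame is rebuilt from the output above, and `re.escape` is added on the assumption that keywords might contain regex metacharacters.

```python
import re

import numpy as np
import pandas as pd

# Sample data rebuilt from the printed output above
df = pd.DataFrame({
    'No': [1, 2, 3, 4],
    'Description': ['Paying Google ads', 'Purchasing Facebook Ads',
                    'Purchasing LinkedIn Ads', 'AirBnB repayment'],
})
entities = ['Google', 'Facebook', 'Ads']  # illustrative stand-in for df1['entity']

# re.escape guards against regex metacharacters inside a keyword
pattern = '|'.join(re.escape(e) for e in entities)

# flags=re.IGNORECASE lets the lowercase "ads" match the keyword "Ads"
matches = df['Description'].str.findall(pattern, flags=re.IGNORECASE)
df['Keyword'] = np.where(matches.astype(bool), matches.str.join(' '), 'Other')
print(df)
```

The matched text is returned as it is spelled in the description, so row 0 becomes "Google ads" rather than "Google Ads".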

Alternative, adapting your list-comprehension approach:

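# Keep each entity, in file order, that occurs as a substring of the description
# (unique() is used instead of set() so the order of joined keywords is stable)
s = df['Description'].apply(lambda x: [i for i in df1['entity'].unique() if i in x])
df['Keyword'] = np.where(s.astype(bool), s.str.join(' '), 'Other')
print(df)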
   No              Description       Keyword
0   1        Paying Google ads        Google
1   2  Purchasing Facebook Ads  Facebook Ads
2   3  Purchasing LinkedIn Ads           Ads
3   4         AirBnB repayment         Other
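
Both snippets assume df1 already holds the bare keywords. If entity.txt looks exactly as shown in the question, each line read with readlines() is a string such as "0, Google", which is never a substring of a description, and that would explain why every row came out as Other. One possible way to load the file, sketched with an in-memory stand-in for it (the column names `id` and `entity` are assumptions, not part of the original file):

```python
import io

import pandas as pd

# Stand-in for open('entity.txt'); contents copied from the question
entity_file = io.StringIO("0, Google\n1, Facebook\n2, Ads\n")

# header=None: the file has no header row
# skipinitialspace=True: drop the blank that follows each comma
df1 = pd.read_csv(entity_file, header=None, names=['id', 'entity'],
                  skipinitialspace=True)
print(df1['entity'].tolist())
```

With a real file, the path 'entity.txt' would be passed to pd.read_csv in place of the StringIO object.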

Answer 1: (score: 2)

Use findall

df.Description.str.findall(('|'.join(s.tolist()))).str[0]
0      Google
1    Facebook
2         Ads
3         NaN
Name: Description, dtype: object
df['Keyword']=df.Description.str.findall(('|'.join(s.tolist()))).str[0]

Input data

s
0      Google
1    Facebook
2         Ads
Name: s, dtype: object

Answer 2: (score: 2)

Use str.extract()

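# df1[1] is assumed to be the keyword column of entity.txt (e.g. read with header=None)
df['Keyword'] = df.Description.str.extract(r'({})'.format('|'.join(df1[1])), expand=False)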
print(df)

  No              Description    Keyword
0   1        Paying Google ads     Google
1   2  Purchasing Facebook Ads   Facebook
2   3  Purchasing LinkedIn Ads        Ads
3   4         AirBnB repayment        NaN
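
As the output shows, str.extract keeps only the first match in each row and leaves NaN where nothing matches, so row 1 loses "Ads" and row 3 is not labelled Other. If one keyword per row is enough, the NaN can be replaced afterwards. A minimal sketch, with the sample frame rebuilt from the output above and the pattern written out by hand rather than built from df1:

```python
import pandas as pd

# Sample data rebuilt from the printed output above
df = pd.DataFrame({
    'No': [1, 2, 3, 4],
    'Description': ['Paying Google ads', 'Purchasing Facebook Ads',
                    'Purchasing LinkedIn Ads', 'AirBnB repayment'],
})

# One capture group; expand=False returns a Series instead of a DataFrame
first = df['Description'].str.extract(r'(Google|Facebook|Ads)', expand=False)

# Rows without any match are NaN; label them 'Other'
df['Keyword'] = first.fillna('Other')
print(df)
```

When every keyword in a row is needed, as in the question, the str.findall approach from answer 0 is the better fit.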