File-1

procedure          code
anand database     321-87
shiva network      321-123
jana audit         321-56
kalai recruitment  321-10

In file-1, every word in a row is a keyword.

File-2

s.no  procedure
1     kalai has a recruitment group
2     shiva is the network person in my office
3     he is the auditor in my office
4     anand is the database here
5     i bought a new phone this week
6     jana is working in the audit team

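In the scenario above, I need to select the rows of file-2 that contain all the keywords from any single row of file-1. For example, row 1 of file-1 contains the two keywords 'anand' and 'database', so I need to select the rows of file-2 that contain both 'anand' and 'database'.

Can anyone help me with this?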

Answer 1

If df is relatively small, you can use str.contains. Here df holds File-1 and df2 holds File-2. First, build a pattern from df.

df

           procedure     code
0     anand database   321-87
1      shiva network  321-123
2         jana audit   321-56
3  kalai recruitment   321-10

p = df.procedure.str.split().str.join('.*?').str.cat(sep='|')

p
'anand.*?database|shiva.*?network|jana.*?audit|kalai.*?recruitment'

Now pass it to str.contains on df2.procedure.

df2[df2.procedure.str.contains(p)]

   s.no                                 procedure
0     1             kalai has a recruitment group
1     2  shiva is the network person in my office
3     4                anand is the database here
5     6         jana is working in the audit team
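One caveat with the pattern above: 'anand.*?database' only matches when the keywords appear in that order, and it also matches inside longer words. A minimal sketch of an order-independent variant, using one lookahead per keyword and word boundaries, might look like this. It uses only the standard `re` module, and `build_pattern`, `file1_rows`, and `file2_lines` are illustrative names that are not part of the original answer.

```python
import re

# Rows of File-1: every word in a row is a keyword.
file1_rows = ["anand database", "shiva network", "jana audit", "kalai recruitment"]

# Lines of File-2 (the "procedure" column).
file2_lines = [
    "kalai has a recruitment group",
    "shiva is the network person in my office",
    "he is the auditor in my office",
    "anand is the database here",
    "i bought a new phone this week",
    "jana is working in the audit team",
]


def build_pattern(rows):
    """Build one regex that matches a line containing every word of any single row."""
    alternatives = []
    for row in rows:
        # One lookahead per keyword: each keyword must occur somewhere in the
        # line as a whole word, in any order.
        lookaheads = "".join(rf"(?=.*\b{re.escape(word)}\b)" for word in row.split())
        alternatives.append(f"(?:{lookaheads})")
    return "|".join(alternatives)


pattern = build_pattern(file1_rows)
selected = [line for line in file2_lines if re.search(pattern, line)]
print(selected)
```

Since the pattern is an ordinary regular expression string without capture groups, it should also be usable as `df2[df2.procedure.str.contains(pattern)]`, assuming the same df2 as above.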

Answer 2

Another solution besides regex is flashtext, which will be faster if you have many more keywords, i.e.

from flashtext import KeywordProcessor
keyword_processor = KeywordProcessor()
keyword_processor.add_keywords_from_list(df['procedure'].str.split().sum())

df2[df2['procedure'].apply(keyword_processor.extract_keywords).str.len()>1]

    s.no                                procedure
0     1             kalai has a recruitment group
1     2  shiva is the network person in my office
3     4                anand is the database here
5     6         jana is working in the audit team
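Note that the `str.len()>1` filter counts keywords pooled from all rows of File-1, so a line that mixes keywords from different rows (for example 'anand' and 'network') would also be selected. If the requirement is strictly "all keywords of one row", a set-based check is one possible alternative. The sketch below uses plain Python rather than flashtext, and `matches_any_row` and `keyword_sets` are hypothetical names introduced for illustration.

```python
# Rows of File-1: every word in a row is a keyword.
file1_rows = ["anand database", "shiva network", "jana audit", "kalai recruitment"]

# Lines of File-2 (the "procedure" column).
file2_lines = [
    "kalai has a recruitment group",
    "shiva is the network person in my office",
    "he is the auditor in my office",
    "anand is the database here",
    "i bought a new phone this week",
    "jana is working in the audit team",
]

# One keyword set per File-1 row, so keywords from different rows are never mixed.
keyword_sets = [set(row.split()) for row in file1_rows]


def matches_any_row(line, keyword_sets):
    """Return True if the line contains every keyword of at least one File-1 row."""
    words = set(line.split())
    # `keywords <= words` is a subset test: all keywords occur in the line.
    return any(keywords <= words for keywords in keyword_sets)


selected_lines = [line for line in file2_lines if matches_any_row(line, keyword_sets)]
print(selected_lines)
```

With pandas, the same check could be applied as `df2[df2['procedure'].apply(lambda s: matches_any_row(s, keyword_sets))]`. This splits on whitespace only, so punctuation attached to a word would need extra handling.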

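To learn more about this library and its speed, see the further reading below.

Further reading:

Docs
Regex was taking 5 days to run. So I built a tool that did it in 15 minutes.

An interface that is easy to use with pandas still needs to be implemented; let's wait until that is done.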

How do I select rows that contain all the keywords?
