How to select rows that contain all keywords?

Time: 2018-01-04 07:34:00

Tags: python pandas

I have 2 CSV files, as shown below:

File-1

procedure   code
anand database  321-87
shiva network   321-123
jana audit  321-56
kalai recruitment   321-10

In file-1, every word in a row is a keyword.

File-2

s.no    procedure
1   kalai has a recruitment group
2   shiva is the network person in my office
3   he is the auditor in my office
4   anand is the database here
5   i bought a new phone this week
6   jana is working in the audit team

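In the scenario above, I need to select the rows of file-2 that contain all the keywords from a row of file-1. For example, row 1 of file-1 contains the 2 keywords 'anand' & 'database', so I need to select the rows of file-2 that contain both 'anand' and 'database'.

Can anyone help me?

2 answers:

Answer 0: (score: 1)

If df is relatively small, you can use str.contains. First, build the pattern from df.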

df

           procedure     code
0     anand database   321-87
1      shiva network  321-123
2         jana audit   321-56
3  kalai recruitment   321-10

p = df.procedure.str.split().str.join('.*?').str.cat(sep='|')

p
'anand.*?database|shiva.*?network|jana.*?audit|kalai.*?recruitment'
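To see what this pattern does and does not match, a quick check with the standard `re` module can help. This is only a sketch: the last two sentences are made up for illustration and do not come from the question.

```python
import re

# The pattern built above: one alternative per row of file-1.
p = 'anand.*?database|shiva.*?network|jana.*?audit|kalai.*?recruitment'

samples = [
    'anand is the database here',      # from file-2: both keywords, in file-1 order
    'he is the auditor in my office',  # from file-2: 'audit' appears, 'jana' does not
    'the database belongs to anand',   # made up: same keywords, reversed order
    'jana is the auditor',             # made up: 'audit' found inside 'auditor'
]

# Map each sentence to whether the pattern matches anywhere in it.
results = {s: bool(re.search(p, s)) for s in samples}
for sentence, matched in results.items():
    print(matched, sentence)
```

Because each alternative has the form `first.*?second`, the keywords have to appear in the same order as in file-1, and they are matched as substrings rather than as whole words.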

Now pass it to str.contains on df2.procedure:

df2[df2.procedure.str.contains(p)]

   s.no                                 procedure
0     1             kalai has a recruitment group
1     2  shiva is the network person in my office
3     4                anand is the database here
5     6         jana is working in the audit team
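If the keywords may appear in any order, or should only count as whole words, one possible variant is to build a lookahead per keyword instead of joining with `.*?`. The following is a minimal sketch, not part of the original answer; the helper name `row_pattern` is hypothetical, and the two frames are rebuilt from the sample data in the question so the block runs on its own.

```python
import re
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    'procedure': ['anand database', 'shiva network', 'jana audit', 'kalai recruitment'],
    'code': ['321-87', '321-123', '321-56', '321-10'],
})
df2 = pd.DataFrame({
    's.no': [1, 2, 3, 4, 5, 6],
    'procedure': [
        'kalai has a recruitment group',
        'shiva is the network person in my office',
        'he is the auditor in my office',
        'anand is the database here',
        'i bought a new phone this week',
        'jana is working in the audit team',
    ],
})

def row_pattern(words):
    # Hypothetical helper: one lookahead per keyword, so every keyword
    # must occur somewhere in the text as a whole word, in any order.
    return ''.join(r'(?=.*\b{}\b)'.format(re.escape(w)) for w in words)

# One alternative per file-1 row; non-capturing groups keep str.contains quiet.
p = '|'.join('(?:{})'.format(row_pattern(words))
             for words in df['procedure'].str.split())

selected = df2[df2['procedure'].str.contains(p, regex=True)]
print(selected)
```

On the sample data this selects the same four rows as the pattern above. As with the original pattern, the regex grows with the number of rows in file-1, so it is probably best suited to small keyword tables.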

Answer 1: (score: 1)

Another solution, besides regex, is flashtext, which will be faster if you have many keywords:

from flashtext import KeywordProcessor
keyword_processor = KeywordProcessor()
keyword_processor.add_keywords_from_list(df['procedure'].str.split().sum())

df2[df2['procedure'].apply(keyword_processor.extract_keywords).str.len()>1]

    s.no                                procedure
0     1             kalai has a recruitment group
1     2  shiva is the network person in my office
3     4                anand is the database here
5     6         jana is working in the audit team 
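Note that the `str.len()>1` filter keeps any row in which at least two keywords were found, even if those keywords come from different rows of file-1. If every keyword of a single file-1 row must be present, a stricter check can be sketched with plain Python sets. This swaps flashtext for a simpler technique and is not part of the original answer; the names `keyword_sets` and `matches_any_row` are hypothetical.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    'procedure': ['anand database', 'shiva network', 'jana audit', 'kalai recruitment'],
    'code': ['321-87', '321-123', '321-56', '321-10'],
})
df2 = pd.DataFrame({
    's.no': [1, 2, 3, 4, 5, 6],
    'procedure': [
        'kalai has a recruitment group',
        'shiva is the network person in my office',
        'he is the auditor in my office',
        'anand is the database here',
        'i bought a new phone this week',
        'jana is working in the audit team',
    ],
})

# One set of keywords per row of file-1.
keyword_sets = [set(words) for words in df['procedure'].str.split()]

def matches_any_row(sentence):
    # True when all keywords of at least one file-1 row occur
    # as whitespace-separated words in the sentence.
    words = set(sentence.split())
    return any(keywords <= words for keywords in keyword_sets)

selected = df2[df2['procedure'].apply(matches_any_row)]
print(selected)
```

This sketch splits on whitespace only and is case-sensitive, so text with punctuation or mixed case would need normalising first.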

To learn more about this library and its speed, see the further reading below.

Further reading:

  1. Docs

  2. Regex was taking 5 days to run. So I built a tool that did it in 15 minutes.

  3. An easy-to-use interface for pandas is still needed; let's wait until it is done.