I have 2 CSV files, as shown below.
file-1:
procedure code
anand database 321-87
shiva network 321-123
jana audit 321-56
kalai recruitment 321-10
In file-1, each word in a row is a keyword.
file-2:
s.no procedure
1 kalai has a recruitment group
2 shiva is the network person in my office
3 he is the auditor in my office
4 anand is the database here
5 i bought a new phone this week
6 jana is working in the audit team
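In the scenario above, I need to select the rows in file-2 that contain all the keywords from a row in file-1. For example, row 1 of file-1 contains the 2 keywords 'anand' & 'database', so I need to select the rows in file-2 that contain both 'anand' & 'database'.
Can anyone help me with this?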
Answer 0 (score: 1)
If df is relatively small, you can use str.contains. First, build a pattern from df.
df
procedure code
0 anand database 321-87
1 shiva network 321-123
2 jana audit 321-56
3 kalai recruitment 321-10
p = df.procedure.str.split().str.join('.*?').str.cat(sep='|')
p
'anand.*?database|shiva.*?network|jana.*?audit|kalai.*?recruitment'
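The chained call above can be read one step at a time. The following is a minimal sketch that rebuilds only the procedure column of df by hand (an assumption made so the snippet runs on its own) and names each intermediate result:

```python
import pandas as pd

# Assumed stand-in for file-1; only the column used by the pattern is included.
df = pd.DataFrame({
    "procedure": ["anand database", "shiva network", "jana audit", "kalai recruitment"],
})

# Step 1: split each row into its keywords (one list per row).
tokens = df["procedure"].str.split()

# Step 2: join the keywords of a row with a lazy "anything in between" wildcard.
per_row = tokens.str.join(".*?")

# Step 3: concatenate the per-row patterns into one alternation.
p = per_row.str.cat(sep="|")
print(p)
```

Note that each alternative expects the keywords in the same order as they appear in file-1.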
Now, pass it to str.contains on df2.procedure.
df2[df2.procedure.str.contains(p)]
s.no procedure
0 1 kalai has a recruitment group
1 2 shiva is the network person in my office
3 4 anand is the database here
5 6 jana is working in the audit team
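The pattern above requires the keywords to appear in file-1 order and also matches substrings inside longer words. If that matters, one possible variant (a sketch, not part of the original answer) uses one lookahead per keyword with word boundaries. The helper name row_pattern and the hand-built frames are assumptions for illustration:

```python
import re
import pandas as pd

# Assumed stand-ins for file-1 and file-2.
df = pd.DataFrame({
    "procedure": ["anand database", "shiva network", "jana audit", "kalai recruitment"],
    "code": ["321-87", "321-123", "321-56", "321-10"],
})
df2 = pd.DataFrame({
    "s.no": [1, 2, 3, 4, 5, 6],
    "procedure": [
        "kalai has a recruitment group",
        "shiva is the network person in my office",
        "he is the auditor in my office",
        "anand is the database here",
        "i bought a new phone this week",
        "jana is working in the audit team",
    ],
})

def row_pattern(keywords):
    # Hypothetical helper: one lookahead per keyword, so every keyword must
    # occur somewhere in the text as a whole word, in any order.
    return "".join(rf"(?=.*\b{re.escape(k)}\b)" for k in keywords)

# One alternative per file-1 row; a file-2 row matches if any alternative holds.
pattern = "|".join(row_pattern(p.split()) for p in df["procedure"])
matched = df2[df2["procedure"].str.contains(pattern, regex=True)]
print(matched)
```

re.escape guards against keywords that contain regex metacharacters, which the plain join would pass through unescaped.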
Answer 1 (score: 1)
Another solution besides regex is flashtext, which will be faster if you have many more keywords, i.e.
from flashtext import KeywordProcessor
keyword_processor = KeywordProcessor()
keyword_processor.add_keywords_from_list(df['procedure'].str.split().sum())
df2[df2['procedure'].apply(keyword_processor.extract_keywords).str.len()>1]
s.no procedure
0 1 kalai has a recruitment group
1 2 shiva is the network person in my office
3 4 anand is the database here
5 6 jana is working in the audit team
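One caveat: the len() > 1 test counts any two keywords, so a sentence that mixes keywords from different file-1 rows (say 'anand' and 'network') would also be selected. If all keywords must come from the same file-1 row, a stricter check is possible. The sketch below uses plain Python sets instead of flashtext and assumes whitespace tokenization is good enough. The name has_full_keyword_row and the hand-built frames are illustrative assumptions:

```python
import pandas as pd

# Assumed stand-ins for file-1 and file-2.
df = pd.DataFrame({
    "procedure": ["anand database", "shiva network", "jana audit", "kalai recruitment"],
})
df2 = pd.DataFrame({
    "s.no": [1, 2, 3, 4, 5, 6],
    "procedure": [
        "kalai has a recruitment group",
        "shiva is the network person in my office",
        "he is the auditor in my office",
        "anand is the database here",
        "i bought a new phone this week",
        "jana is working in the audit team",
    ],
})

# One keyword set per file-1 row.
keyword_sets = [set(p.split()) for p in df["procedure"]]

def has_full_keyword_row(text):
    # Hypothetical helper: True only if every keyword of at least one
    # file-1 row occurs among the whitespace-separated words of the text.
    words = set(text.split())
    return any(keywords <= words for keywords in keyword_sets)

strict = df2[df2["procedure"].apply(has_full_keyword_row)]
print(strict)
```

This loops over every file-1 row for each sentence, so it trades the speed of flashtext for the stricter per-row condition.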
To learn more about this library and its speed, see here.
Further reading:
Regex was taking 5 days to run. So I built a tool that did it in 15 minutes.
An interface that is easy to use from pandas is still needed; let's wait until it is done.