Question

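I have a large pandas DataFrame. One column contains text that has been split into sentences, one sentence per row. I need to check whether the sentences contain terms used in various ontologies. Some of the ontologies are quite large, with more than 100,000 entries. In addition, some ontologies contain molecule names with hyphens, commas, and other characters that may or may not be present in the text being checked, hence the need for regular expressions.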

I came up with the code below, but it is not fast enough for my data. Any suggestions are welcome. Thanks!

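import pandas as pd
import re

# Two sample sentences standing in for the real data (one sentence per row)
sentences = ["""There is no point in driving yourself mad trying to stop
             yourself going mad""",
             "The ships hung in the sky in much the same way that bricks don’t"]

sentence_number = list(range(len(sentences)))
d = {'sentence': sentences, 'number': sentence_number}

df = pd.DataFrame(d)

# Sample patterns: words starting with "t" and words starting with "s"
regexes = [r'\bt\w+', r'\bs\w+']

# Join all patterns into one alternation and compile it once, case-insensitively
big_regex = '|'.join(regexes)
compiled_regex = re.compile(big_regex, re.I)

# Collect every match in each sentence
df['found_regexes'] = df.sentence.str.findall(compiled_regex)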
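
To illustrate the punctuation-tolerant matching the question describes (molecule names whose hyphens and commas may or may not appear in the text), a minimal sketch might look like the following. The term list, the helper `term_to_pattern`, and the `found_terms` column are hypothetical and not part of the original code. The sketch only shows one way to build such patterns and makes no claim about speed on 100,000+ entries.

```python
import re
import pandas as pd

# Hypothetical ontology terms; a real ontology would have far more entries.
terms = ["1,2-dichloroethane", "acetyl-CoA", "ship"]

def term_to_pattern(term):
    """Escape a term and let hyphens, commas and whitespace be optional."""
    # Split on the separators, escape each remaining piece, then allow
    # any run of separators (including none) between the pieces.
    parts = re.split(r"[-,\s]+", term)
    return r"[-,\s]*".join(re.escape(p) for p in parts if p)

# Longer patterns first, so the alternation tries the longest terms first.
patterns = sorted((term_to_pattern(t) for t in terms), key=len, reverse=True)

# One compiled alternation; the non-capturing group makes findall
# return the full matched text.
term_regex = re.compile(r"\b(?:" + "|".join(patterns) + r")\b", re.I)

df = pd.DataFrame({"sentence": [
    "Acetyl CoA reacts with 1,2 dichloroethane.",
    "A ship hung in the sky.",
]})

# Apply the compiled pattern's findall to every sentence.
df["found_terms"] = df["sentence"].map(term_regex.findall)
print(df)
```

Because the separators are optional, "acetyl-CoA" in the term list also matches "Acetyl CoA" in the text. The trade-off is that very loose patterns can over-match, so the separator class may need tightening for a given ontology.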

Identifying a list of regular expressions in a Pandas column

0 answers: