I wrote some user-defined functions to remove named entities (using NLTK) from a list of text sentences/paragraphs in Python. The problem I'm running into is that my approach is very slow, especially for large amounts of data. Does anyone have suggestions for how to optimize it to run faster?
import nltk
import string
# Function to reverse tokenization
def untokenize(tokens):
    return "".join([" " + i if not i.startswith("'") and i not in string.punctuation else i for i in tokens]).strip()
# Remove named entities
def ne_removal(text):
    tokens = nltk.word_tokenize(text)
    chunked = nltk.ne_chunk(nltk.pos_tag(tokens))
    # Keep only leaves outside named-entity subtrees
    tokens = [leaf[0] for leaf in chunked if not isinstance(leaf, nltk.Tree)]
    return untokenize(tokens)
To use the code, I typically have a list of texts and call ne_removal on each one through a list comprehension. For example:
text_list = ["Bob Smith went to the store.", "Jane Doe is my friend."]
named_entities_removed = [ne_removal(text) for text in text_list]
print(named_entities_removed)
## OUT: ['went to the store.', 'is my friend.']
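For completeness: running this assumes the relevant NLTK data packages are already installed. If they are not, a one-time download along these lines is needed (resource names as in the standard NLTK distribution):

import nltk
# One-time downloads of the models used by word_tokenize, pos_tag, and ne_chunk
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")
nltk.download("maxent_ne_chunker")
nltk.download("words")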
Update: I tried switching to this batch-processing version of the code, but it is only slightly faster. Will keep exploring. Thanks for the input so far.
def extract_nonentities(tree):
    tokens = [leaf[0] for leaf in tree if not isinstance(leaf, nltk.Tree)]
    return untokenize(tokens)
def fast_ne_removal(text_list):
    token_list = [nltk.word_tokenize(text) for text in text_list]
    tagged = nltk.pos_tag_sents(token_list)
    chunked = nltk.ne_chunk_sents(tagged)
    non_entities = []
    for tree in chunked:
        non_entities.append(extract_nonentities(tree))
    return non_entities
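For reference, here is a minimal sketch of how one might time the two versions against each other (the timing harness itself is illustrative, not part of the original code):

import time

sample_texts = ["Bob Smith went to the store.", "Jane Doe is my friend."] * 100

# Time the per-sentence version
start = time.perf_counter()
slow_out = [ne_removal(text) for text in sample_texts]
print("per-sentence:", time.perf_counter() - start, "s")

# Time the batch version
start = time.perf_counter()
fast_out = fast_ne_removal(sample_texts)
print("batch:", time.perf_counter() - start, "s")

# Both versions should produce identical output
assert slow_out == fast_out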
Answer 0 (score: 2):
Every time you call ne_chunk(), it needs to initialize a chunker object and load the statistical model for chunking from disk; the same goes for pos_tag(). So instead of calling them on one sentence at a time, call their batch versions on the complete list of texts:
all_data = [ nltk.word_tokenize(sent) for sent in list_of_all_sents ]
tagged = nltk.pos_tag_sents(all_data)
chunked = nltk.ne_chunk_sents(tagged)
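From here you can feed each tree to your extract_nonentities helper to get the cleaned strings back. A minimal sketch:

# chunked may be a lazy generator, so iterate over it only once
cleaned = [extract_nonentities(tree) for tree in chunked]
print(cleaned)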
This should give you a considerable speed-up. If that is still too slow for your needs, try profiling your code and consider whether you need to switch to more heavyweight tools, as @Lenz suggested.
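A minimal profiling sketch using the standard-library cProfile module, reusing the fast_ne_removal and text_list names from the question (the "ne_stats" output filename is arbitrary):

import cProfile
import pstats

# Profile the batch pipeline and show the 10 most expensive calls by cumulative time
cProfile.run("fast_ne_removal(text_list)", "ne_stats")
pstats.Stats("ne_stats").sort_stats("cumulative").print_stats(10)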