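I would like to mark/remove quotations in news articles. The goal is to run NER and filter out every entity that appears inside quotation marks.
This GitHub issue has code for removing the quotations in a single sentence: https://github.com/explosion/spaCy/issues/3196
However, I need to remove quotations across an entire (often malformed) document.
My implementation below seems to work (to some extent) on well-formed documents. But I run the matcher repeatedly inside a while loop, which seems very inefficient.
For reference, CoreNLP has a similar annotator: https://stanfordnlp.github.io/CoreNLP/quote.html#description
Does anyone have ideas for a better/more robust approach using spaCy?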
from spacy.matcher import Matcher


class QuotedTextMatcher(object):
    """
    Merge each double-quoted passage into a single token and tag it "quote".

    TODO: add some protections for missing/dangling quotation marks
    such as maximum quotation length and perhaps checking that there
    is no space after the first quote and before the last quote
    Modified From: https://github.com/explosion/spaCy/issues/3196
    """

    def __init__(self, nlp):
        self.matcher = Matcher(nlp.vocab)
        # An opening quote, one or more tokens of any kind, a closing quote.
        quoted_text_pattern = [
            [{"ORTH": '"'}, {"OP": "+"}, {"ORTH": '"'}],
        ]
        # spaCy v2 signature: add(key, on_match, *patterns)
        self.matcher.add("QUOTED", None, *quoted_text_pattern)

    def __call__(self, doc):
        # Re-run the matcher after every merge until no match is left.
        while True:
            matches = self.matcher(doc)
            if len(matches) == 0:
                return doc
            match_id, start, end = matches[0]
            span = doc[start:end]
            print("merging", span, len(span))
            # doc.retokenize() (spaCy v2.1+) replaces the deprecated span.merge()
            with doc.retokenize() as retokenizer:
                retokenizer.merge(span)
            # The merged quotation is now the single token at index `start`.
            for word in doc[start].subtree:
                word.tag_ = "quote"


if __name__ == "__main__":
    import spacy

    nlp = spacy.load("en_core_web_lg")
    quoted_text_matcher = QuotedTextMatcher(nlp)
    nlp.add_pipe(quoted_text_matcher, first=True)
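
A single pass over the tokens might avoid re-running the matcher after every merge. The following is only a rough sketch under stated assumptions, not a tested spaCy component: it works on plain lists of token strings so that it is self-contained, the names `find_quote_spans`, `outside_quotes`, `QUOTE_CHARS` and the `max_len` cutoff are hypothetical, and the dangling-quote rule (give up on an opening mark when the quotation would exceed `max_len` tokens) is just one possible reading of the TODO above.

```python
from typing import List, Tuple

# Assumption: straight and curly double quotes all count as quotation marks.
QUOTE_CHARS = {'"', "\u201c", "\u201d"}


def find_quote_spans(tokens: List[str], max_len: int = 50) -> List[Tuple[int, int]]:
    """Return (start, end) token indices, end exclusive, for each quotation.

    Quote marks are paired in one left-to-right pass. If the closing mark is
    more than `max_len` tokens away, the opening mark is treated as dangling
    and the later mark becomes the new candidate opening mark.
    """
    spans = []
    open_idx = None
    for i, tok in enumerate(tokens):
        if tok not in QUOTE_CHARS:
            continue
        if open_idx is None:
            open_idx = i  # first mark of a pair
        elif i - open_idx + 1 <= max_len:
            spans.append((open_idx, i + 1))  # span includes both marks
            open_idx = None
        else:
            open_idx = i  # earlier mark was dangling; restart from here
    return spans


def outside_quotes(
    entities: List[Tuple[int, int, str]],
    quote_spans: List[Tuple[int, int]],
) -> List[Tuple[int, int, str]]:
    """Keep only (start, end, label) entities that overlap no quote span."""
    return [
        ent
        for ent in entities
        if not any(ent[0] < q_end and q_start < ent[1] for q_start, q_end in quote_spans)
    ]


if __name__ == "__main__":
    tokens = ["He", "said", '"', "Apple", "rocks", '"', "in", "Paris", "."]
    entities = [(3, 4, "ORG"), (7, 8, "GPE")]
    spans = find_quote_spans(tokens)
    print(spans)
    print(outside_quotes(entities, spans))
```

With spaCy, the token list could presumably be built as `[t.text for t in doc]` and the entity tuples as `(ent.start, ent.end, ent.label_)` for each `ent` in `doc.ents`, so the filter would run after NER instead of retokenizing before it.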