How can I robustly remove quotations from news articles using spaCy?

Asked: 2019-12-27 19:00:23

Tags: python spacy

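I want to tag or remove quotations in news articles. The goal is to run NER and filter out every entity that appears inside quotation marks.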
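The filtering step on its own can be sketched as follows. This is a minimal sketch, assuming quoted spans and entities are both available as half-open `(start, end)` token offsets; the names `overlaps` and `drop_quoted_entities` and the example offsets are hypothetical, not part of spaCy.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) token offsets, end exclusive


def overlaps(a: Span, b: Span) -> bool:
    """Return True when two half-open token ranges share at least one token."""
    return a[0] < b[1] and b[0] < a[1]


def drop_quoted_entities(entities: List[Span], quotes: List[Span]) -> List[Span]:
    """Keep only the entities that do not overlap any quoted span."""
    return [ent for ent in entities if not any(overlaps(ent, q) for q in quotes)]


# Hypothetical offsets: tokens 3..8 form a quotation; three entities were found.
entities = [(0, 2), (4, 6), (10, 11)]
quotes = [(3, 9)]
kept = drop_quoted_entities(entities, quotes)  # the entity at (4, 6) is dropped
```

With spaCy, the entity offsets could be taken from `(ent.start, ent.end)` for each `ent` in `doc.ents`.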

This GitHub issue has code for removing quotations within a single sentence: https://github.com/explosion/spaCy/issues/3196

However, I need to remove quotations across an entire (often malformed) document.

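My implementation below seems to work (to some extent) on well-formed documents. But I run the matcher repeatedly inside a while loop, which seems very inefficient.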

For reference, CoreNLP has a similar annotator: https://stanfordnlp.github.io/CoreNLP/quote.html#description

Does anyone have ideas for a better or more robust approach using spaCy?

from spacy.matcher import Matcher

class QuotedTextMatcher(object):
    """
    TODO: add some protections for missing/dangling quotation marks
    such as maximum quotation length and perhaps checking that there
    is no space after the first quote and before the last quote
    Modified From: https://github.com/explosion/spaCy/issues/3196
    """

    def __init__(self, nlp):
        self.matcher = Matcher(nlp.vocab)
        quoted_text_pattern = [
            [{"ORTH": '"'}, {"OP": "+"}, {"ORTH": '"'},],
        ]
        self.matcher.add("QUOTED", None, *quoted_text_pattern)

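    def __call__(self, doc):
        # Only the first match is merged per pass; the matcher is then re-run
        # because merging changes the token indices of the remaining matches.
        while True:
            matches = self.matcher(doc)
            if len(matches) == 0:
                return doc
            match_id, start, end = matches[0]
            span = doc[start:end]
            print("merging", span, len(span))
            span.merge()
            for word in span.subtree:
                word.tag_ = "quote"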


if __name__ == "__main__":

    import spacy
    nlp = spacy.load("en_core_web_lg")
    quoted_text_matcher = QuotedTextMatcher(nlp)
    nlp.add_pipe(quoted_text_matcher, first=True)
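One way to avoid re-running the matcher is to pair quotation marks in a single left-to-right pass over the token texts. The following is a rough sketch rather than a tested spaCy component: it assumes every quotation mark is its own token, and `find_quote_spans`, `OPENERS`, `CLOSERS`, and the `max_len` cutoff are hypothetical names introduced here. The cutoff is a simple take on the "maximum quotation length" protection mentioned in the TODO above.

```python
from typing import List, Optional, Tuple

# Straight quotes can open or close; curly quotes are directional.
OPENERS = {'"', "\u201c"}
CLOSERS = {'"', "\u201d"}


def find_quote_spans(tokens: List[str], max_len: int = 100) -> List[Tuple[int, int]]:
    """Return half-open (start, end) token ranges for quoted text, marks included.

    An opening mark is abandoned as dangling when no closing mark
    appears within max_len tokens after it.
    """
    spans: List[Tuple[int, int]] = []
    open_at: Optional[int] = None
    for i, tok in enumerate(tokens):
        if open_at is None:
            if tok in OPENERS:
                open_at = i
        elif tok in CLOSERS:
            spans.append((open_at, i + 1))
            open_at = None
        elif i - open_at >= max_len:
            # Quotation too long: treat the opener as dangling and move on.
            open_at = None
    return spans


tokens = ["He", "said", '"', "hello", "there", '"', "and", "\u201c", "bye", "\u201d", "."]
spans = find_quote_spans(tokens)
```

With spaCy, the token texts could come from `[t.text for t in doc]`, and each pair maps back to `doc[start:end]`. One known weakness: after a straight-quote opener is abandoned, its eventual closing mark is read as a new opener, so pairing can stay out of step for the rest of the document.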

0 answers:

No answers yet.