Question

I have been working with SpaCy on an NLP project to get the words to the left and right of every entity and dump them as JSON.

Here is the function I tried:

def __init__(self):
    self.new_side_words_json = dict()

def side_words(self, text):
    words = nlp(text)  # parsed Doc; tokens are indexed by document position
    side_words_json = [{'LeftSideWord': str(words[entity.start - 1]),
                        'Entity': str(entity),
                        'RightSideWord': str(words[entity.end])}
                       if not words[entity.start - 1].is_punct 
                       and not words[entity.start - 1].is_space 
                       and not words[entity.end].is_punct
                       and not words[entity.end].is_space
                       else
                       {'LeftSideWord': str(words[entity.start - 2]),
                        'Entity': str(entity),
                        'RightSideWord': str(words[entity.end + 1])}
                       for entity in nlp(text).ents]
    self.new_side_words_json['SideWords'] = side_words_json

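This algorithm works in some cases. However, I think it is a very ugly solution because it does not handle the conditions thoroughly enough. It depends heavily on the format of the text. I want to build something that works for the entities in any document.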

What I mean is that a text file can contain many punctuation marks or spaces, and I only check one and two positions away on each side.

What I want is an algorithm that finds the meaningful words before and after an entity, skipping punctuation, whitespace, and even stop words.

How can I adapt this algorithm to get the previous and next meaningful word for every entity?

Answer 1

I finally found a solution. It is still ugly, but it works the way I want.

I am posting the code here so that anyone who runs into the same kind of problem can get ideas for a solution.

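    # Remember the last entity; its end index is the upper bound in the checks below.
    for entity in doc.ents:
        self.entity_list = [entity]
    # Try the token right after the entity, then the next two; use 'null' if none qualifies.
    right = [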
        {'Right': str(words[entity.end])} if (entity.end < self.entity_list[-1].end) and not words[entity.end].is_punct and not words[entity.end].is_space
        else
        {'Right': str(words[entity.end + 1])} if (entity.end + 1 < self.entity_list[-1].end) and not words[entity.end + 1].is_punct and not words[entity.end + 1].is_space
        else
        {'Right': str(words[entity.end + 2])} if (entity.end + 2 < self.entity_list[-1].end) and not words[entity.end + 2].is_punct and not words[entity.end + 2].is_space
        else
        {'Right': 'null'}
        for entity in nlp(text).ents]
    result = [{**dict_left, **dict_entities, **dict_right} for
              dict_left, dict_entities, dict_right in
              zip(left, entities, right)]

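The problem was indexing the right-hand word: after the last entity there is no word, so it raised an error about reading past the end. I added an index bounds check to fix this.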
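The bounds check can be generalized into a loop that walks outward from the entity until it finds a usable token or runs off the document. The sketch below is not the code from this answer: `nearest_meaningful` and the `Tok` class are hypothetical names, and `Tok` only mimics the `text`, `is_punct`, `is_space`, and `is_stop` attributes of a spaCy token so that the example runs without a language model. Unlike the answer, it does not stop after three positions.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Tok:
    """Hypothetical stand-in for a spaCy Token; only the flags used below."""
    text: str
    is_punct: bool = False
    is_space: bool = False
    is_stop: bool = False


def nearest_meaningful(tokens: Sequence, start: int, step: int) -> Optional[str]:
    """Walk from index `start` in direction `step` (+1 or -1) and return the
    text of the first token that is not punctuation, whitespace, or a stop
    word. Return None when the walk runs off either end of the sequence."""
    i = start
    while 0 <= i < len(tokens):
        tok = tokens[i]
        if not (tok.is_punct or tok.is_space or tok.is_stop):
            return tok.text
        i += step
    return None


# "Apple , the company": the entity "Apple" covers index 0 (start=0, end=1).
tokens = [
    Tok("Apple"),
    Tok(",", is_punct=True),
    Tok("the", is_stop=True),
    Tok("company"),
]
left = nearest_meaningful(tokens, 0 - 1, -1)  # entity.start - 1: nothing to the left
right = nearest_meaningful(tokens, 1, +1)     # entity.end: skips "," and "the"
```

With a real spaCy `Doc`, the same function could presumably be called as `nearest_meaningful(doc, entity.start - 1, -1)` and `nearest_meaningful(doc, entity.end, +1)`, since a `Doc` supports `len()` and integer indexing. A `None` result can then be mapped to `'null'` when building the JSON.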

I also had to split the if blocks by JSON label to get more precise results for each label, and then simply merge them with zip().
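The merge step can be illustrated on its own. In this minimal sketch the three lists hold made-up sample values, and the `'Left'` and `'Entity'` keys are assumptions, since the answer only shows the `'Right'` key. The `'SideWords'` wrapper key is taken from the question.

```python
import json

# One dict per entity in each list; all three lists are in the same entity order.
left = [{'Left': 'visited'}, {'Left': 'null'}]
entities = [{'Entity': 'Paris'}, {'Entity': 'Google'}]
right = [{'Right': 'yesterday'}, {'Right': 'announced'}]

# zip() pairs the i-th dict of each list; ** unpacking merges them into one dict.
result = [{**dict_left, **dict_entity, **dict_right}
          for dict_left, dict_entity, dict_right in zip(left, entities, right)]

# Serialize under the 'SideWords' key used in the question.
payload = json.dumps({'SideWords': result})
```

Note that `zip()` stops at the shortest input, so the three lists need to have exactly one item per entity or some entities will be silently dropped.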

Getting the left and right words of entities using SpaCy
