How do I find and match each element of a list in each sentence?

Asked: 2019-04-09 08:53:33

Tags: python python-3.x

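I have a file containing some sentences. I use polyglot for named entity recognition and store all detected entities in a list. Now I want to check, for each sentence, whether any entity or pair of entities occurs in it, and print those that do.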

This is what I did:

from polyglot.text import Text

file = open('input_raw.txt', 'r')
input_file = file.read()
test = Text(input_file, hint_language_code='fa')

list_entity = []
for sent in test.sentences:
    #print(sent[:10], "\n")
    for entity in test.entities:
       list_entity.append(entity)

for i in range(len(test)):
    m = test.entities[i]
    n = test.words[m.start: m.end] # it shows only word not tag
    if str(n).split('.')[-1] in test: # if each entities exist in each sentence
         print(n)

It gives me an empty list.

Input:

 sentence1: Bill Gate is the founder of Microsoft.
 sentence2: Trump is the president of USA.

Expected output:

Bill Gate, Microsoft
Trump, USA

Output of list_entity:

I-PER(['Trump']), I-LOC(['USA'])

How can I check whether I-PER(['Trump']) and I-LOC(['USA']) are in the first sentence?

1 Answer:

Answer 0: (score: 1)

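For starters, inside your sentence loop you were appending the entities of the whole text (test.entities) to the list on every pass. Call entities on each sentence of the polyglot object instead.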

from polyglot.text import Text

# Read the whole input file; the context manager closes it afterwards.
with open('input_raw.txt', 'r') as f:
    input_file = f.read()
text = Text(input_file, hint_language_code='fa')

list_entity = []
for sentence in text.sentences:
    # Use the entities of the current sentence, not of the whole text.
    for entity in sentence.entities:
        #print(entity)
        list_entity.append(entity)

print(list_entity)

Now you no longer get an empty list.
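
To get the one-line-per-sentence output the question asks for, the entities can be grouped by sentence instead of being collected into a single flat list. The following is a minimal sketch and not part of the original answer: `entities_by_sentence` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins that mimic only the `.sentences` and `.entities` attributes used above, so the snippet runs without polyglot or its models. With a real polyglot `Text`, you would pass that object instead.

```python
from types import SimpleNamespace


def entities_by_sentence(text):
    """Return one list of entity strings per sentence.

    `text` is assumed to expose `.sentences`, and each sentence `.entities`,
    where every entity is an iterable of words (as a polyglot Chunk is).
    """
    grouped = []
    for sentence in text.sentences:
        grouped.append([" ".join(entity) for entity in sentence.entities])
    return grouped


# Stand-in data shaped like the question's example (hypothetical, no NER is run).
demo_text = SimpleNamespace(sentences=[
    SimpleNamespace(entities=[["Bill", "Gate"], ["Microsoft"]]),
    SimpleNamespace(entities=[["Trump"], ["USA"]]),
])

for names in entities_by_sentence(demo_text):
    print(", ".join(names))
# Prints:
# Bill Gate, Microsoft
# Trump, USA
```

Keeping one inner list per sentence preserves which entities occur together, which the flat `list_entity` loses.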


As for your question about identifying the entity terms,

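I haven't found a way to create an entity by hand, so the code below only checks whether the sentence has entities containing the same terms. A Chunk can hold several strings, so we iterate over them.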

from polyglot.text import Text

with open('input_raw.txt', 'r') as f:
    input_file = f.read()
text = Text(input_file, hint_language_code='fa')

def check_sentence(entities_list, sentence):
    """Return True if every term occurs in some entity of the sentence."""
    for term in entities_list:
        # Compare the term with each word of each Chunk in the sentence.
        if not any(any(entity_term == term for entity_term in entity_object)
                   for entity_object in sentence.entities):
            return False
    return True

sentence_number = 1 # Which sentence to check (0-based, so 1 is the second)
sentence = text.sentences[sentence_number]
entity_terms = ["Bill",
                "Gates"]

if check_sentence(entity_terms, sentence):
    print("Entity Terms " + str(entity_terms) +
          " are in the sentence. '" + str(sentence) + "'")
else:
    print("Sentence '" + str(sentence) +
          "' doesn't contain terms " + str(entity_terms))

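Once you find a way to create arbitrary entities, all you have to do is stop unpacking the terms in the sentence checker and compare whole entities, so that the type is compared as well.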
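
If the entity type matters too, one option is to describe each wanted entity as a tag plus a list of words and compare both parts. This is only a sketch under assumptions: it relies on each entity being iterable over its words and carrying a `tag` attribute such as `I-PER` (as the `list_entity` output in the question suggests). `FakeChunk`, `has_entity`, and the sample data are hypothetical stand-ins so that the code runs without polyglot.

```python
from types import SimpleNamespace


class FakeChunk(list):
    """Stand-in for an entity: a list of words plus a `tag` attribute."""

    def __init__(self, tag, words):
        super().__init__(words)
        self.tag = tag


def has_entity(sentence, tag, words):
    """Return True if the sentence has an entity with this tag and these words."""
    return any(
        entity.tag == tag and list(entity) == list(words)
        for entity in sentence.entities
    )


# Hypothetical sentence holding the two entities from the question.
sentence = SimpleNamespace(entities=[
    FakeChunk("I-PER", ["Trump"]),
    FakeChunk("I-LOC", ["USA"]),
])

print(has_entity(sentence, "I-PER", ["Trump"]))  # True
print(has_entity(sentence, "I-LOC", ["Trump"]))  # False: same word, different tag
```

Comparing the tag explicitly avoids relying on how the entity class itself defines equality.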


If you just want to match the list of entities from the file against a specific sentence, this should do it:

from polyglot.text import Text

with open('input_raw.txt', 'r') as f:
    input_file = f.read()
text = Text(input_file, hint_language_code='fa')

def return_match(entities_list, sentence):
    """Return the words of every listed entity that occurs in the sentence."""
    matches = []
    for term in entities_list:
        # Compare each known entity with each Chunk of the sentence.
        for entity in sentence.entities:
            if entity == term:
                for word in entity:
                    matches.append(word)
    return matches

def return_list_of_entities(text):
    """Collect the entities of every sentence in the text."""
    list_entity = []
    for sentence in text.sentences:
        for entity in sentence.entities:
            list_entity.append(entity)
    return list_entity

list_entity = return_list_of_entities(text)
sentence_number = 1 # Which sentence to check (0-based, so 1 is the second)
sentence = text.sentences[sentence_number]
match = return_match(list_entity, sentence)

if match:
    print("Entity Term " + str(match) +
          " is in the sentence. '" + str(sentence) + "'")
else:
    print("Sentence '" + str(sentence) +
          "' doesn't contain any of the terms " + str(list_entity))