Question

I have a dictionary in which each key is a sentence and each value is a list of specific words or phrases from that sentence.

For example:

dict1 = {'it is lovely weather and it is kind of warm':['lovely weather', 'it is kind of warm'],'and the weather is rainy and cold':['rainy and cold'],'the temperature is ok':['temperature']}

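I want to tag each sentence in the output according to whether each word or phrase appears in the dictionary value.

For this example, the output would be (0 means not in the values, 1 means in the values):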

*
it 0
is 0
lovely weather 1 (combined because it's a phrase)
and 0
it is kind of warm 1 (combined because it's a phrase)
*
and 0
the 0
weather 0
is 0
rainy and cold 1 (combined because it's a phrase)
...(and so on)...

I can get something like the following to work, but only by hard-coding the number of words in the phrase:

for k, v in dict1.items():
    words = k.split()
    for phrase in v:  # each value is a list of phrases
        words_in_val = phrase.split()
        # Case 1: the phrase is a single word
        if len(words_in_val) == 1:
            for each_word in words:
                if phrase == each_word:
                    print(each_word + '\t' + '1')
                else:
                    print(each_word + '\t' + '0')
        # Case 2: the phrase is exactly two words
        if len(words_in_val) == 2:
            for index, item in enumerate(words[:-1]):
                if words[index] == words_in_val[0]:
                    if words[index + 1] == words_in_val[1]:
                        words[index] = ' '.join(words_in_val)
                        del words[index + 1]
                        # ....something like this...

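My problem is that I can see this starting to get messy, and in theory the phrases I want to match can contain any number of words, although usually fewer than 10.

Does anyone have a better idea for this?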

Answer 1

This is how I would do it:

dict1 = {'it is lovely weather and it is kind of warm':['it is kind of', 'it is kind'],'and the weather is rainy and cold':['rainy and cold'],'the temperature is ok':['temperature']}

def tag_sentences(sentences):
    phrase_id = 1
    tagged_results = []
    for sentence, phrases in sentences.items():
        words = sentence.split()
        phrases_split = [phrase.split() for phrase in phrases]
        # positions_keeper[i] = how many words of phrase i have matched so far
        positions_keeper = {}
        sentence_results = [(word, 0) for word in words]
        for word_index, word in enumerate(words):
            for index, phrase in enumerate(phrases_split):
                position = positions_keeper.get(index, 0)
                if phrase[position] != word:
                    # Partial match broke: restart, letting this word begin a new attempt
                    position = 0
                if phrase[position] == word:
                    if len(phrase) > position + 1:
                        positions_keeper[index] = position + 1
                    else:
                        # Phrase complete: tag this word and the ones matched before it
                        for i in range(len(phrase)):
                            sentence_results[word_index - i] = (sentence_results[word_index - i][0], phrase_id)
                        phrase_id += 1
                        positions_keeper[index] = 0
                else:
                    positions_keeper[index] = 0
        tagged_results.append(sentence_results)
    return tagged_results

def print_tagged_results(tagged_results):
    for tagged_result in tagged_results:
        memory = 0            # tag of the previous word
        memory_sentence = ""  # words of the phrase collected so far
        for result, tag in tagged_result:
            # The tag changed, so the phrase being collected has ended
            if memory != 0 and memory != tag:
                print(memory_sentence + "1")
                memory_sentence = ""
            if tag == 0:
                print(result, 0)
            else:
                memory_sentence += result + " "
            memory = tag
        # Flush a phrase that ends the sentence
        if memory != 0:
            print(memory_sentence + "1")

tagged_results = tag_sentences(dict1)
print_tagged_results(tagged_results)

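This is basically doing the following:

First, I start with a tagged list of the form [(it, 0), (is, 0), (lovely, 0) ...]
In the tagged list, 0 means the word is not part of any phrase, while any other integer groups words together (words tagged 1 belong together, words tagged 2 belong together)
I go through each word and check whether it matches the start of a phrase, or the current position in a phrase that I am already in the middle of matching
If it is the end of the phrase, I tag that word and all the previously matched words with the same ID
If it is not the end yet, I keep the position and move on to the next iteration.
In the end, I have a tagged list of the form [(it, 0), (is, 0), (lovely, 1) ... (kind, 2), (of, 2), ...]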
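
As a side note, the final step (collapsing the tagged list into output rows) could also be sketched with itertools.groupby from the standard library. This is only an illustration under the assumption that words of the same phrase are adjacent and share a tag; the helper name collapse_tagged and the sample list are made up for this example and are not part of the code above.

```python
from itertools import groupby

def collapse_tagged(tagged):
    """Turn [(word, tag), ...] into output rows [(text, flag), ...].

    Hypothetical helper: a tag of 0 means "not in a phrase"; consecutive
    words sharing the same non-zero tag are joined into one row.
    """
    rows = []
    # groupby yields runs of consecutive pairs that share the same tag
    for tag, group in groupby(tagged, key=lambda pair: pair[1]):
        words = [word for word, _ in group]
        if tag == 0:
            rows.extend((word, 0) for word in words)
        else:
            rows.append((" ".join(words), 1))
    return rows

# Sample data invented for the illustration
tagged = [("it", 0), ("is", 0), ("lovely", 1), ("weather", 1), ("and", 0)]
for text, flag in collapse_tagged(tagged):
    print(text, flag)
```

Because groupby only merges neighbouring items, two different phrases that happen to sit next to each other stay separate as long as they carry different IDs.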

It won't work properly if one phrase is a sub-phrase of another, but your example never says how that case should be handled.
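
If overlapping phrases do matter, one possible policy is "longest phrase wins". The sketch below is a different technique from the code above: it compares list slices against each phrase, trying longer phrases first. The function name tag_sentence and the longest-first rule are assumptions made for illustration, not something the question specifies.

```python
def tag_sentence(sentence, phrases):
    """Return [(text, flag), ...] rows for one sentence.

    Assumption: when several phrases match at the same position,
    the longest one is preferred.
    """
    words = sentence.split()
    # Longest phrases first, so a sub-phrase never shadows its container
    candidates = sorted((p.split() for p in phrases), key=len, reverse=True)
    rows = []
    i = 0
    while i < len(words):
        for cand in candidates:
            # Slice comparison works for phrases of any length
            if cand and words[i:i + len(cand)] == cand:
                rows.append((" ".join(cand), 1))
                i += len(cand)
                break
        else:
            # No phrase starts at this word
            rows.append((words[i], 0))
            i += 1
    return rows

sentence = 'it is lovely weather and it is kind of warm'
for text, flag in tag_sentence(sentence, ['it is kind of', 'it is kind']):
    print(text, flag)
```

With the two overlapping phrases above, 'it is kind of' is emitted as a single row and 'it is kind' is never reported separately.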

Python: match phrases in dictionary values to sentences (dictionary keys) and produce output based on the matches
