I have a dictionary in which each key is a sentence and each value is a list of specific words or phrases from that sentence.
For example:
dict1 = {'it is lovely weather and it is kind of warm':['lovely weather', 'it is kind of warm'],'and the weather is rainy and cold':['rainy and cold'],'the temperature is ok':['temperature']}
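I want to tag the words of each sentence according to whether they belong to a phrase in the dictionary value.
For this example, the output would be (0 means not in the values, 1 means in the values):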
*
it 0
is 0
lovely weather 1 (combined because it's a phrase)
and 0
it is kind of warm 1 (combined because it's a phrase)
*
and 0
the 0
weather 0
is 0
rainy and cold 1 (combined because it's a phrase)
...(and so on)...
I can get something like the following to work, but only by hard-coding the number of words in the phrase:
for k, v in dict1.items():
    for phrase in v:  # each value is a list of phrases
        words_in_val = phrase.split()
        words = k.split()
        if len(words_in_val) == 1:
            for each_word in words:
                if phrase == each_word:
                    print(each_word + '\t' + '1')
                else:
                    print(each_word + '\t' + '0')
        if len(words_in_val) == 2:
            for index, item in enumerate(words[:-1]):
                if words[index] == words_in_val[0]:
                    if words[index+1] == words_in_val[1]:
                        # merge the two words into a single phrase entry
                        words[index] = ' '.join(words_in_val)
                        del words[index+1]
....something like this...
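My problem is that I can see this getting messy, and in theory the phrases I want to match could contain any number of words, although usually fewer than 10.
Does anyone have a better idea?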
Answer 0: (score: 0)
This is how I would do it:
dict1 = {'it is lovely weather and it is kind of warm':['it is kind of', 'it is kind'],'and the weather is rainy and cold':['rainy and cold'],'the temperature is ok':['temperature']}

def tag_sentences(sentence_phrases):
    # Every matched phrase gets its own id, so adjacent phrases stay distinct.
    phrase_id = 1
    tagged_results = []
    for sentence, phrases in sentence_phrases.items():
        words = sentence.split()
        phrases_split = [phrase.split() for phrase in phrases]
        # For each phrase, how many of its leading words have matched so far.
        positions_keeper = {}
        sentence_results = [(word, 0) for word in words]
        for word_index, word in enumerate(words):
            for index, phrase in enumerate(phrases_split):
                position = positions_keeper.get(index, 0)
                if phrase[position] == word:
                    if len(phrase) > position + 1:
                        positions_keeper[index] = position + 1
                    else:
                        # Whole phrase matched: tag its words, walking backwards.
                        for i in range(len(phrase)):
                            sentence_results[word_index - i] = (sentence_results[word_index - i][0], phrase_id)
                        phrase_id += 1
                        # Reset so the same phrase can match again later.
                        positions_keeper[index] = 0
                else:
                    positions_keeper[index] = 0
        tagged_results.append(sentence_results)
    return tagged_results

def print_tagged_results(tagged_results):
    for tagged_result in tagged_results:
        memory = 0  # id of the previous word (0 = not in a phrase)
        memory_sentence = ""
        for result, phrase_id in tagged_result:
            # The id changed, so the phrase being collected is complete.
            if memory != 0 and memory != phrase_id:
                print(memory_sentence + "1")
                memory_sentence = ""
            if phrase_id == 0:
                print(result, 0)
            else:
                memory_sentence += result + " "
            memory = phrase_id
        # Flush a phrase that ends the sentence.
        if memory != 0:
            print(memory_sentence + "1")

tagged_results = tag_sentences(dict1)
print_tagged_results(tagged_results)
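This basically does the following: every word starts out tagged 0, and the words of each matched phrase are then re-tagged with that phrase's id: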
[(it, 0), (is, 0), (lovely, 0) ...]
[(it, 0), (is, 0), (lovely, 1) ... (kind,2), (of, 2), ...]
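The collapsing step that print_tagged_results performs can also be sketched with itertools.groupby. This is only an illustrative alternative, not part of the answer's code: the helper name collapse_tagged is made up here, and it assumes the input is a list of (word, id) pairs like the ones shown above, with 0 meaning "not part of any phrase".

```python
from itertools import groupby


def collapse_tagged(tagged_sentence):
    """Turn [(word, phrase_id), ...] into [(text, flag), ...] rows.

    Hypothetical helper: consecutive words sharing a non-zero id are
    joined into one phrase row flagged 1; words with id 0 stay separate.
    """
    rows = []
    for phrase_id, group in groupby(tagged_sentence, key=lambda pair: pair[1]):
        words = [word for word, _ in group]
        if phrase_id == 0:
            # Untagged words are emitted one per row.
            rows.extend((word, 0) for word in words)
        else:
            # A run of words with the same id forms one phrase.
            rows.append((" ".join(words), 1))
    return rows


tagged = [("it", 0), ("is", 0), ("lovely", 1), ("weather", 1), ("and", 0)]
for text, flag in collapse_tagged(tagged):
    print(text, flag)
```

Because each matched phrase carries its own id, two phrases that sit next to each other still come out as two separate rows.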
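The tag_sentences approach may not work correctly if one phrase is a sub-phrase of another, but your example never says how that case should be handled.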
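One possible way to handle that case, as a rough sketch rather than a definitive fix: scan each sentence left to right and, at every position, try the longest phrase first, so a phrase that contains another always wins. The function name tag_longest_match is hypothetical, and the longest-first policy is an assumption, since the question does not say which phrase should take priority.

```python
def tag_longest_match(sentence, phrases):
    """Return [(text, flag), ...] rows for one sentence.

    Assumed policy: at each word position the longest matching phrase
    wins; matched phrases are flagged 1, all other words 0.
    """
    words = sentence.split()
    # Longest phrases first, so a containing phrase beats its sub-phrases.
    candidates = sorted((p.split() for p in phrases), key=len, reverse=True)
    rows = []
    i = 0
    while i < len(words):
        for cand in candidates:
            # Skip empty phrases; compare a slice of the same length.
            if cand and words[i:i + len(cand)] == cand:
                rows.append((" ".join(cand), 1))
                i += len(cand)
                break
        else:
            # No phrase starts at this word.
            rows.append((words[i], 0))
            i += 1
    return rows


dict1 = {
    'it is lovely weather and it is kind of warm': ['lovely weather', 'it is kind of warm'],
    'and the weather is rainy and cold': ['rainy and cold'],
    'the temperature is ok': ['temperature'],
}
for sentence, phrases in dict1.items():
    print('*')
    for text, flag in tag_longest_match(sentence, phrases):
        print(text, flag)
```

This works for phrases of any length without hard-coding the word count. It re-checks every phrase at every position, which should be fine for short sentences and phrases of fewer than 10 words.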