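I have a tagged text. It contains some incorrectly tagged words, so I created a basic rule-based tagger for the words the original tagger could not tag. I want to replace only the incorrectly tagged words in the tagged text. The tagged text has this format:
il/P ragazzo/V vuole/V andare/V a/P scuola/V
The correct tags have this format: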
[(u'porta', 'NN'), (u'scuola', 'NN'), (u'ragazzo', 'NN')]
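The output should be:
il/P ragazzo/NN vuole/V andare/V a/P scuola/NN
I tried creating two dictionaries, one for the tagged text and one for the correct tags, and replacing the value whenever the keys matched. However, the dictionary does not preserve the original word order of the text, so the output came out unordered. Does anyone know how to replace the incorrectly tagged words in the original text? Thanks.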
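Answer 0 (score: 1)
You can use a dictionary for the tags only, then convert the input to the output in a loop over the tokens, which preserves the order: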
text = 'il/P ragazzo/V vuole/V andare/V a/P scuola/V'
rules = [(u'porta', 'NN'), (u'scuola', 'NN'), (u'ragazzo', 'NN')]
# Map each corrected word to its tag for fast lookup
rules_dict = {rule[0]: rule[1] for rule in rules}
parts = []
# Iterate over the tokens in their original order
for token in text.split():
    word, tag = token.split('/')
    if word in rules_dict:
        # Replace the tag with the corrected one
        parts.append(word + '/' + rules_dict[word])
    else:
        # Keep the token unchanged
        parts.append(token)
output = ' '.join(parts)
print(output)
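The same loop can be wrapped in a reusable function. This is only a sketch: the name `retag` and its parameters are hypothetical and not part of the answer above. It assumes the tag is whatever follows the last `/` in a token, so a word that itself contains a slash is still split correctly, and it leaves tokens without any slash untouched.

```python
def retag(tagged_text, corrections):
    """Return tagged_text with the tags of corrected words replaced.

    tagged_text: string of space-separated word/TAG tokens.
    corrections: iterable of (word, tag) pairs holding the correct tags.
    """
    lookup = dict(corrections)
    parts = []
    for token in tagged_text.split():
        # Split on the LAST slash only (assumption: the tag never contains '/')
        word, sep, tag = token.rpartition('/')
        if not sep:
            # No slash at all: keep the token as it is
            parts.append(token)
            continue
        # Use the corrected tag if there is one, otherwise the original tag
        parts.append(word + '/' + lookup.get(word, tag))
    return ' '.join(parts)


text = 'il/P ragazzo/V vuole/V andare/V a/P scuola/V'
corrections = [('porta', 'NN'), ('scuola', 'NN'), ('ragazzo', 'NN')]
print(retag(text, corrections))
# il/P ragazzo/NN vuole/V andare/V a/P scuola/NN
```

Because the loop walks the tokens of the text rather than the keys of a dictionary, the word order of the original sentence is preserved regardless of how the dictionary is ordered.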
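Answer 1 (score: 0)
You can use the nltk.str2tuple function to convert the tagged string into a list of tuples, then walk through that list: if the word of a tuple (i, j) also appears as the first element of an entry in correct_tag_list (named l in the code below), take that entry (k) from correct_tag_list; otherwise keep the tuple (i, j) from the first list itself: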
>>> from nltk.tag.util import str2tuple
>>> s1=[(unicode(i),j) for i,j in [str2tuple(i) for i in s.split()]]
>>> l_first=[i[0] for i in l]
>>> [tuple(k for k in l if i==k[0])[0] if i in l_first else (i,j) for i,j in s1]
[(u'il', 'P'), (u'ragazzo', 'NN'), (u'vuole', 'V'), (u'andare', 'V'), (u'a', 'P'), (u'scuola', 'NN')]
Demo:
>>> s="il/P ragazzo/V vuole/V andare/V a/P scuola/V"
>>> l=[(u'porta', 'NN'), (u'scuola', 'NN'), (u'ragazzo', 'NN')]
>>> from nltk.tag.util import str2tuple
>>> [str2tuple(i) for i in s.split()]
[('il', 'P'), ('ragazzo', 'V'), ('vuole', 'V'), ('andare', 'V'), ('a', 'P'), ('scuola', 'V')]
>>> s1=[(unicode(i),j) for i,j in [str2tuple(i) for i in s.split()]]
>>> s1
[(u'il', 'P'), (u'ragazzo', 'V'), (u'vuole', 'V'), (u'andare', 'V'), (u'a', 'P'), (u'scuola', 'V')]
>>> l_first=[i[0] for i in l]
>>> l_first
[u'porta', u'scuola', u'ragazzo']
>>> [tuple(k for k in l if i==k[0])[0] if i in l_first else (i,j) for i,j in s1]
[(u'il', 'P'), (u'ragazzo', 'NN'), (u'vuole', 'V'), (u'andare', 'V'), (u'a', 'P'), (u'scuola', 'NN')]
If you don't want to use nltk.str2tuple, you can convert the string with split() instead:
>>> [tuple(i.split('/')) for i in s.split()]
[('il', 'P'), ('ragazzo', 'V'), ('vuole', 'V'), ('andare', 'V'), ('a', 'P'), ('scuola', 'V')]
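The session above is Python 2 (it relies on `unicode`). As a hedged sketch of the same idea in Python 3 without nltk, the correction list can be turned into a dictionary so that each word is looked up directly instead of scanning `l` for every token. The names `correct`, `pairs` and `fixed` are introduced here for illustration only.

```python
s = "il/P ragazzo/V vuole/V andare/V a/P scuola/V"
l = [('porta', 'NN'), ('scuola', 'NN'), ('ragazzo', 'NN')]

# Word -> corrected tag
correct = dict(l)

# Split each token on its last slash into a (word, tag) tuple
pairs = [tuple(tok.rsplit('/', 1)) for tok in s.split()]

# Keep the original order; swap in the corrected tag where one exists
fixed = [(w, correct.get(w, t)) for w, t in pairs]
print(fixed)
# [('il', 'P'), ('ragazzo', 'NN'), ('vuole', 'V'), ('andare', 'V'), ('a', 'P'), ('scuola', 'NN')]

# Back to the word/TAG string format used in the question
print(' '.join(w + '/' + t for w, t in fixed))
# il/P ragazzo/NN vuole/V andare/V a/P scuola/NN
```

This assumes every token contains at least one `/`; a token without a slash would produce a one-element tuple and break the unpacking.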