Updating tagged words in a text

Time: 2014-11-27 23:17:44

Tags: python python-2.7

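I have a tagged text in which some words carry the wrong tag. To deal with this, I wrote a basic rule-based tagger for the words the original tagger could not tag correctly. Now I want to replace only the incorrectly tagged words in the tagged text. The tagged text has this format: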

il/P ragazzo/V vuole/V andare/V a/P scuola/V

The correct tags have this format:

[(u'porta', 'NN'), (u'scuola', 'NN'), (u'ragazzo', 'NN')]

The output would be:

il/P ragazzo/NN vuole/V andare/V a/P scuola/NN

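I tried creating two dictionaries, one for the tagged text and one for the correct tags, and then replacing the value whenever the keys matched. However, a dictionary does not preserve the original word order of the text, so the output came out unordered. Does anyone know how to replace the incorrectly tagged words in the original text? Thanks.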

2 answers:

Answer 0: (score: 1)

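You can use a dictionary for the correct tags only, and then convert the input to the output in a loop over the tokens, which preserves the order: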

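text = 'il/P ragazzo/V vuole/V andare/V a/P scuola/V'
rules = [(u'porta', 'NN'), (u'scuola', 'NN'), (u'ragazzo', 'NN')]

# Map each word to its corrected tag.
rules_dict = {rule[0]: rule[1] for rule in rules}

parts = []
# Walk the tokens in their original order so the word order is preserved.
for token in text.split():
    word, tag = token.split('/')
    if word in rules_dict:
        # A corrected tag exists for this word: use it.
        parts.append(word + '/' + rules_dict[word])
    else:
        # No correction: keep the token unchanged.
        parts.append(token)

output = ' '.join(parts)
print(output)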
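
The loop above can also be wrapped in a reusable function. The following is only a sketch, not part of the original answer: the name `retag` is hypothetical, and passing untagged tokens through unchanged is an assumption. It uses `rsplit('/', 1)` so that a word which itself contains a slash does not break the unpacking, and `dict.get` to fall back to the existing tag.

```python
def retag(text, corrections):
    """Return text with the tag of every word listed in corrections replaced.

    `retag` is a hypothetical helper name used for illustration only.
    """
    lookup = dict(corrections)  # word -> corrected tag
    parts = []
    for token in text.split():
        if '/' not in token:
            # Assumption: tokens without a tag are passed through unchanged.
            parts.append(token)
            continue
        # Split on the last '/' only, so 'e/o/V' gives ('e/o', 'V').
        word, tag = token.rsplit('/', 1)
        # Use the corrected tag if there is one, otherwise keep the old tag.
        parts.append(word + '/' + lookup.get(word, tag))
    return ' '.join(parts)


rules = [(u'porta', 'NN'), (u'scuola', 'NN'), (u'ragazzo', 'NN')]
result = retag('il/P ragazzo/V vuole/V andare/V a/P scuola/V', rules)
print(result)  # il/P ragazzo/NN vuole/V andare/V a/P scuola/NN
```

Because the tokens are processed in the order they appear in the text, the dictionary is only used for lookups and never determines the output order.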

Answer 1: (score: 0)

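You can use the nltk str2tuple function to convert the tagged string into a list of tuples, then walk through that list: if a word matches the first element of an entry in the list of correct tags (l), take that entry (k) from the list of correct tags; otherwise keep the element (i, j) from the first list itself: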

>>> from nltk.tag.util import str2tuple
>>> s1=[(unicode(i),j) for i,j in [str2tuple(i) for i in s.split()]]
>>> l_first=[i[0] for i in l]
>>> [tuple(k for k in l if i==k[0])[0] if i in l_first else (i,j) for i,j in s1]
[(u'il', 'P'), (u'ragazzo', 'NN'), (u'vuole', 'V'), (u'andare', 'V'), (u'a', 'P'), (u'scuola', 'NN')]

Demo:

>>> s="il/P ragazzo/V vuole/V andare/V a/P scuola/V"
>>> l=[(u'porta', 'NN'), (u'scuola', 'NN'), (u'ragazzo', 'NN')]
>>> from nltk.tag.util import str2tuple
>>> [str2tuple(i) for i in s.split()]
[('il', 'P'), ('ragazzo', 'V'), ('vuole', 'V'), ('andare', 'V'), ('a', 'P'), ('scuola', 'V')]
>>> s1=[(unicode(i),j) for i,j in [str2tuple(i) for i in s.split()]]
>>> s1
[(u'il', 'P'), (u'ragazzo', 'V'), (u'vuole', 'V'), (u'andare', 'V'), (u'a', 'P'), (u'scuola', 'V')]
>>> l_first=[i[0] for i in l]
>>> l_first
[u'porta', u'scuola', u'ragazzo']
>>> [tuple(k for k in l if i==k[0])[0] if i in l_first else (i,j) for i,j in s1]
[(u'il', 'P'), (u'ragazzo', 'NN'), (u'vuole', 'V'), (u'andare', 'V'), (u'a', 'P'), (u'scuola', 'NN')]

If you don't want to use nltk.str2tuple, you can convert the string with split() instead, using the following code:
>>> [tuple(i.split('/')) for i in s.split()]
[('il', 'P'), ('ragazzo', 'V'), ('vuole', 'V'), ('andare', 'V'), ('a', 'P'), ('scuola', 'V')]
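
Since the question asks for a tagged string rather than a list of tuples, the tuples still need to be joined back into `word/tag` form at the end. A minimal sketch of that last step follows; the variable names are illustrative, and the `corrected` list is copied from the result shown in this answer.

```python
s = "il/P ragazzo/V vuole/V andare/V a/P scuola/V"

# Parse without nltk: one (word, tag) tuple per whitespace-separated token.
pairs = [tuple(token.split('/')) for token in s.split()]

# Joining the tuples with '/' and the tokens with ' ' reverses the parsing.
round_trip = ' '.join(word + '/' + tag for word, tag in pairs)

# The corrected tuples from the answer, serialised the same way.
corrected = [(u'il', 'P'), (u'ragazzo', 'NN'), (u'vuole', 'V'),
             (u'andare', 'V'), (u'a', 'P'), (u'scuola', 'NN')]
output = ' '.join(word + '/' + tag for word, tag in corrected)
print(output)  # il/P ragazzo/NN vuole/V andare/V a/P scuola/NN
```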