Question

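I'm trying to create a small English-like language for specifying tasks. The basic idea is to split a statement into verbs and the noun phrases those verbs should apply to. I'm working with nltk but not getting the results I'd hoped for, for example: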

>>> nltk.pos_tag(nltk.word_tokenize("select the files and copy to harddrive'"))
[('select', 'NN'), ('the', 'DT'), ('files', 'NNS'), ('and', 'CC'), ('copy', 'VB'), ('to', 'TO'), ("harddrive'", 'NNP')]
>>> nltk.pos_tag(nltk.word_tokenize("move the files to harddrive'"))
[('move', 'NN'), ('the', 'DT'), ('files', 'NNS'), ('to', 'TO'), ("harddrive'", 'NNP')]
>>> nltk.pos_tag(nltk.word_tokenize("copy the files to harddrive'"))
[('copy', 'NN'), ('the', 'DT'), ('files', 'NNS'), ('to', 'TO'), ("harddrive'", 'NNP')]

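In each case it fails to recognise that the first word (select, move and copy) is intended as a verb. I know I can create custom taggers and grammars to work around this, but I'm hesitant to reinvent the wheel when a lot of this is out of my league. I would particularly prefer a solution that can also handle non-English languages.

So my question is one of: Is there a better tagger for this type of grammar? Is there a way to weight an existing tagger towards the verb form more often than it does now? Is there a way to train a tagger? Is there a better approach altogether?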

Answer 1

One solution is to create a manual UnigramTagger that backs off to the NLTK tagger. Something like this:

>>> import nltk.tag, nltk.data
>>> default_tagger = nltk.data.load(nltk.tag._POS_TAGGER)
>>> model = {'select': 'VB'}
>>> tagger = nltk.tag.UnigramTagger(model=model, backoff=default_tagger)

Then you get

>>> tagger.tag(['select', 'the', 'files'])
[('select', 'VB'), ('the', 'DT'), ('files', 'NNS')]

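The same method can work for non-English languages, as long as you have an appropriate default tagger. You can train your own taggers using train_tagger.py from nltk-trainer and an appropriate corpus.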
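
Since a task language usually has a small, known set of command verbs, the same idea can cover all of them at once. The following is a minimal sketch rather than part of the answer above: `COMMAND_VERBS` is a hypothetical verb list, and `nltk.DefaultTagger('NN')` is only a stand-in for the real default tagger so that the snippet runs without downloading a model.

```python
import nltk

# Hypothetical list of command verbs for the task language (an assumption).
COMMAND_VERBS = ['select', 'move', 'copy']

# Map every command verb to VB; all other words fall through to the backoff.
model = {verb: 'VB' for verb in COMMAND_VERBS}

# Stand-in backoff that tags everything NN. Replace it with a real
# default tagger to get sensible tags for the remaining words.
fallback = nltk.DefaultTagger('NN')
tagger = nltk.UnigramTagger(model=model, backoff=fallback)

tagged = tagger.tag('copy the files to harddrive'.split())
print(tagged)
```

Keep in mind that a unigram override is unconditional: `select` would be tagged VB even in a phrase like "the select function".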

Answer 2

Jacob's answer is spot on. However, to expand on it, you may find you need more than just unigrams.

For example, consider these three sentences:

select the files
use the select function on the sockets
the select was good

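Here, the word "select" is used as a verb, an adjective and a noun respectively. A unigram tagger cannot model this. Even a bigram tagger cannot handle it, because two of the cases share the same preceding word (namely "the"). You need a trigram tagger to handle this case correctly.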
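
The collision can be made concrete with a small pure-Python sketch. It assumes the lookup key of an n-gram tagger is the previous n-1 tags plus the current word, which is how NLTK's n-gram taggers build their context as far as I understand it; the helper names are hypothetical.

```python
def ngram_context(tokens, tags, index, n):
    """Return (previous n-1 tags, current word), the assumed n-gram lookup key."""
    start = max(0, index - n + 1)
    return (tuple(tags[start:index]), tokens[index])

# The three example sentences with the tags we would like to get.
examples = [
    (['select', 'the', 'files'], ['VB', 'DT', 'NNS']),
    (['use', 'the', 'select', 'function', 'on', 'the', 'sockets'],
     ['VB', 'DT', 'JJ', 'NN', 'IN', 'DT', 'NNS']),
    (['the', 'select', 'was', 'good'], ['DT', 'NN', 'VBD', 'JJ']),
]

def select_contexts(n):
    """Collect the context of 'select' in each example for an n-gram tagger."""
    contexts = []
    for tokens, tags in examples:
        index = tokens.index('select')
        contexts.append(ngram_context(tokens, tags, index, n))
    return contexts

for n in (1, 2, 3):
    print(n, select_contexts(n))
```

With n=1 all three contexts are identical, with n=2 the second and third sentences still share the key `(('DT',), 'select')`, and only with n=3 do the three usages get distinct keys. Note that the third sentence is told apart only because "select" sits right after the sentence start there.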

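import nltk.tag, nltk.data
from nltk import word_tokenize

# Note: nltk.tag._POS_TAGGER only exists in older NLTK releases.
default_tagger = nltk.data.load(nltk.tag._POS_TAGGER)

def evaluate(tagger, sentences):
    """Print the tags for each sentence and the fraction of checks that pass."""
    good, total = 0, 0.0
    for sentence, func in sentences:
        tags = tagger.tag(word_tokenize(sentence))
        print(tags)
        good += func(tags)  # func returns True when the expected tags are present
        total += 1
    print('Accuracy:', good / total)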

sentences = [
    ('select the files', lambda tags: ('select', 'VB') in tags),
    ('use the select function on the sockets', lambda tags: ('select', 'JJ') in tags and ('use', 'VB') in tags),
    ('the select was good', lambda tags: ('select', 'NN') in tags),
]

train_sents = [
    [('select', 'VB'), ('the', 'DT'), ('files', 'NNS')],
    [('use', 'VB'), ('the', 'DT'), ('select', 'JJ'), ('function', 'NN'), ('on', 'IN'), ('the', 'DT'), ('sockets', 'NNS')],
    [('the', 'DT'), ('select', 'NN'), ('files', 'NNS')],
]

tagger = nltk.TrigramTagger(train_sents, backoff=default_tagger)
evaluate(tagger, sentences)
#model = tagger._context_to_tag

Note that you can use NLTK's NgramTagger to train a tagger with an arbitrarily high n, but typically you don't gain much performance beyond trigrams.
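
As a rough illustration of that point, here is a sketch that stacks `NgramTagger` instances of increasing order, each backing off to the one below it. The training data is a toy set invented for this example, `build_backoff_chain` is a hypothetical helper, and `DefaultTagger('NN')` stands in for a real default tagger so that no model download is needed.

```python
from nltk.tag import DefaultTagger, NgramTagger

# Toy training data (an assumption for illustration), one sentence per usage.
train_sents = [
    [('select', 'VB'), ('the', 'DT'), ('files', 'NNS')],
    [('use', 'VB'), ('the', 'DT'), ('select', 'JJ'), ('function', 'NN')],
    [('the', 'DT'), ('select', 'NN'), ('was', 'VBD'), ('good', 'JJ')],
]

def build_backoff_chain(train, max_n=3, default_tag='NN'):
    """Train n-gram taggers of order 1..max_n, each backing off to the previous."""
    tagger = DefaultTagger(default_tag)
    for n in range(1, max_n + 1):
        tagger = NgramTagger(n, train, backoff=tagger)
    return tagger

tagger = build_backoff_chain(train_sents)

# Split on whitespace to avoid needing a tokenizer model.
for sentence in ('select the files',
                 'use the select function',
                 'the select was good'):
    print(tagger.tag(sentence.split()))
```

With only three training sentences the tagger simply memorizes them, so this shows the mechanics rather than real accuracy. In practice you would train on a tagged corpus and use a proper default tagger at the bottom of the chain.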

Answer 3

See Jacob's answer.

In later versions (at least nltk 3.2) nltk.tag._POS_TAGGER does not exist. The default taggers are usually downloaded into the nltk_data/taggers/ directory, for example:

>>> import nltk
>>> nltk.download('maxent_treebank_pos_tagger')

Usage is as follows.

>>> import nltk.tag, nltk.data
>>> tagger_path = '/path/to/nltk_data/taggers/maxent_treebank_pos_tagger/english.pickle'
>>> default_tagger = nltk.data.load(tagger_path)
>>> model = {'select': 'VB'}
>>> tagger = nltk.tag.UnigramTagger(model=model, backoff=default_tagger)

See also: How to do POS tagging using the NLTK POS tagger in Python.

Answer 4

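Bud's answer is correct. Also, according to this link,

if your nltk_data packages were installed correctly, then NLTK knows where they are on your system, and you don't need to pass an absolute path.

Meaning, you can just say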

tagger_path = 'taggers/maxent_treebank_pos_tagger/english.pickle'
default_tagger = nltk.data.load(tagger_path)
