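I'm new to NLTK and still fairly new to Python. I want to train and test NLTK's Perceptron tagger on my own dataset. The training and test data have the following format (each is simply saved in a txt file):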
Pierre NNP
Vinken NNP
, ,
61 CD
years NNS
old JJ
, ,
will MD
join VB
the DT
board NN
as IN
a DT
nonexecutive JJ
director NN
Nov. NNP
29 CD
. .
I want to call these functions on the data:
perceptron_tagger = nltk.tag.perceptron.PerceptronTagger(load=False)
perceptron_tagger.train(train_data)
accuracy = perceptron_tagger.evaluate(test_data)
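I have tried a few things, but I can't figure out what format the data is expected to be in. Any help would be appreciated! Thanks
Answer 0 (score: 2)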
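The train() and evaluate() functions of PerceptronTagger take a list of lists of tuples as input, where each inner list is one sentence and each tuple is a (word, tag) pair of strings.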
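To make the expected shape concrete, here is a minimal sketch in plain Python. The helper `is_tagged_corpus` is a hypothetical name introduced only for illustration and is not part of NLTK; the sample sentences are taken from the question and from the toy files in this answer.

```python
# Two sentences in the shape described above:
# a list of sentences, each a list of (word, tag) string pairs.
train_data = [
    [("Pierre", "NNP"), ("Vinken", "NNP"), (",", ","), ("61", "CD"),
     ("years", "NNS"), ("old", "JJ")],
    [("This", "foo"), ("is", "foo"), ("a", "foo"), ("sentence", "bar"),
     (".", ".")],
]


def is_tagged_corpus(data):
    """Hypothetical helper (not an NLTK function): return True if `data`
    is a list of lists of 2-tuples of strings."""
    return isinstance(data, list) and all(
        isinstance(sent, list)
        and all(
            isinstance(pair, tuple)
            and len(pair) == 2
            and all(isinstance(item, str) for item in pair)
            for pair in sent
        )
        for sent in data
    )


print(is_tagged_corpus(train_data))       # nested list of sentences
print(is_tagged_corpus([("Pierre", "NNP")]))  # flat list of tuples: wrong shape
```

A flat list of (word, tag) tuples, with no sentence level, is a common mistake; a check like this catches it before the data reaches the tagger.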
Given train.txt and test.txt in CoNLL format (one token and its tag per line, separated by a tab, with a blank line between sentences):
$ cat train.txt
This foo
is foo
a foo
sentence bar
. .

That foo
is foo
another foo
sentence bar
in foo
conll bar
format bar
. .
$ cat test.txt
What foo
is foo
this foo
sentence bar
? ?

How foo
about foo
that foo
sentence bar
? ?
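Read the CoNLL-format files into lists of lists of tuples, which is the input you need to train and evaluate the tagger: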
# Using https://github.com/alvations/lazyme
>>> from lazyme import per_section
>>> tagged_train_sentences = [[tuple(token.split('\t')) for token in sent] for sent in per_section(open('train.txt'))]
# Or otherwise
>>> def per_section(it, is_delimiter=lambda x: x.isspace()):
...     """
...     From http://stackoverflow.com/a/25226944/610569
...     """
...     ret = []
...     for line in it:
...         if is_delimiter(line):
...             if ret:
...                 yield ret  # OR ''.join(ret)
...                 ret = []
...         else:
...             ret.append(line.rstrip())  # OR ret.append(line)
...     if ret:
...         yield ret
...
>>>
>>> tagged_test_sentences = [[tuple(token.split('\t')) for token in sent] for sent in per_section(open('test.txt'))]
>>> tagged_test_sentences
[[('What', 'foo'), ('is', 'foo'), ('this', 'foo'), ('sentence', 'bar'), ('?', '?')], [('How', 'foo'), ('about', 'foo'), ('that', 'foo'), ('sentence', 'bar'), ('?', '?')]]
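With the data in that shape, training and evaluation could look roughly like the sketch below. It is self-contained, so it parses the same toy sentences from in-memory strings instead of files. `read_tagged_sentences`, `TRAIN_TEXT` and `TEST_TEXT` are hypothetical names used only here, and the parser splits on any whitespace (an assumption, so that space-separated files like the one in the question also work). As far as I know, newer NLTK releases provide accuracy() and deprecate evaluate(), so the sketch falls back to evaluate() when accuracy() is missing.

```python
from nltk.tag.perceptron import PerceptronTagger

# Toy data copied from train.txt / test.txt above (hypothetical in-memory copies).
TRAIN_TEXT = """\
This foo
is foo
a foo
sentence bar
. .

That foo
is foo
another foo
sentence bar
in foo
conll bar
format bar
. .
"""

TEST_TEXT = """\
What foo
is foo
this foo
sentence bar
? ?

How foo
about foo
that foo
sentence bar
? ?
"""


def read_tagged_sentences(lines):
    """Hypothetical helper: group 'token tag' lines into sentences.

    A blank line ends a sentence. Each non-blank line must hold exactly
    two whitespace-separated fields (tab or space).
    """
    sentences, current = [], []
    for line in lines:
        if line.strip():
            token, tag = line.split()
            current.append((token, tag))
        elif current:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


train_sents = read_tagged_sentences(TRAIN_TEXT.splitlines())
test_sents = read_tagged_sentences(TEST_TEXT.splitlines())

# load=False starts from an empty model instead of the pretrained one.
tagger = PerceptronTagger(load=False)
tagger.train(train_sents)

# Tag an unseen, already tokenized sentence.
predicted = tagger.tag("Where do I find a foo bar sentence ?".split())
print(predicted)

# Assumption: accuracy() exists in newer NLTK; older releases only have evaluate().
if hasattr(tagger, "accuracy"):
    score = tagger.accuracy(test_sents)
else:
    score = tagger.evaluate(test_sents)
print(score)
```

To read from real files, pass an open file object instead of the split string, for example `with open('train.txt') as f: train_sents = read_tagged_sentences(f)`. With a toy corpus this small the score is not meaningful; it only confirms that the data format is accepted.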