Question

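I am new to NLTK and still fairly new to Python. I would like to train and test NLTK's Perceptron tagger on my own dataset. The training and testing data have the following format (they are simply saved in a txt file):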

Pierre  NNP
Vinken  NNP
,       ,
61      CD
years   NNS
old     JJ
,       ,
will    MD
join    VB
the     DT
board   NN
as      IN
a       DT
nonexecutive    JJ
director        NN
Nov.    NNP
29      CD
.       .

I would like to call these functions on the data:

perceptron_tagger = nltk.tag.perceptron.PerceptronTagger(load=False)
perceptron_tagger.train(train_data)
accuracy = perceptron_tagger.evaluate(test_data)

I have tried a few things, but I cannot figure out what format the data is expected to be in. Any help would be appreciated! Thanks

Answer 1

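The train() and evaluate() functions of PerceptronTagger take a list of lists of tuples as input: each inner list is one sentence, and each tuple is a (word, tag) pair of strings.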
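To make that shape concrete, here is a minimal sketch that builds it by hand from the first tokens of the question's sample. The name train_data follows the question; has_expected_shape is a hypothetical helper, not part of NLTK, and the split into two sentences is artificial and only for illustration.

```python
# One sentence = one list of (word, tag) tuples; the dataset = a list of sentences.
train_data = [
    [("Pierre", "NNP"), ("Vinken", "NNP"), (",", ","), ("61", "CD"),
     ("years", "NNS"), ("old", "JJ")],
    [("will", "MD"), ("join", "VB"), ("the", "DT"), ("board", "NN"), (".", ".")],
]


def has_expected_shape(data):
    """Return True if data is a list of lists of (str, str) tuples."""
    return isinstance(data, list) and all(
        isinstance(sent, list)
        and all(
            isinstance(tok, tuple)
            and len(tok) == 2
            and all(isinstance(part, str) for part in tok)
            for tok in sent
        )
        for sent in data
    )


print(has_expected_shape(train_data))
```

A flat list of tuples (one long sentence without the outer list) fails this check, and that is likely the most common mistake when preparing the data.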

Given train.txt and test.txt in CoNLL format (one tab-separated token and tag per line, with a blank line between sentences):

$ cat train.txt 
This foo
is  foo
a   foo
sentence    bar
.   .

That    foo
is  foo
another foo
sentence    bar
in  foo
conll   bar
format  bar
.   .

$ cat test.txt 
What    foo
is  foo
this    foo
sentence    bar
?   ?

How foo
about   foo
that    foo
sentence    bar
?   ?
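Both files share the same layout, so one small parser can cover them. As a hedged sketch (read_tagged_sentences is a hypothetical name, not an NLTK function), the version below splits each line on any run of whitespace, which may help if a file uses spaces instead of tabs for alignment, as the question's excerpt appears to. It assumes every non-blank line holds a token and a tag, and it will raise a ValueError on a line with only one field.

```python
def read_tagged_sentences(lines):
    """Group 'word<whitespace>tag' lines into sentences, split on blank lines."""
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        if not line:
            # A blank line closes the current sentence, if there is one.
            if current:
                sentences.append(current)
                current = []
            continue
        # Split on the last run of whitespace: left part is the word, right the tag.
        word, tag = line.rsplit(None, 1)
        current.append((word, tag))
    if current:
        sentences.append(current)
    return sentences


# Works on any iterable of lines, including an open file object.
sample = ["This\tfoo\n", "is  foo\n", ".   .\n", "\n", "What    foo\n", "?\t?\n"]
print(read_tagged_sentences(sample))
```

For a real file this could be called as `read_tagged_sentences(open('train.txt', encoding='utf-8'))`, ideally inside a `with` block. If a file has no blank lines between sentences, everything ends up in a single sentence.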

Read the CoNLL-format files into lists of tuples; after that you can train/evaluate the tagger:

# Using https://github.com/alvations/lazyme
>>> from lazyme import per_section
>>> tagged_train_sentences = [[tuple(token.split('\t')) for token in sent] for sent in per_section(open('train.txt'))]

# Or otherwise, define per_section yourself:

>>> def per_section(it, is_delimiter=lambda x: x.isspace()):
...     """
...     From http://stackoverflow.com/a/25226944/610569
...     """
...     ret = []
...     for line in it:
...         if is_delimiter(line):
...             if ret:
...                 yield ret  # OR  ''.join(ret)
...                 ret = []
...         else:
...             ret.append(line.rstrip())  # OR  ret.append(line)
...     if ret:
...         yield ret
... 
>>> 
>>> tagged_test_sentences = [[tuple(token.split('\t')) for token in sent] for sent in per_section(open('test.txt'))]
>>> tagged_test_sentences
[[('What', 'foo'), ('is', 'foo'), ('this', 'foo'), ('sentence', 'bar'), ('?', '?')], [('How', 'foo'), ('about', 'foo'), ('that', 'foo'), ('sentence', 'bar'), ('?', '?')]]
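With both lists in this shape, training and scoring can be sketched as follows. This is only an outline under a few assumptions: train_and_score is a hypothetical wrapper, nr_iter=5 mirrors what I believe is the default of train(), and the fallback between accuracy() and evaluate() is there because newer NLTK releases deprecate evaluate() in favour of accuracy().

```python
def train_and_score(train_sents, test_sents, nr_iter=5):
    """Train a fresh PerceptronTagger and return (tagger, accuracy on test_sents)."""
    # Imported inside the function so the helper can be defined before NLTK is installed.
    from nltk.tag.perceptron import PerceptronTagger

    # load=False starts from an untrained model instead of the pretrained English one.
    tagger = PerceptronTagger(load=False)
    tagger.train(train_sents, nr_iter=nr_iter)

    # Newer NLTK versions call the scoring method accuracy(); older ones use evaluate().
    score = getattr(tagger, "accuracy", None) or tagger.evaluate
    return tagger, score(test_sents)
```

Usage would look like `tagger, acc = train_and_score(tagged_train_sentences, tagged_test_sentences)`, after which `tagger.tag('Where do I find a foo bar sentence ?'.split())` tags a new tokenized sentence. With a toy dataset this small, the accuracy figure says little about real performance.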

Reading my own dataset for NLTK part-of-speech tagging using PerceptronTagger
