Question

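I have been trying to use the Mallet Simple Tagger (http://mallet.cs.umass.edu/sequences.php) to learn a CRF model for POS tagging.

I am starting to get worried and confused, because my computer has now been learning this model for over a week. It does not appear to have hung, since it keeps producing output of the following form: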

...  
Punkte  NN->Puppenkönig NN(Puppenkönig  NN) Punkte  NN,Puppenkönig  NN  
Punkte  NN->Obere   NN(Obere    NN) Punkte  NN,Obere    NN  
Punkte  NN->Entfernung  NN(Entfernung   NN) Punkte  NN,Entfernung   NN  
...

So my question is: is it normal for Mallet to take this long, or has something gone wrong?

I used the command given on the web page:

hough@gobur:~/tagger-test$ java -cp  
 "/home/hough/mallet/class:/home/hough/mallet/lib/mallet-deps.jar"
 cc.mallet.fst.SimpleTagger
 --train true --model-file nouncrf  sample

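The training data contains 96,903 tokens.

Edit:
We suspect it may have something to do with the form of the input. The website specifies this form: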

Bill CAPITALIZED noun  
slept non-noun   
here LOWERCASE STOPWORD non-noun

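The documentation for SimpleTagger (http://mallet.cs.umass.edu/api/) states that each instance should be a separate block, delimited by blank lines. Although I am not sure what is meant by an instance, I assume the expected form is something like this: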

word pos  
word pos  
. $.  

word pos  
word pos  
word pos  
. $.  

word pos  
word pos    
. $.  

...

Is this the correct format? Perhaps someone has an example file that shows what the format should look like?
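
For anyone preparing a file in the block layout sketched above, here is a minimal Python sketch of the conversion. It rests on two assumptions that are not confirmed by the thread: that SimpleTagger expects the fields on a line to be separated by single spaces (worth checking against your MALLET version), and that the tag `$.` marks the end of a sentence, as in the example above. The function name `to_simpletagger_blocks` is made up for illustration.

```python
def to_simpletagger_blocks(lines, sentence_end_tag="$."):
    """Return token lines with single-space separators and a blank
    line closing every sentence (one sentence = one instance)."""
    out = []
    in_sentence = False
    for raw in lines:
        fields = raw.split()  # splits on tabs and on runs of spaces
        if not fields:
            # An existing blank line closes the current sentence once.
            if in_sentence:
                out.append("")
                in_sentence = False
            continue
        # Features first, label last, joined by exactly one space.
        out.append(" ".join(fields))
        in_sentence = True
        # Assumption: this tag marks sentence-final punctuation.
        if fields[-1] == sentence_end_tag:
            out.append("")
            in_sentence = False
    return out


if __name__ == "__main__":
    demo = ["Punkte\tNN", "Obere   NN", ".\t$.", "Entfernung NN", ". $."]
    print("\n".join(to_simpletagger_blocks(demo)))
```

The returned lines can be written to the training file with `"\n".join(...)`. Whether this layout is actually what SimpleTagger requires is exactly the open question here, so treat the output as a candidate format to test on a small sample first.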

Answer 1

A week for a 100k-token corpus seems far too long. I would estimate half an hour at most.
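
One way to check whether the input is the culprit is to count what the tagger would plausibly see. The cost of one training pass of a fully connected linear-chain CRF grows roughly with the number of tokens times the square of the number of labels, so a few dozen POS tags are cheap while thousands of distinct labels are not. The log lines in the question, such as `Punkte  NN->Puppenkönig NN`, might be transitions between states named after whole input lines, which would point to every line being read as a single label; that reading is a guess, not something verified against MALLET. The sketch below assumes fields are split on single spaces and that the last field is the label; `summarize_training_data` is a hypothetical helper, not part of MALLET.

```python
def summarize_training_data(lines):
    """Count tokens, blank-line-delimited instances, and distinct labels,
    taking the last single-space-separated field of each line as the label."""
    labels = set()
    n_tokens = 0
    n_instances = 0
    in_instance = False
    for raw in lines:
        line = raw.rstrip()  # drop the newline and trailing spaces
        if not line:
            in_instance = False  # a blank line ends the current instance
            continue
        if not in_instance:
            n_instances += 1
            in_instance = True
        n_tokens += 1
        # Assumption: a tab-separated line has no spaces, so the
        # whole line ends up counted as its own label.
        labels.add(line.split(" ")[-1])
    return {"tokens": n_tokens, "instances": n_instances, "labels": len(labels)}


if __name__ == "__main__":
    with open("sample", encoding="utf-8") as handle:
        print(summarize_training_data(handle))
```

If the label count comes out close to the number of distinct word/tag pairs rather than the size of the tagset, or the instance count is 1, the file layout is a more likely explanation for the runtime than the corpus size.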

Mallet POS-Tagging learning time
