How do I tag text files with hunpos in nltk?

Asked: 2011-02-23 08:18:15

Tags: python nltk corpus pos-tagger

Can someone help me with the syntax for tagging a corpus in nltk?

  1. What do I need to import to use hunpos.HunPosTagger?

  2. How do I hunpos-tag a corpus? See the code below.


    import nltk
    from nltk.corpus import PlaintextCorpusReader  
    from nltk.corpus.util import LazyCorpusLoader  
    
    corpus_root = './'  
    reader = PlaintextCorpusReader (corpus_root, '.*')  
    
    ntuen = LazyCorpusLoader ('ntumultien', PlaintextCorpusReader, reader)  
    ntuen.fileids()  
    isinstance (ntuen, PlaintextCorpusReader)  
    
    
    # So how do I hunpos tag `ntuen`? I can't get the following code to work.
    # please help me to correct my python syntax errors, I'm new to python 
    # but i really need this to work. sorry
    ##from nltk.tag import hunpos.HunPosTagger
    ht = HunPosTagger('english.model')
    for sentence in ntu.sent() ##looping through the no. of sentence
         ht.tag(ntusent()[i])
    

1 Answer:

Answer 0 (score: 4)

    import nltk
    from nltk.tag.hunpos import HunposTagger
    from nltk.tokenize import word_tokenize

    corpus = "so how do i hunpos tag my ntuen ? i can't get the following code to work."
    # please help me to correct my python syntax errors, i'm new to python
    # but i really need this to work. sorry
    ## from nltk.tag import hunpos.HunPosTagger

    # Load a trained hunpos model (the class is HunposTagger, lowercase "pos").
    ht = HunposTagger('en_wsj.model')
    # tag() expects a list of tokens, so tokenize the text first.
    print(ht.tag(word_tokenize(corpus)))

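I think the main problem is that you weren't tokenizing the text into words, but there are other reasons the code doesn't work (the class is HunposTagger, not HunPosTagger). I made this simplified example from your question. If you have further questions, please leave a comment.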
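To illustrate why tokenizing matters: `tag()` iterates over whatever sequence it is given, so a raw string would be fed to the tagger one character at a time. The sketch below uses a hypothetical helper, `as_tokens`, with plain whitespace splitting as a stand-in for `word_tokenize` so that it runs without NLTK. Real tokenizers also split contractions and punctuation, so treat this only as an illustration.

```python
def as_tokens(sentence):
    """Return a list of word tokens.

    Hypothetical helper for illustration: strings are split on whitespace
    (a rough stand-in for nltk.tokenize.word_tokenize); anything else is
    assumed to already be a sequence of tokens.
    """
    if isinstance(sentence, str):
        return sentence.split()
    return list(sentence)


text = "so how do i tag this ?"

# Iterating over the raw string yields single characters, not words.
print(list(text)[:3])

# Splitting first yields word tokens, which is the shape tag() expects.
print(as_tokens(text))
```

With NLTK installed, `word_tokenize(text)` would take the place of `as_tokens(text)` before the result is passed to `ht.tag(...)`.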

I got everything from here: http://code.google.com/p/hunpos/

    python hunpos.py

    [('so', 'RB'), ('how', 'WRB'), ('do', 'VBP'), ('i', 'FW'), ('hunpos', 'NN'), ('tag', 'NN'), ('my', 'PRP$'), ('ntuen', 'NN'), ('?', '.'), ('i', 'FW'), ('ca', 'MD'), ("n't", 'RB'), ('get', 'VB'), ('the', 'DT'), ('following', 'JJ'), ('code', 'NN'), ('to', 'TO'), ('work', 'VB'), ('.', '.')]
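
To apply the same idea to a whole corpus, as in the question, one option is to loop over the sentences that the corpus reader returns and tag each one. This is a minimal sketch, not a tested recipe: `load_hunpos_tagger` and `tag_corpus` are hypothetical helper names, and it assumes the reader behaves like NLTK's `PlaintextCorpusReader`, whose `fileids()` and `sents()` methods return file names and already-tokenized sentences. The NLTK import is done lazily inside the helper so the rest of the sketch can be read and run without the hunpos binary installed.

```python
def load_hunpos_tagger(model_path):
    """Build a HunposTagger for the given model file (hypothetical helper)."""
    # Imported lazily; note the class name is HunposTagger, lowercase "pos".
    from nltk.tag.hunpos import HunposTagger
    return HunposTagger(model_path)


def tag_corpus(reader, tagger):
    """Yield (fileid, tagged_sentence) for every sentence in the corpus.

    `reader` is assumed to offer fileids() and sents(fileid), as
    PlaintextCorpusReader does; `tagger` needs only a tag(tokens) method.
    """
    for fileid in reader.fileids():
        for sentence in reader.sents(fileid):
            # Each sentence is already a list of word tokens,
            # so it can be passed to tag() directly.
            yield fileid, tagger.tag(sentence)


# Possible usage, assuming NLTK, the hunpos binary, and a model file
# named 'en_wsj.model' are available (the file pattern is an assumption):
#
#   from nltk.corpus import PlaintextCorpusReader
#   reader = PlaintextCorpusReader('./', r'.*\.txt')
#   tagger = load_hunpos_tagger('en_wsj.model')
#   for fileid, tagged in tag_corpus(reader, tagger):
#       print(fileid, tagged)
```

Passing the tagger in as an argument keeps the loop independent of hunpos itself, so the same function would work with any NLTK tagger that exposes `tag()`.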