
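Asked: 2016-02-08 16:55:48

Tags: python nlp nltk

I am learning natural language processing with NLTK. I came across some code that uses PunktSentenceTokenizer, and I cannot understand what it actually does in that code. The code is: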

import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")

custom_sent_tokenizer = PunktSentenceTokenizer(train_text) #A

tokenized = custom_sent_tokenizer.tokenize(sample_text)   #B

def process_content():
    try:
        for i in tokenized[:5]:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            print(tagged)
    except Exception as e:
        print(str(e))

process_content()





4 Answers:

Answer 0: (score: 23)

PunktSentenceTokenizer is the class behind sent_tokenize(), the default sentence tokenizer provided in NLTK. It is an implementation of Unsupervised Multilingual Sentence Boundary Detection (Kiss and Strunk, 2005). See https://github.com/nltk/nltk/blob/develop/nltk/tokenize/__init__.py#L79


>>> from nltk import sent_tokenize
>>> from nltk.corpus import state_union
>>> train_text = state_union.raw("2005-GWBush.txt").split('\n')
>>> train_text[11]
'Two weeks ago, I stood on the steps of this Capitol and renewed the commitment of our nation to the guiding ideal of liberty for all. This evening I will set forth policies to advance that ideal at home and around the world. '


>>> sent_tokenize(train_text[11])
['Two weeks ago, I stood on the steps of this Capitol and renewed the commitment of our nation to the guiding ideal of liberty for all.', 'This evening I will set forth policies to advance that ideal at home and around the world. ']
>>> for sent in sent_tokenize(train_text[11]):
...     print(sent)
...     print('--------')
Two weeks ago, I stood on the steps of this Capitol and renewed the commitment of our nation to the guiding ideal of liberty for all.
--------
This evening I will set forth policies to advance that ideal at home and around the world. 
--------


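sent_tokenize() relies on pre-trained Punkt models. In this installation they are stored under nltk_data/tokenizers/punkt, one pickle per language:

alvas@ubi:~/nltk_data/tokenizers/punkt$ ls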
czech.pickle     finnish.pickle  norwegian.pickle   slovene.pickle
danish.pickle    french.pickle   polish.pickle      spanish.pickle
dutch.pickle     german.pickle   portuguese.pickle  swedish.pickle
english.pickle   greek.pickle    PY3                turkish.pickle
estonian.pickle  italian.pickle  README


>>> german_text = "Die Orgellandschaft Südniedersachsen umfasst das Gebiet der Landkreise Goslar, Göttingen, Hameln-Pyrmont, Hildesheim, Holzminden, Northeim und Osterode am Harz sowie die Stadt Salzgitter. Über 70 historische Orgeln vom 17. bis 19. Jahrhundert sind in der südniedersächsischen Orgellandschaft vollständig oder in Teilen erhalten. "

>>> for sent in sent_tokenize(german_text, language='german'):
...     print(sent)
...     print('---------')
Die Orgellandschaft Südniedersachsen umfasst das Gebiet der Landkreise Goslar, Göttingen, Hameln-Pyrmont, Hildesheim, Holzminden, Northeim und Osterode am Harz sowie die Stadt Salzgitter.
---------
Über 70 historische Orgeln vom 17. bis 19. Jahrhundert sind in der südniedersächsischen Orgellandschaft vollständig oder in Teilen erhalten. 
---------

To train your own Punkt model, see https://github.com/nltk/nltk/blob/develop/nltk/tokenize/punkt.py and the question "training data format for nltk punkt".
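As a rough illustration of what training your own model can look like, here is a minimal sketch that uses the PunktTrainer class from nltk.tokenize.punkt. The training text and sample sentence are made-up placeholders (assumptions for illustration, not from the thread), and a real model needs far more text before the learned abbreviations become reliable. As far as I can tell, passing raw text straight to PunktSentenceTokenizer(train_text) runs the same training internally.

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer

# Placeholder training text (an assumption for illustration). Punkt is
# unsupervised, so plain unannotated text is all it needs.
train_text = (
    "Dr. Smith met Dr. Jones at the clinic. Dr. Jones had arrived early. "
    "They spoke with Dr. Brown about the results. Dr. Brown agreed. "
) * 20

# Learn parameters (abbreviations, collocations, sentence starters)
# from the raw text.
trainer = PunktTrainer()
trainer.train(train_text)

# Build a tokenizer from the learned parameters.
custom_tokenizer = PunktSentenceTokenizer(trainer.get_params())

sample_text = "Dr. Smith called again. He asked for the report."
sentences = custom_tokenizer.tokenize(sample_text)
for sent in sentences:
    print(sent)
    print("--------")
```

Whether "Dr." ends up treated as an abbreviation depends on what the trainer learned from the text, so the exact split can differ between training corpora.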

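Answer 1: (score: 13)

PunktSentenceTokenizer is a sentence boundary detection algorithm that must be trained before it can be used [1]. NLTK already includes a pre-trained version of PunktSentenceTokenizer.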


In [1]: import nltk
In [2]: tokenizer = nltk.tokenize.punkt.PunktSentenceTokenizer()
In [3]: txt = """ This is one sentence. This is another sentence."""
In [4]: tokenizer.tokenize(txt)
Out[4]: [' This is one sentence.', 'This is another sentence.']

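You can also supply your own training data to train the tokenizer before using it. The Punkt tokenizer uses an unsupervised algorithm, which means you can train it on plain, unannotated text.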

custom_sent_tokenizer = PunktSentenceTokenizer(train_text)


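So what does all of this have to do with POS tagging? The NLTK POS tagger works on tokenized sentences, so you need to split the text into sentences, and each sentence into word tokens, before you can POS-tag it.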
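The sentence-then-word order described above can be sketched as follows. This is only an illustration under some assumptions: split_for_tagging is a hypothetical helper name, and TreebankWordTokenizer is swapped in for nltk.word_tokenize because it is rule-based and needs no downloaded model. The tagger model used by nltk.pos_tag is a separate download, so the sketch falls back to printing the bare tokens when it is missing.

```python
import nltk
from nltk.tokenize import PunktSentenceTokenizer, TreebankWordTokenizer


def split_for_tagging(text):
    """Return one list of word tokens per sentence (hypothetical helper)."""
    sent_tokenizer = PunktSentenceTokenizer()   # default, untrained parameters
    word_tokenizer = TreebankWordTokenizer()    # rule-based, no model download
    return [word_tokenizer.tokenize(sent)
            for sent in sent_tokenizer.tokenize(text)]


text = "This is one sentence. This is another sentence."
token_lists = split_for_tagging(text)

for words in token_lists:
    try:
        # pos_tag expects the word tokens of a single sentence.
        print(nltk.pos_tag(words))
    except LookupError:
        # Tagger model not installed; show the tokens that would be tagged.
        print(words)
```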

See also NLTK's documentation.

[1] Kiss and Strunk, "Unsupervised Multilingual Sentence Boundary Detection"

Answer 2: (score: 1)

You can refer to the following link to see how PunktSentenceTokenizer is used. It explains clearly why PunktSentenceTokenizer is used instead of sent_tokenize() in your case.


Answer 3: (score: 0)

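import nltk
from nltk.tokenize import PunktSentenceTokenizer

def process_content(corpus):
    # Split into sentences with an untrained Punkt tokenizer, then tag each one.
    tokenized = PunktSentenceTokenizer().tokenize(corpus)
    try:
        for sent in tokenized:
            words = nltk.word_tokenize(sent)
            tagged = nltk.pos_tag(words)
            print(tagged)
    except Exception as e:
        print(str(e))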

