As the source code shows, word_tokenize runs a sentence tokenizer (Punkt) before running the word tokenizer (Treebank):
# Standard word tokenizer.
_treebank_word_tokenizer = TreebankWordTokenizer()

def word_tokenize(text, language='english', preserve_line=False):
    """
    Return a tokenized copy of *text*,
    using NLTK's recommended word tokenizer
    (currently an improved :class:`.TreebankWordTokenizer`
    along with :class:`.PunktSentenceTokenizer`
    for the specified language).
    :param text: text to split into words
    :param text: str
    :param language: the model name in the Punkt corpus
    :type language: str
    :param preserve_line: An option to keep the preserve the sentence and not sentence tokenize it.
    :type preserver_line: bool
    """
    sentences = [text] if preserve_line else sent_tokenize(text, language)
    return [token for sent in sentences
            for token in _treebank_word_tokenizer.tokenize(sent)]
What is the benefit of doing sentence tokenization before word tokenization?
Answer 0 (score: 2)
The default tokenizer used in NLTK (nltk.word_tokenize) is the TreebankWordTokenizer, originally from Michael Heilman's tokenizer.sed.
In tokenizer.sed we see that it states:
# Assume sentence tokenization has been done first, so split FINAL periods only.
s=\([^.]\)\([.]\)\([])}>"']*\)[ ]*$=\1 \2\3 =g
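This regex always splits off the final period of the input, and only the final one, on the assumption that sentence tokenization has been performed beforehand.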
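To make the effect of that rule concrete, here is a minimal sketch that renders the sed expression with Python's re module. The names FINAL_PERIOD and split_final_period are illustrative only and are not taken from NLTK or tokenizer.sed. It suggests why sentence splitting matters: when two sentences are passed as one string, only the last period is separated.

```python
import re

# Illustrative Python rendering of the sed rule quoted above
# (an assumption-based translation, not code copied from NLTK).
FINAL_PERIOD = re.compile(r'([^.])([.])([\])}>"\']*)[ ]*$')

def split_final_period(text):
    """Separate a period from the preceding character, but only at the end of the string."""
    return FINAL_PERIOD.sub(r'\1 \2\3 ', text)

one_sentence = split_final_period("This is a sentence.")
two_sentences = split_final_period("First one. Second one.")

print(one_sentence.split())   # the final period becomes its own token
print(two_sentences.split())  # the period after "one" in the first sentence stays attached
```

In the second call the token "one." keeps its period, because the pattern is anchored to the end of the string.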
Staying with the Treebank tokenizer, nltk.tokenize.treebank.TreebankWordTokenizer performs the same regex operation, documenting the behavior in the class docstring:
class TreebankWordTokenizer(TokenizerI):
    """
    The Treebank tokenizer uses regular expressions to tokenize text as in Penn Treebank.
    This is the method that is invoked by ``word_tokenize()``. It assumes that the
    text has already been segmented into sentences, e.g. using ``sent_tokenize()``.
    This tokenizer performs the following steps:
    - split standard contractions, e.g. ``don't`` -> ``do n't`` and ``they'll`` -> ``they 'll``
    - treat most punctuation characters as separate tokens
    - split off commas and single quotes, when followed by whitespace
    - separate periods that appear at the end of line
    """
More specifically, "separate periods that appear at the end of line" refers to this particular regex:
# Handles the final period.
# NOTE: the second regex is the replacement during re.sub()
(re.compile(r'([^\.])(\.)([\]\)}>"\']*)\s*$'), r'\1 \2\3 ')
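Is it common to assume that sentence tokenization is performed before word tokenization?
Maybe, maybe not; it depends on your task and how you evaluate it. If we look at other word tokenizers, we see that they perform the same final-period split, e.g. in the Moses (SMT) tokenizer: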
# Assume sentence tokenization has been done first, so split FINAL periods only.
$text =~ s=([^.])([.])([\]\)}>"']*) ?$=$1 $2$3 =g;
Likewise in the NLTK port of the Moses tokenizer:
# Splits final period at end of string.
FINAL_PERIOD = r"""([^.])([.])([\]\)}>"']*) ?$""", r'\1 \2\3'
For users who do not want their text to be sentence-tokenized, the preserve_line option has been available since the code for https://github.com/nltk/nltk/issues/1710 was merged =)
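The difference that preserve_line makes can be sketched without NLTK installed. The following is a toy model under loud assumptions: naive_sent_split and toy_word_tokenize are hypothetical stand-ins, the sentence splitter is far cruder than Punkt, and the word tokenizer applies only the final-period rule instead of the full Treebank rule set. It mirrors only the structure of word_tokenize shown in the question.

```python
import re

# Hypothetical stand-ins for illustration; these are NOT NLTK's implementations.
FINAL_PERIOD = re.compile(r'([^.])(\.)([\])}>"\']*)\s*$')

def naive_sent_split(text):
    # Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

def toy_word_tokenize(text, preserve_line=False):
    # Same shape as word_tokenize: optionally split into sentences first,
    # then tokenize each sentence independently.
    sentences = [text] if preserve_line else naive_sent_split(text)
    return [token
            for sent in sentences
            for token in FINAL_PERIOD.sub(r'\1 \2\3 ', sent).split()]

text = "It rained. We stayed home."
with_split = toy_word_tokenize(text)
without_split = toy_word_tokenize(text, preserve_line=True)

print(with_split)     # every sentence-final period is its own token
print(without_split)  # only the last period is separated
```

In this toy version, skipping the sentence split leaves "rained." as a single token, which is the kind of outcome the sentence tokenization step is there to avoid.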
For more explanation of the why and the what, see https://github.com/nltk/nltk/issues/1699