Consequences of abusing nltk's word_tokenize(sent)

Date: 2013-10-15 04:27:10

Tags: python nltk

I'm trying to split a paragraph into words. I have the lovely nltk.tokenize.word_tokenize(sent) at hand, but help(word_tokenize) says, "This tokenizer is designed to work on a sentence at a time."

Does anyone know what happens if you use it on a paragraph, i.e. up to about 5 sentences, instead? I've tried it myself on a few short paragraphs and it seems to work, but that is hardly conclusive proof.

2 Answers:

Answer 0 (score: 7)

nltk.tokenize.word_tokenize(text) is simply a thin wrapper function that calls the tokenize method of an instance of the TreebankWordTokenizer class, which apparently uses simple regular expressions to parse a sentence.
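In other words (a sketch added for illustration, not from the original answer; it assumes the NLTK version described here, where word_tokenize simply delegates to TreebankWordTokenizer), the call is roughly equivalent to instantiating that tokenizer yourself:

>>> from nltk.tokenize import TreebankWordTokenizer, word_tokenize
>>> text = "Hello, world."
>>> word_tokenize(text) == TreebankWordTokenizer().tokenize(text)
True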

The documentation for the class states:

This tokenizer assumes that the text has already been segmented into sentences. Any periods, apart from those at the end of the string, are assumed to be part of the word they are attached to (e.g. for abbreviations, etc.), and are not separately tokenized.

The underlying tokenize method itself is very simple:

def tokenize(self, text):
    for regexp in self.CONTRACTIONS2:
        text = regexp.sub(r'\1 \2', text)
    for regexp in self.CONTRACTIONS3:
        text = regexp.sub(r'\1 \2 \3', text)

    # Separate most punctuation
    text = re.sub(r"([^\w\.\'\-\/,&])", r' \1 ', text)

    # Separate commas if they're followed by space.
    # (E.g., don't separate 2,500)
    text = re.sub(r"(,\s)", r' \1', text)

    # Separate single quotes if they're followed by a space.
    text = re.sub(r"('\s)", r' \1', text)

    # Separate periods that come before newline or end of string.
    text = re.sub('\. *(\n|$)', ' . ', text)

    return text.split()

Basically, what the method normally does is tokenize the period as a separate token if it falls at the end of the string:

>>> nltk.tokenize.word_tokenize("Hello, world.")
['Hello', ',', 'world', '.']

Any period that falls inside the string is tokenized as part of the word it is attached to, on the assumption that it is an abbreviation:

>>> nltk.tokenize.word_tokenize("Hello, world. How are you?") 
['Hello', ',', 'world.', 'How', 'are', 'you', '?']
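That is exactly the behavior you want for genuine abbreviations; for example (an illustrative case based on the regexes quoted above, not from the original answer):

>>> nltk.tokenize.word_tokenize("Mr. Smith met Dr. Jones.")
['Mr.', 'Smith', 'met', 'Dr.', 'Jones', '.']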

As long as that behavior is acceptable, you should be fine.
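If it is not acceptable, i.e. you also want sentence-final periods inside the paragraph split off, the usual workaround (a sketch, assuming the punkt model used by sent_tokenize is installed) is to segment the paragraph into sentences first and then tokenize each sentence:

>>> from nltk.tokenize import sent_tokenize, word_tokenize
>>> paragraph = "Hello, world. How are you?"
>>> [tok for sent in sent_tokenize(paragraph) for tok in word_tokenize(sent)]
['Hello', ',', 'world', '.', 'How', 'are', 'you', '?']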

Answer 1 (score: 1)

Try this hack:

>>> from string import punctuation as punct
>>> sent = "Mr President, Mr President-in-Office, indeed we know that the MED-TV channel and the newspaper Özgür Politika provide very in-depth information. And we know the subject matter. Does the Council in fact plan also to use these channels to provide information to the Kurds who live in our countries? My second question is this: what means are currently being applied to integrate the Kurds in Europe?"
# Add spaces around punctuation
>>> for ch in sent:
...     if ch in punct:
...             sent = sent.replace(ch, " "+ch+" ")
# Remove double spaces that may result from adding spaces around punctuation.
>>> sent = " ".join(sent.split())

Then most probably the following code is what you need to count the frequencies as well =)

>>> from nltk.tokenize import word_tokenize
>>> from nltk.probability import FreqDist
>>> fdist = FreqDist(word.lower() for word in word_tokenize(sent))
>>> for i in fdist:
...     print i, fdist[i]