Error when using nltk word_tokenize

Date: 2017-03-09 08:53:31

Tags: python nltk

I'm working through some of the exercises on accessing text from the web and from disk in the NLTK book (Chapter 3). When I call word_tokenize, I get an error.

Here is my code:

>>> import nltk
>>> from urllib.request import urlopen
>>> url = "http://www.gutenberg.org/files/2554/2554.txt"
>>> raw = urlopen(url).read()
>>> tokens = nltk.word_tokenize(raw)

Here is the traceback:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\u0084411\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\tokenize\__init__.py", line 109, in word_tokenize
    return [token for sent in sent_tokenize(text, language)
  File "C:\Users\u0084411\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\tokenize\__init__.py", line 94, in sent_tokenize
    return tokenizer.tokenize(text)
  File "C:\Users\u0084411\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py", line 1237, in tokenize
    return list(self.sentences_from_text(text, realign_boundaries))
  File "C:\Users\u0084411\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py", line 1285, in sentences_from_text
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
  File "C:\Users\u0084411\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py", line 1276, in span_tokenize
    return [(sl.start, sl.stop) for sl in slices]
  File "C:\Users\u0084411\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py", line 1276, in <listcomp>
    return [(sl.start, sl.stop) for sl in slices]
  File "C:\Users\u0084411\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py", line 1316, in _realign_boundaries
    for sl1, sl2 in _pair_iter(slices):
  File "C:\Users\u0084411\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py", line 310, in _pair_iter
    prev = next(it)
  File "C:\Users\u0084411\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py", line 1289, in _slices_from_text
    for match in self._lang_vars.period_context_re().finditer(text):
TypeError: cannot use a string pattern on a bytes-like object
>>>

Can someone explain what is going on here and why I can't seem to use word_tokenize correctly?

Thanks a lot!

2 Answers:

Answer 0 (score: 4)

You have to convert the raw content (which urlopen().read() returns as a bytes object) into a string using decode('utf-8'):

>>> import nltk
>>> from urllib.request import urlopen
>>> url = "http://www.gutenberg.org/files/2554/2554.txt"
>>> raw = urlopen(url).read()
>>> raw = raw.decode('utf-8')
>>> tokens = nltk.word_tokenize(raw)
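
If you would rather not hard-code utf-8, the charset can usually be read from the HTTP response headers. This is a minimal sketch (not part of the original answer) that falls back to utf-8 when the server does not declare a charset:

import nltk
from urllib.request import urlopen

url = "http://www.gutenberg.org/files/2554/2554.txt"
response = urlopen(url)

# Ask the server which charset it used; fall back to utf-8 if none is declared.
encoding = response.headers.get_content_charset() or "utf-8"
raw = response.read().decode(encoding)

# raw is now a str, so the Punkt tokenizer's string regexes work.
tokens = nltk.word_tokenize(raw)
print(tokens[:10])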

Answer 1 (score: 0)

I got a 404 error for that url, so I changed it. This worked for me. You can change the url to the one below; maybe it will work for you too.

from urllib import request
url = "https://ia803405.us.archive.org/21/items/crimeandpunishme02554gut/2554.txt"
raw = request.urlopen(url).read()
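
Note that this snippet still leaves raw as a bytes object, so the decode step from the first answer is still needed before tokenizing. A minimal sketch combining the two, assuming the archive.org mirror serves UTF-8 text:

import nltk
from urllib import request

url = "https://ia803405.us.archive.org/21/items/crimeandpunishme02554gut/2554.txt"
raw = request.urlopen(url).read().decode('utf-8')  # bytes -> str before tokenizing
tokens = nltk.word_tokenize(raw)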