Question

I am just trying to reproduce the code given at http://www.nltk.org/book/ch03.html for reading data from the web. The chapter shows it like this:

>>> from urllib import request
>>> url = "http://www.gutenberg.org/files/2554/2554.txt"
>>> response = request.urlopen(url)
>>> raw = response.read().decode('utf8')
>>> type(raw)
<class 'str'>
>>> len(raw)
1176893
>>> raw[:75]
'The Project Gutenberg EBook of Crime and Punishment, by Fyodor Dostoevsky\r\n'

Here is the code I wrote:

from urllib import request
from nltk import word_tokenize  # needed for word_tokenize() below

# Original URL from the chapter's sample code; it gives an error,
# so I changed it to the HTML URL below.
#url = "http://www.gutenberg.org/files/2554/2554.txt"
url = "http://www.gutenberg.org/files/2554/2554-h/2554-h.htm"
response = request.urlopen(url)
raw = response.read().decode('utf8')  # the raw HTML document, markup included
print('data type of raw = ', type(raw))
print('length of raw = ', len(raw))
print('initial contents - ', raw[:175])
tokens = word_tokenize(raw)
print('tokens\n', tokens[:100])

I expected this to return the text without HTML tags, but the output I get still contains the tags.

See the output of tokens below:

['<', '?', 'xml', 'version=', "''", '1.0', "''", 'encoding=', "''", 'utf-8', "''", '?', '>', '<', '!', 'DOCTYPE', 'html', 'PUBLIC', '``', '-//W3C//DTD', 'XHTML', '1.0', 'Strict//EN', "''", '``', 'http', ':', '//www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd', "''", '>', '<', 'html', 'xmlns=', "''", 'http', ':', '//www.w3.org/1999/xhtml', "''", 'lang=', "''", 'en', "''", '>', '<', 'head', '>', '<', 'title', '>', 'Crime', 'and', 'Punishment', ',', 'by', 'Fyodor', 'Dostoevsky', '<', '/title', '>', '<', 'style', 'type=', "''", 'text/css', "''", 'xml', ':', 'space=', "''", 'preserve', "''", '>', 'body', '{', 'margin:5', '%', ';', 'background', ':', '#', 'faebd0', ';', 'text-align', ':', 'justify', '}', 'P', '{', 'text-indent', ':', '1em', ';', 'margin-top', ':', '.25em', ';', 'margin-bottom', ':', '.25em', ';']

How can I get plain text as the output?
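
For context, urlopen() followed by read() only hands back whatever the server sends, so an .htm URL yields markup and all; the tags have to be removed in a separate step before tokenizing. The same NLTK chapter does this with BeautifulSoup's get_text(). Below is a minimal sketch of the idea using only the standard library's html.parser, so it runs without extra installs. The names TextExtractor and html_to_text are hypothetical helpers invented for this illustration, and the sample string is a made-up stand-in for the downloaded page, not the real Gutenberg file.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects text nodes, skipping <script>/<style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while inside a <script> or <style> element
        self._chunks = []      # text fragments found between tags

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only text that is not CSS or JavaScript source.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join fragments with spaces and collapse runs of whitespace.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html):
    """Return the visible text of an HTML string (assumed helper name)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


# Made-up sample standing in for the downloaded page.
sample = (
    '<?xml version="1.0" encoding="utf-8"?>'
    "<html><head><title>Crime and Punishment</title>"
    "<style>body {margin:5%}</style></head>"
    "<body><p>On an exceptionally hot evening</p></body></html>"
)
print(html_to_text(sample))
```

With the code from the question, the call would presumably become tokens = word_tokenize(html_to_text(raw)). Note that this sketch joins text fragments with spaces, so a word split by an inline tag would come out as two pieces; a dedicated HTML library handles such cases more carefully.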

urlopen / read returns text with HTML tags instead of plain text

0 answers: