I am using the Goose engine to extract article text from a URL with the following code:
g = Goose()
article = g.extract(url="http://www.sportingnews.com/ncaa-football/story/2013-09-17/week-4-exit-poll-johnny-manziel-alabama-oregon-texas-mack-brown-mariota")
There seems to be a problem with this URL, because I get the following error:
'utf8' codec can't decode byte 0xa0 in position 1: invalid start byte
The string that could not be encoded/decoded was: �
I have correctly declared utf-8 as the encoding at the top of my file, like this:
# -*- coding: utf-8 -*-
How can I solve this?
Edit: stack trace:
Environment:
Request Method: GET
Request URL: http://localhost:3000/scansources/
Django Version: 1.5.1
Python Version: 2.7.2
Installed Applications:
('django.contrib.auth',
'django.contrib.contenttypes',
'django.contrib.sessions',
'django.contrib.sites',
'django.contrib.messages',
'django.contrib.staticfiles',
'summaries',
'sources_scan')
Installed Middleware:
('django.middleware.common.CommonMiddleware',
'django.contrib.sessions.middleware.SessionMiddleware',
'django.middleware.csrf.CsrfViewMiddleware',
'django.contrib.auth.middleware.AuthenticationMiddleware',
'django.contrib.messages.middleware.MessageMiddleware')
Traceback:
File "/Library/Python/2.7/site-packages/django/core/handlers/base.py" in get_response
115. response = callback(request, *callback_args, **callback_kwargs)
File "/Users/yonatanoren/Documents/python/summarizer/sources_scan/views.py" in scan_sources
183. article = g.extract(url="http://www.sportingnews.com/ncaa-football/story/2013-09-17/week-4-exit-poll-johnny-manziel-alabama-oregon-texas-mack-brown-mariota")
File "/Library/Python/2.7/site-packages/goose_extractor-1.0.2-py2.7.egg/goose/__init__.py" in extract
53. return self.crawl(cc)
File "/Library/Python/2.7/site-packages/goose_extractor-1.0.2-py2.7.egg/goose/__init__.py" in crawl
60. article = crawler.crawl(crawl_candiate)
File "/Library/Python/2.7/site-packages/goose_extractor-1.0.2-py2.7.egg/goose/crawler.py" in crawl
90. article.top_node = extractor.calculate_best_node(article)
File "/Library/Python/2.7/site-packages/goose_extractor-1.0.2-py2.7.egg/goose/extractors.py" in calculate_best_node
248. text_node = self.parser.getText(node)
File "/Library/Python/2.7/site-packages/goose_extractor-1.0.2-py2.7.egg/goose/parsers.py" in getText
179. txts = [i for i in node.itertext()]
Exception Type: UnicodeDecodeError at /scansources/
Exception Value: 'utf8' codec can't decode byte 0xa0 in position 1: invalid start byte
Thanks.
Edit: I get the same error in the Python shell with this code:
>>> g = Goose()
>>> article = g.extract(url="http://www.sportingnews.com/ncaa-football/story/2013-09-17/week-4-exit-poll-johnny-manziel-alabama-oregon-texas-mack-brown-mariota")
I also updated all of my files to use the following, but I still get the error.
#encoding=utf-8
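I believe this may be a problem in Goose itself, since Goose is what processes the text and returns it. How can I work around it in that case?
Edit:
The following made no difference:
text = unicode(article.cleaned_text, 'utf-8')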
Answer 0 (score: 1)
You can try the raw_html extraction: https://github.com/grangier/python-goose#known-issues
That lets you do the encoding/decoding on the raw HTML yourself before handing it to Goose.
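As a rough illustration of the "encoding/decoding on the raw HTML" step, here is a minimal sketch. It assumes the page bytes are the cause, i.e. that the page is not valid UTF-8 (byte 0xa0 is a non-breaking space in cp1252 and latin-1, but is not a valid UTF-8 start byte). The helper name decode_html and the choice of fallback encodings are my own assumptions, not part of Goose.

```python
# Hypothetical helper (not part of Goose): decode downloaded page bytes
# with a list of candidate encodings, falling back to latin-1.
def decode_html(raw_bytes, encodings=("utf-8", "cp1252")):
    """Return text decoded with the first encoding that succeeds."""
    for encoding in encodings:
        try:
            return raw_bytes.decode(encoding)
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a code point, so this cannot fail.
    return raw_bytes.decode("latin-1")


# The byte from the error message: invalid as UTF-8, a no-break space in cp1252.
sample = b"a\xa0b"
text = decode_html(sample)

# With Goose, the decoded text would then be passed in via the raw_html
# argument described in the README's known-issues section, for example:
#   article = Goose().extract(raw_html=decode_html(page_bytes))
# where page_bytes is whatever you downloaded from the URL yourself.
```

Guessing the encoding this way can produce wrong characters if the page uses some other encoding, so it is better to use the charset from the HTTP response headers or the page's meta tag when one is available.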
Answer 1 (score: 0)
Using unicode for all strings may help:
Insert from __future__ import unicode_literals as the first line of your Python file
and try again...
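Answer 2 (score: 0)
Try adding a u prefix to the string. I don't see any unusual characters in the URL, but I often use Hebrew in my Django code, and the encoding declaration at the top of the file is not always enough.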
article = g.extract(url=u"http://www.sportingnews.com/ncaa-football/story/2013-09-17/week-4-exit-poll-johnny-manziel-alabama-oregon-texas-mack-brown-mariota")
Answer 3 (score: 0)
I could not reproduce the error with this URL, but I have run into a similar problem with python-goose. Try:
from goose.configuration import Configuration
from goose import Goose
config = Configuration()
config.parser_class = 'soupparser' # this helped me
g = Goose(config)
article = g.extract(url="http://www.sportingnews.com/ncaa-football/story/2013-09-17/week-4-exit-poll-johnny-manziel-alabama-oregon-texas-mack-brown-mariota")
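If you want to keep the default parser and only switch when decoding fails, one option is a small retry wrapper. This is just a sketch: extract_with_fallback is a hypothetical helper of mine, and it takes plain callables so that it does not depend on any particular Goose version.

```python
# Hypothetical helper: try each extraction callable in order and return
# the first result that does not raise UnicodeDecodeError.
def extract_with_fallback(url, extractors):
    last_error = None
    for extract in extractors:
        try:
            return extract(url)
        except UnicodeDecodeError as exc:
            last_error = exc  # remember the failure and try the next one
    if last_error is None:
        raise ValueError("no extractors were supplied")
    raise last_error


# Sketch of how it could be wired to Goose, reusing the g from the snippet
# above (soupparser config) plus a default Goose() instance:
#   default_g = Goose()
#   article = extract_with_fallback(
#       url,
#       [lambda u: default_g.extract(url=u), lambda u: g.extract(url=u)],
#   )
```

Any exception other than UnicodeDecodeError is deliberately left to propagate, so unrelated failures such as network errors are not hidden by the retry.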