Question

Scrapy's link extractors are failing for me. For example:

scrapy shell "http://www.dachser.com/de/de/"
# within shell
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
SgmlLinkExtractor().extract_links(response)
# yields: SGMLParseError: expected name token at '<!/IoRangeRedDotMode'

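For now I only need a list of all links, which is why I switched from SgmlLinkExtractor to the more basic HtmlParserLinkExtractor. That works for the URL above, but with another URL even that one fails: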

scrapy shell "http://www.yourfirm.de"
# within shell
from scrapy.contrib.linkextractors.htmlparser import HtmlParserLinkExtractor
HtmlParserLinkExtractor().extract_links(response)
# yields: UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 26: ordinal not in range(128)

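What is going on here? I plan to extract links from all kinds of websites, so a more foolproof way of extracting links would be very welcome.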

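Update: OK, I found that the ascii error can be worked around on Windows by setting utf-8 as the system default encoding (see here). Now other pages fail as well, for example scrapy shell "http://grunwald-wangen.de", which results in {{1}}.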

Answer 1

HtmlParserLinkExtractor passes response.body, the raw byte string, to HTMLParser.

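Changing the source code so that the parser receives response.body_as_unicode() instead fixes the problem. The doc says that using unicode is recommended. I made a pull request on GitHub.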
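To illustrate the idea of decoding before parsing, here is a minimal sketch that uses only the Python 3 standard library. It is not Scrapy's code: LinkCollector, extract_links and the sample markup are hypothetical names made up for this example, and UTF-8 is an assumed encoding. The sample body contains "ü", which UTF-8 encodes with the lead byte 0xc3, the byte named in the error above.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper that collects the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(body, encoding="utf-8"):
    # Decode the raw bytes first, so the parser only ever sees text.
    # The encoding is an assumption; undecodable bytes are replaced
    # instead of raising.
    text = body.decode(encoding, errors="replace")
    parser = LinkCollector()
    parser.feed(text)
    parser.close()
    return parser.links


# Sample page as bytes, containing a non-ASCII character.
body = '<a href="/über-uns">Über uns</a>'.encode("utf-8")
links = extract_links(body)
print(links)
```

In a real spider the encoding would come from the response rather than from a default argument; the point of the sketch is only that the byte-to-text step happens before the parser is fed.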

As Berendt pointed out in the comments, SgmlLinkExtractor seems to choke on some malformed HTML.
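As a rough illustration of a more tolerant approach, the sketch below feeds a fragment with a malformed declaration to the Python 3 standard-library HTMLParser, which skips over markup it does not understand instead of raising. The fragment is made up for this example and only modeled on the token from the error message in the question; HrefParser is a hypothetical name, and this has not been run against the actual page.

```python
from html.parser import HTMLParser


class HrefParser(HTMLParser):
    """Hypothetical helper that records href values of <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        if tag == "a" and href:
            self.hrefs.append(href)


# Made-up fragment containing a malformed "<!/..." declaration.
malformed = (
    "<html><body><!/IoRangeRedDotMode>"
    '<a href="/kontakt">Kontakt</a></body></html>'
)

parser = HrefParser()
parser.feed(malformed)  # the malformed declaration is skipped
parser.close()
hrefs = parser.hrefs
print(hrefs)
```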

Scrapy linkextractors fail