Question

The HTML I'm trying to scrape is malformed:

<html>
<head>...</head>
<body>
    My items here...
    My items here...
    My items here...

    Pagination here...
</body>
</head>
</html>

The problem is the second </head>. I have to fix up the HTML inside my spider before I can use XPath expressions on it:

from scrapy.linkextractors import LinkExtractor  # successor to SgmlLinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class FooSpider(CrawlSpider):
    name = 'foo'
    allowed_domains = ['foo.bar']
    start_urls = ['http://foo.bar/index.php?page=1']
    rules = (Rule(LinkExtractor(allow=(r'\?page=\d',)),
                  callback="parse_start_url",
                  follow=True),)

    def parse_start_url(self, response):
        # Remove the second </head> here
        # Build my items
        pass

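Now I'd like to use the restrict_xpaths parameter in my rule, but I can't, because the HTML is malformed: the link extractor runs before my replacement has been applied.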

Any ideas?

Answer 1

What I would do is write a downloader middleware and use the BeautifulSoup package to fix and prettify the HTML contained in response.body; response.replace() could come in handy here.

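Note that if you go with BeautifulSoup, you should choose the parser carefully: each parser recovers from broken HTML in its own way, and some are more lenient than others. lxml.html would be the best in terms of speed.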
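
To see how a given parser copes with this particular page before committing to it, something like the following sketch may help. It feeds a stand-in for the broken markup from the question to each parser that happens to be installed and prints what comes back. The BROKEN_HTML sample and the repair_with helper are made up for illustration, and the exact output is not asserted here because it depends on the parser and its version.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Stand-in for the malformed page from the question (stray second </head>).
BROKEN_HTML = """<html>
<head><title>t</title></head>
<body>
    <p>My items here...</p>
</body>
</head>
</html>"""


def repair_with(parser_name, html=BROKEN_HTML):
    """Return the markup as rebuilt by one parser, or None if it is not installed."""
    try:
        soup = BeautifulSoup(html, parser_name)
    except FeatureNotFound:
        # bs4 raises this when the requested parser library is missing.
        return None
    return str(soup)


if __name__ == "__main__":
    for name in ("html.parser", "lxml", "html5lib"):
        repaired = repair_with(name)
        print("==", name, "==")
        print(repaired if repaired is not None else "(parser not installed)")
```

Comparing the printed trees side by side shows whether the pagination block still ends up where your XPath expects it.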

Example:

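from bs4 import BeautifulSoup
from scrapy.http import HtmlResponse


class MyMiddleware(object):
    def process_response(self, request, response, spider):
        # Leave non-HTML responses (images, JSON, ...) untouched.
        if not isinstance(response, HtmlResponse):
            return response
        # Re-parse the broken markup and write the repaired tree back.
        soup = BeautifulSoup(response.body, "lxml")
        return response.replace(body=soup.prettify())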
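
The middleware only takes effect once it is enabled in the project settings. A minimal sketch, assuming the class lives in a module called myproject.middlewares (that path is hypothetical, adjust it to your project):

```python
# settings.py
# "myproject.middlewares" is an assumed module path, not something from the question.
DOWNLOADER_MIDDLEWARES = {
    # process_response hooks run in decreasing priority order, so a value
    # below 590 (the built-in HttpCompressionMiddleware) means the body
    # has already been decompressed when MyMiddleware sees it.
    "myproject.middlewares.MyMiddleware": 543,
}
```

Since downloader middlewares run before the response reaches the spider, the rules of a CrawlSpider should then see the repaired HTML, which is what restrict_xpaths needs.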

For another example of a custom middleware that modifies the downloaded HTML, see the scrapy-splash middleware.
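
If the stray </head> is the only defect, a full re-parse may be more than is needed. As a lighter alternative (not part of the approach above, just a sketch), the same process_response hook could do the targeted replacement the question describes. The names drop_stray_head_close and StripStrayHeadMiddleware are invented for this example.

```python
import re

# Matches a closing </head> tag, tolerating case and whitespace before ">".
_HEAD_CLOSE = re.compile(rb"</head\s*>", re.IGNORECASE)


def drop_stray_head_close(body: bytes) -> bytes:
    """Remove every </head> after the first one, leaving the rest untouched."""
    first = _HEAD_CLOSE.search(body)
    if first is None:
        return body
    head, tail = body[: first.end()], body[first.end():]
    return head + _HEAD_CLOSE.sub(b"", tail)


class StripStrayHeadMiddleware:
    """Hypothetical downloader middleware applying the fix to each response."""

    def process_response(self, request, response, spider):
        # Works on bytes in and bytes out, so no encoding round trip is needed.
        return response.replace(body=drop_stray_head_close(response.body))
```

This keeps the rest of the page byte-for-byte identical, at the cost of handling only this one kind of breakage.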
