Question

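Please excuse the newbie question; I'm a Scrapy beginner.

I'm seeing a strange difference between the scrapy shell and my spider when they use the same XPath query. The spider is set up to follow the "Next" page link and then parse the results.

The query: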

response.xpath('//div/div/span/a[starts-with(.,"Next")]/@href')

The spider code:

import scrapy

from tutorial.items import DmozItem

class DmozSpider(scrapy.Spider):
    name = "mich"
    allowed_domains = ["lib-web.org"]
    start_urls = [
        "http://www.lib-web.org/united-states/public-libraries/michigan/"
    ]

    def parse(self, response):
        for sel in response.xpath('//div/div/div/ul/li'):
            item = DmozItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('p/text()').extract()

            yield item

        next_page = response.xpath('//div/div/span/a[starts-with(.,"Next")]/@href')
        if next_page:
            url = response.urljoin(next_page[0].extract())
            print "#################################################################"
            print url
            print "#################################################################"
            yield scrapy.Request(url, self.parse)

DmozItem:

import scrapy

class DmozItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()

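When I run the query in the shell, the output does not show the actual href of the next page, which should include "/page-2.html" (relative to the start_urls entry):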

Admin$ scrapy shell "http://www.lib-web.org/united-states/public-libraries/michigan/"
...
In [1]: response.xpath('//div/div/span/a[starts-with(.,"Next")]/@href')
Out[1]: [<Selector xpath='//div/div/span/a[starts-with(.,"Next")]/@href' 
data=u'/united-states/public-libraries/michigan'>]

However, the spider, using this exact query, finds the correct href for the next page (/page-2.html). The full next-page URL looks like this:

http://www.lib-web.org/united-states/public-libraries/michigan/page-2.html

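The shell shows nothing of the sort.

So how is the spider working? How does it get the next page when the shell query doesn't show it?

By the way, if I add ".extract()" to the shell query, it does show the next-page URL, which is what I wanted to see: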

In [1]: response.xpath('//div/div/span/a[starts-with(.,"Next")]/@href').extract()
Out[1]: [u'/united-states/public-libraries/michigan/page-2.html']

But if I add ".extract()" to the same query in the spider, I get the following error:

 url = response.urljoin(next_page[0].extract())
AttributeError: 'unicode' object has no attribute 'extract'

Thanks!

Answer 1

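OK, I just answered this myself. As usual, it pays to read the error message.

In the shell query I was not calling .extract(), which is why I didn't see the URL I wanted. The spider works fine with the same query because I call .extract() when doing the urljoin: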

url = response.urljoin(next_page[0].extract())
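
One detail worth spelling out: the shell's Out[1] is the repr of a Selector, not the extracted string, and in the Scrapy versions I'm aware of that repr shortens the data preview to roughly 40 characters. '/united-states/public-libraries/michigan' is exactly 40 characters long, which would explain why "/page-2.html" was missing from the display even though the full value was there all along. The sketch below uses only the standard library; PreviewSelector is a hypothetical stand-in (not Scrapy's class) and the 40-character width is an assumption about the repr.

```python
# Hypothetical stand-in for a Scrapy Selector: the repr shows only a
# shortened preview, while extract() returns the full value.
PREVIEW_WIDTH = 40  # assumed preview width, not read from Scrapy


class PreviewSelector(object):
    def __init__(self, data):
        self._data = data

    def extract(self):
        # Full, untruncated string.
        return self._data

    def __repr__(self):
        # Only the first PREVIEW_WIDTH characters are displayed.
        return "<Selector data=%r>" % self._data[:PREVIEW_WIDTH]


href = "/united-states/public-libraries/michigan/page-2.html"
sel = PreviewSelector(href)

shown = repr(sel)     # what an interactive shell would display
full = sel.extract()  # what the spider actually passes to urljoin
print(shown)
print(full)
```

The displayed preview stops before "/page-2.html", but extract() still returns the complete href, so the spider and the shell were looking at the same data.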

So .extract() does need to happen. It can go in the urljoin call or in the response.xpath query before it, but not in both places.
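
The "extract exactly once" point can be sketched with the standard library alone. As far as I know, response.urljoin resolves a link against response.url the same way urllib's urljoin does, so this Python 3 sketch calls urljoin directly. HrefSelector and to_absolute are hypothetical helpers for illustration, not part of Scrapy.

```python
from urllib.parse import urljoin

BASE = "http://www.lib-web.org/united-states/public-libraries/michigan/"


class HrefSelector(object):
    """Hypothetical stand-in for one Selector in the xpath() result."""

    def __init__(self, data):
        self._data = data

    def extract(self):
        return self._data


def to_absolute(base, node):
    # Extract here, once: node must still be a selector, not a string.
    return urljoin(base, node.extract())


nodes = [HrefSelector("/united-states/public-libraries/michigan/page-2.html")]

# Option A: keep selectors and extract at join time (what the spider does).
url_a = to_absolute(BASE, nodes[0])

# Option B: extract up front, then join the plain string.
hrefs = [n.extract() for n in nodes]
url_b = urljoin(BASE, hrefs[0])

# Doing both fails, because a plain string has no extract() method.
try:
    to_absolute(BASE, hrefs[0])
    double_extract_failed = False
except AttributeError:
    double_extract_failed = True

print(url_a)
print(url_b)
print(double_extract_failed)
```

Both options produce the same absolute URL. Extracting twice raises the same kind of AttributeError shown in the question.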

Strange results for the same XPath query between scrapy shell and spider. Why?
