Question

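I'm scraping a web page, but I'm not getting the expected output.

I'm learning web scraping and am still a beginner. The problem is that not all of the quotes are being scraped.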

import scrapy

class QuoteSpider(scrapy.Spider):
    name = 'Quotes'
    start_urls = [
        'http://quotes.toscrape.com/'
    ]

    def parse(self, response):
        for quotes in response.selector.xpath("//div[@class='quote']"):
            yield {
                'text': quotes.xpath("//span[@class='text']/text()").extract_first(),
                'author': quotes.xpath("//small[@class='author']/text()").extract_first(),
                'tags': quotes.xpath("//div[@class='tags']/child::a/text()").extract(),
            }

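I expected all of the quotes on the first page to be scraped. Instead, I get the same quote text and author over and over, while every tag on the page is extracted each time. I'm still a beginner and would appreciate any help.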

Answer 1

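This is a common mistake when using XPath on nested selectors.

When you call xpath() on a selector you have already extracted and you want that selector's node to act as the root of the new query, the expression has to start with a dot (.). Without it, an expression beginning with // is evaluated against the whole DOM as usual, which is why every iteration returns the first text and author on the page together with all of the page's tags.

So just change the last few lines to: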

{
    'text': quotes.xpath(".//span[@class='text']/text()").extract_first(),
    'author': quotes.xpath(".//small[@class='author']/text()").extract_first(),
    'tags': quotes.xpath(".//div[@class='tags']/child::a/text()").extract(),
}
