Question

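I'm trying to build a spider with Scrapy that returns data from multiple pages. So far I can scrape the data from the first page, but I'm struggling to get any further. This is my code so far: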

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class AutoscoutSpider(scrapy.Spider):
    name = 'autoscout'
    allowed_domains = ['www.autoscout24.de']
    start_urls = ['https://www.autoscout24.de/ergebnisse?mmvmk0=29&mmvco=1&cy=D&powertype=kw&atype=C&ustate=N%2CU&sort=standard&desc=0']


    def parse(self, response):
        car_name = response.css(".cldt-summary-makemodel::text").extract()
        car_functions = response.css(".cldt-summary-subheadline.sc-font-m.sc-ellipsis::text").extract()
        car_price = response.css(".cldt-price.sc-font-xl.sc-font-bold::text").extract()
        filtered_car_price = filter(lambda x: x not in '\n\n€,-\n', car_price)

        for item in zip(car_name,filtered_car_price,car_functions):
            zipped_info = {
                            'name' : item[0],
                            'price' : item[1],
                            'description' : item[2],
                                             }

            yield zipped_info

I tried using a LinkExtractor to get the URLs of the following pages:

rules = (Rule(LinkExtractor(allow=(), restrict_css=('.next-page',)),
         callback="parse_item", follow=True))

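Accordingly, I made sure to rename the parse function to parse_item, to avoid overriding Scrapy's built-in method. I think I'm missing something in the restrict_css argument, but I'm not sure what it is.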

Answer 1

Looking at the page source, you can see that the navigation links are not defined in the HTML; instead, there is a template that is later filled in by JavaScript:

<div class="cl-pagination">
  <ul class="sc-pagination" data-previous-text="Zurück" data-next-text="Weiter" data-page-size="20" data-current-page="1" data-total-items="86141" data-page-template="/ergebnisse?powertype=kw&pricetype=public&cy=D&mmvmk0=29&mmvco=1&zipr=1000&sort=standard&ustate=N&ustate=U&atype=C&page={page}&size={size}"></ul>
</div>

From some simple tests I did, it seems you can reach the other pages of the list simply by adding a page parameter.
However, it seems that page and size are both capped at 20, so you can only get 400 results for a single search.
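
As a minimal sketch of that idea, the helper below fills the {page} and {size} placeholders of the data-page-template value to produce one URL per result page. The function name build_page_urls and the constants are hypothetical, the host is taken from the question's start_urls, and the limit of 20 pages is only the one observed in the tests above; the site may behave differently today.

```python
from urllib.parse import urljoin

# Host taken from the question's start_urls.
BASE_URL = "https://www.autoscout24.de"

# Copied from the data-page-template attribute quoted in this answer.
PAGE_TEMPLATE = (
    "/ergebnisse?powertype=kw&pricetype=public&cy=D&mmvmk0=29&mmvco=1"
    "&zipr=1000&sort=standard&ustate=N&ustate=U&atype=C&page={page}&size={size}"
)

MAX_PAGE = 20   # assumption: the page limit observed in the answer's tests
PAGE_SIZE = 20  # value of the data-page-size attribute


def build_page_urls(template=PAGE_TEMPLATE, base_url=BASE_URL,
                    max_page=MAX_PAGE, size=PAGE_SIZE):
    """Return absolute URLs for result pages 1..max_page (hypothetical helper)."""
    return [
        # Fill the two placeholders, then resolve the path against the host.
        urljoin(base_url, template.format(page=page, size=size))
        for page in range(1, max_page + 1)
    ]


if __name__ == "__main__":
    for url in build_page_urls(max_page=3):
        print(url)
```

In the spider from the question, one option would be to set start_urls = build_page_urls(), so that the existing parse callback runs once for each page.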

Scrapy: problem crawling multiple pages
