Scrapy: visiting inner URLs

Time: 2018-12-06 07:53:39

Tags: python web-scraping scrapy

I have a URL in the start_urls array, as shown below:

    start_urls = [
        'https://www.ebay.com/sch/tp_peacesports/m.html?_nkw=&_armrs=1&_ipg=&_from='
    ]

    def parse(self, response):
        shop_title = self.getShopTitle(response)
        sell_count = self.getSellCount(response)
        self.shopParser(response, shop_title, sell_count)


    def shopParser(self, response, shop_title, sell_count):
        items = EbayItem()
        items['shop_title'] = shop_title
        items['sell_count'] = sell_count
        if sell_count > 0:
            item_links = response.xpath('//ul[@id="ListViewInner"]/li/h3/a/@href').extract()
            for link in item_links:
                items['item_price'] = response.xpath('//span[@itemprop="price"]/text()').extract_first()

        yield items

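Now, inside the for loop in shopParser(), I have different links, and for each one I need a response that differs from the original response from start_urls. How can I achieve this?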

1 Answer:

Answer 0: (score: 1)

You need to issue requests for the new pages; otherwise you will not get any new HTML. Try something like:

    def parse(self, response):
        # On the first page these come from the page itself; on follow-up
        # pages they are carried over through response.meta.
        shop_title = response.meta.get('shop_title', self.getShopTitle(response))
        sell_count = response.meta.get('sell_count', self.getSellCount(response))
        # your item-parsing logic goes here
        if sell_count > 0:
            item_links = response.xpath('//ul[@id="ListViewInner"]/li/h3/a/@href').extract()
            # yield requests to the next pages
            for link in item_links:
                yield scrapy.Request(response.urljoin(link), meta={'shop_title': shop_title, 'sell_count': sell_count})

These new requests will also be parsed by the parse function. Alternatively, you can set another callback if needed.