I created a spider with Scrapy and wrote scripts to crawl many pages.
Unfortunately, not all of the scripts scrape every page. Some runs return all pages, while others return only 23 or 180 results (the count differs per URL).
import scrapy


class BotCrawl(scrapy.Spider):
    name = "crawl-bl2"

    start_urls = [
        'http://www.bukalapak.com/c/perawatan-kecantikan/perawatan-wajah?page=1&search%5Bsort_by%5D=last_relist_at%3Adesc&utf8=%E2%9C%93',
    ]

    def parse(self, response):
        # extract every product in the listing grid
        for grid in response.css("ul[class='products row-grid']"):
            for product in grid.css('li'):
                yield {
                    'judul': product.css('a[class="product__name line-clamp--2 js-tracker-product-link"]::text').extract(),
                    'penjual': product.css('h5[class=user__name] a::attr(href)').extract(),
                    'link': product.css('a[class="product__name line-clamp--2 js-tracker-product-link"]::attr(href)').extract(),
                    'kota': product.css('div[class=user-city] a::text').extract(),
                    'harga': product.css('div[class=product-price]::attr(data-reduced-price)').extract()
                }

        # follow the "next page" link until there is none
        next_page_url = response.css("div.pagination > a[class=next_page]::attr(href)").extract_first()
        if next_page_url is not None:
            yield scrapy.Request(response.urljoin(next_page_url))
Is it blocking the HTTP requests, or might there be a bug in my code?
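One way to separate the two causes is to stop Scrapy from silently discarding non-200 responses and log them instead; a minimal sketch (the status codes are an assumption about what a blocking site might return):

    import scrapy


    class BotCrawl(scrapy.Spider):
        name = "crawl-bl2"
        # let parse() see throttled/blocked responses instead of Scrapy dropping them
        handle_httpstatus_list = [403, 429, 503]  # assumed codes; adjust to what the site actually sends

        def parse(self, response):
            if response.status != 200:
                # a non-200 here points to blocking/throttling rather than a selector bug
                self.logger.warning("got %s for %s", response.status, response.url)
                return
            # ... normal parsing continues here ...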
Edit: the code below, updated after Granitosaurus's answer, still has errors.
import scrapy


class BotCrawl(scrapy.Spider):
    name = "crawl-bl2"

    start_urls = [
        'http://www.bukalapak.com/c/perawatan-kecantikan/perawatan-wajah?page=1&search%5Bsort_by%5D=last_relist_at%3Adesc&utf8=%E2%9C%93',
    ]

    def parse(self, response):
        products = response.css('article.product-display')
        for product in products:
            yield {
                'judul': product.css('a[class="product__name line-clamp--2 js-tracker-product-link"]::text').extract(),
                'penjual': product.css('h5[class=user__name] a::attr(href)').extract(),
                'link': product.css('a[class="product__name line-clamp--2 js-tracker-product-link"]::attr(href)').extract(),
                'kota': product.css('div[class=user-city] a::text').extract(),
                'harga': product.css('div[class=product-price]::attr(data-reduced-price)').extract()
            }

        # next page
        next_page_url = response.css("div.pagination > a[class=next_page]::attr(href)").extract_first()
        last_url = "/c/perawatan-kecantikan/perawatan-wajah?page=100&search%5Bsort_by%5D=last_relist_at%3Adesc&utf8=%E2%9C%93"
        if next_page_url is not last_url:
            yield scrapy.Request(response.urljoin(next_page_url), dont_filter=True)
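Note that one bug is still hiding in the pagination above: `is not` compares object identity, not string contents, so `next_page_url is not last_url` is effectively always true. Worse, on the final page `next_page_url` is `None`, `urljoin` then falls back to the current URL, and with `dont_filter=True` the spider can re-request the last page indefinitely. A minimal sketch of a safer stopping condition, keeping the page-100 cutoff from the code above:

        # stop when there is no next link, or when the page-100 cutoff is reached
        next_page_url = response.css("div.pagination > a[class=next_page]::attr(href)").extract_first()
        last_url = "/c/perawatan-kecantikan/perawatan-wajah?page=100&search%5Bsort_by%5D=last_relist_at%3Adesc&utf8=%E2%9C%93"
        if next_page_url is not None and next_page_url != last_url:  # != compares values, `is not` compares identity
            yield scrapy.Request(response.urljoin(next_page_url), dont_filter=True)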
Thanks
Answer 0 (score: 1)
Your product xpath is a bit unreliable. Try selecting the product articles directly instead; the website makes this easy to do with css selectors:
products = response.css('article.product-display')
for product in products:
    yield {
        'judul': product.css('a[class="product__name line-clamp--2 js-tracker-product-link"]::text').extract(),
        'penjual': product.css('h5[class=user__name] a::attr(href)').extract(),
        'link': product.css('a[class="product__name line-clamp--2 js-tracker-product-link"]::attr(href)').extract(),
        'kota': product.css('div[class=user-city] a::text').extract(),
        'harga': product.css('div[class=product-price]::attr(data-reduced-price)').extract()
    }
You can debug the response by inserting inspect_response:
def parse(self, response):
    products = response.css('article.product-display')
    if not products:
        from scrapy.shell import inspect_response
        inspect_response(response, self)
        # will open up python shell here where you can check `response` object
        # try `view(response)` to open it up in your browser and such.
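If the shell shows a captcha or an error page instead of product markup, the missing results are more likely throttling than a selector problem. A minimal sketch of politeness settings that may help, with assumed values that are not tuned to this particular site:

    # settings.py - slow the crawl so the server is less likely to cut it off
    DOWNLOAD_DELAY = 1             # assumed value; seconds between requests to the same domain
    AUTOTHROTTLE_ENABLED = True    # let Scrapy adapt the delay to observed latencies
    RETRY_ENABLED = True
    RETRY_HTTP_CODES = [429, 503]  # retry the codes a throttling server typically returns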