Question

I'm having a problem running my spider. When I start the crawl, it reports this error: "HTTP status code is not handled or not allowed".

2018-08-27 14:30:58 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://id.priceprice.com/harga-hp/%3Ca%20href=%22/harga-hp/?maker=OPPO-182%22%3EOPPO%3C/a%3E> (referer: http://id.priceprice.com/harga-hp/)
2018-08-27 14:30:58 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <404 http://id.priceprice.com/harga-hp/%3Ca%20href=%22/harga-hp/?maker=OPPO-182%22%3EOPPO%3C/a%3E>: HTTP status code is not handled or not allowed
2018-08-27 14:30:58 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://id.priceprice.com/harga-hp/%3Ca%20href=%22/harga-hp/?maker=Smartfren-178%22%3ESmartfren%3C/a%3E> (referer: http://id.priceprice.com/harga-hp/)
2018-08-27 14:30:58 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://id.priceprice.com/harga-hp/%3Ca%20href=%22/harga-hp/?maker=Evercoss-184%22%3EEvercoss%20Evercoss%3C/a%3E> (referer: http://id.priceprice.com/harga-hp/)
2018-08-27 14:30:58 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://id.priceprice.com/harga-hp/%3Ca%20href=%22/harga-hp/?maker=Nokia-109%22%3ENokia%3C/a%3E> (referer: http://id.priceprice.com/harga-hp/)
2018-08-27 14:30:58 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://id.priceprice.com/harga-hp/%3Ca%20href=%22/harga-hp/?maker=HUAWEI-69%22%3EHUAWEI%3C/a%3E> (referer: http://id.priceprice.com/harga-hp/)
2018-08-27 14:30:58 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <404 http://id.priceprice.com/harga-hp/%3Ca%20href=%22/harga-hp/?maker=Smartfren-178%22%3ESmartfren%3C/a%3E>: HTTP status code is not handled or not allowed
2018-08-27 14:30:58 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <404 http://id.priceprice.com/harga-hp/%3Ca%20href=%22/harga-hp/?maker=Evercoss-184%22%3EEvercoss%20Evercoss%3C/a%3E>: HTTP status code is not handled or not allowed

Following the advice in another answer, I edited settings.py and added this line:

user_agent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"

But it still doesn't work.

Here is my code:

import scrapy
from handset.items import HandsetItem
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule, CrawlSpider


class HandsetpriceSpider(scrapy.Spider):
    name = 'price'
    allowed_domains = ['id.priceprice.com']
    start_urls = ['http://id.priceprice.com/harga-hp/']

    def parse(self, response):

        rules = (
                Rule(LinkExtractor(allow='div.listCont:nth-child(2) > ul:nth-child(1)'), callback='parse_details'),
                Rule(LinkExtractor(restrict_css='ul > li > a[href*="maker"]'), follow =True)                
               )
        for url in  response.xpath('//ul[1]//li/a[contains(@href, "maker")]').extract() :
            url = response.urljoin(url)
            yield scrapy.Request(url, callback = self.parse_details)

        next_page_url = response.css('li.last > a::attr(href)').extract_first()
        if next_page_url:
           next_page_url = response.urljoin(next_page_url)
           yield scrapy.Request(url=next_page_url, callback=self.parse)

    def parse_details(self, response):
        yield {
       'Name' : response.css('div.itmName h3:nth-child(1) > a:nth-child(1) ::text').extract_first(),
       'Price' : response.css('div.itmPrice > a.price ::text').extract_first(),
        }

Answer 1

Your selector extracts far more than the URL:

scrapy shell http://id.priceprice.com/harga-hp/

In [3]: response.xpath('//ul[1]//li/a[contains(@href, "maker")]').extract()
Out[3]: 
['<a href="/harga-hp/?maker=OPPO-182">OPPO</a>',
 '<a href="/harga-hp/?maker=Vivo-466">Vivo</a>',
 '<a href="/harga-hp/?maker=Vivo-466">Vivo</a>',
....

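So each "link" you pass to response.urljoin() is the whole serialized <a> element, including the href attribute markup and the link text, which is why the requested URLs contain %3Ca%20href=... and return 404. Select only the href value instead: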

In [4]: response.xpath('//ul[1]//li/a[contains(@href, "maker")]').css('a::attr(href)').extract()
Out[4]: 
['/harga-hp/?maker=OPPO-182',
 '/harga-hp/?maker=Vivo-466',
 '/harga-hp/?maker=Vivo-466',
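
To see why the original selector produces the 404 URLs in the log, here is a minimal standard-library sketch. It assumes that response.urljoin() resolves these inputs the same way urllib.parse.urljoin does, and it uses quote() only as an approximation of the escaping Scrapy applies before sending the request. The variable names are illustrative.

```python
from urllib.parse import quote, urljoin

BASE = "http://id.priceprice.com/harga-hp/"

# What .extract() returns for the original XPath: the serialized <a> element.
element_html = '<a href="/harga-hp/?maker=OPPO-182">OPPO</a>'

# What the href-only selector returns: just the attribute value.
href = "/harga-hp/?maker=OPPO-182"

# Joining the whole element treats the markup as a relative path.
bad_url = urljoin(BASE, element_html)

# Joining the href value yields the intended listing URL.
good_url = urljoin(BASE, href)

# Approximation (assumption) of the percent-escaping applied to the request URL.
bad_url_escaped = quote(bad_url, safe=":/?=")

print(bad_url_escaped)
print(good_url)
```

The escaped "bad" URL has the same shape as the ones in the question's log, while the href-only value resolves to a normal ?maker=... page.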

Use the href-only selector from In [4] in your spider and you will get:

2018-08-27 04:53:03 [scrapy.core.scraper] DEBUG: Scraped from <200 http://id.priceprice.com/harga-hp/?maker=Meizu-95>
{'Name': 'Meizu M6', 'Price': '\nRp 1.150.000\n - '}


{'Name': 'Infinix HOT 6 Pro', 'Price': '\nRp 1.599.000\n - '}
