I'm having a problem running my spider. When I crawl, it reports an error like this: "HTTP status code is not handled".
2018-08-27 14:30:58 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://id.priceprice.com/harga-hp/%3Ca%20href=%22/harga-hp/?maker=OPPO-182%22%3EOPPO%3C/a%3E> (referer: http://id.priceprice.com/harga-hp/)
2018-08-27 14:30:58 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <404 http://id.priceprice.com/harga-hp/%3Ca%20href=%22/harga-hp/?maker=OPPO-182%22%3EOPPO%3C/a%3E>: HTTP status code is not handled or not allowed
2018-08-27 14:30:58 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://id.priceprice.com/harga-hp/%3Ca%20href=%22/harga-hp/?maker=Smartfren-178%22%3ESmartfren%3C/a%3E> (referer: http://id.priceprice.com/harga-hp/)
2018-08-27 14:30:58 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://id.priceprice.com/harga-hp/%3Ca%20href=%22/harga-hp/?maker=Evercoss-184%22%3EEvercoss%20Evercoss%3C/a%3E> (referer: http://id.priceprice.com/harga-hp/)
2018-08-27 14:30:58 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://id.priceprice.com/harga-hp/%3Ca%20href=%22/harga-hp/?maker=Nokia-109%22%3ENokia%3C/a%3E> (referer: http://id.priceprice.com/harga-hp/)
2018-08-27 14:30:58 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://id.priceprice.com/harga-hp/%3Ca%20href=%22/harga-hp/?maker=HUAWEI-69%22%3EHUAWEI%3C/a%3E> (referer: http://id.priceprice.com/harga-hp/)
2018-08-27 14:30:58 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <404 http://id.priceprice.com/harga-hp/%3Ca%20href=%22/harga-hp/?maker=Smartfren-178%22%3ESmartfren%3C/a%3E>: HTTP status code is not handled or not allowed
2018-08-27 14:30:58 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <404 http://id.priceprice.com/harga-hp/%3Ca%20href=%22/harga-hp/?maker=Evercoss-184%22%3EEvercoss%20Evercoss%3C/a%3E>: HTTP status code is not handled or not allowed
Following another answer, I edited settings.py and added this line:
user_agent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"
But it still doesn't work.
Here is my code:
import scrapy
from handset.items import HandsetItem
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule, CrawlSpider


class HandsetpriceSpider(scrapy.Spider):
    name = 'price'
    allowed_domains = ['id.priceprice.com']
    start_urls = ['http://id.priceprice.com/harga-hp/']

    def parse(self, response):
        rules = (
            Rule(LinkExtractor(allow='div.listCont:nth-child(2) > ul:nth-child(1)'), callback='parse_details'),
            Rule(LinkExtractor(restrict_css='ul > li > a[href*="maker"]'), follow=True)
        )
        for url in response.xpath('//ul[1]//li/a[contains(@href, "maker")]').extract():
            url = response.urljoin(url)
            yield scrapy.Request(url, callback=self.parse_details)
        next_page_url = response.css('li.last > a::attr(href)').extract_first()
        if next_page_url:
            next_page_url = response.urljoin(next_page_url)
            yield scrapy.Request(url=next_page_url, callback=self.parse)

    def parse_details(self, response):
        yield {
            'Name': response.css('div.itmName h3:nth-child(1) > a:nth-child(1) ::text').extract_first(),
            'Price': response.css('div.itmPrice > a.price ::text').extract_first(),
        }
Answer 0 (score: 0)
Your selector returns much more than the URL:
scrapy shell http://id.priceprice.com/harga-hp/
In [3]: response.xpath('//ul[1]//li/a[contains(@href, "maker")]').extract()
Out[3]:
['<a href="/harga-hp/?maker=OPPO-182">OPPO</a>',
'<a href="/harga-hp/?maker=Vivo-466">Vivo</a>',
'<a href="/harga-hp/?maker=Vivo-466">Vivo</a>',
....
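So each extracted "link" is the whole <a href=...> element, including the tag and the link text. Extract only the href attribute instead: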
In [4]: response.xpath('//ul[1]//li/a[contains(@href, "maker")]').css('a::attr(href)').extract()
Out[4]:
['/harga-hp/?maker=OPPO-182',
'/harga-hp/?maker=Vivo-466',
'/harga-hp/?maker=Vivo-466',
Use this attribute selector in your spider and you will get:
2018-08-27 04:53:03 [scrapy.core.scraper] DEBUG: Scraped from <200 http://id.priceprice.com/harga-hp/?maker=Meizu-95>
{'Name': 'Meizu M6', 'Price': '\nRp 1.150.000\n - '}
{'Name': 'Infinix HOT 6 Pro', 'Price': '\nRp 1.599.000\n - '}
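To see why the original requests came back as 404, here is a small standard-library sketch (not Scrapy itself) that joins the base URL with both kinds of extracted value. `urllib.parse.urljoin` stands in for `response.urljoin`, and the `quote` call with its `safe` set is only an assumed approximation of the escaping Scrapy applies before sending a request; the names `element_html`, `href_only`, `bad_url`, `good_url` and `escaped` are made up for the illustration.

```python
from urllib.parse import quote, urljoin

base = 'http://id.priceprice.com/harga-hp/'

# What .extract() returns for the <a> element itself (the original spider).
element_html = '<a href="/harga-hp/?maker=OPPO-182">OPPO</a>'
# What .css('a::attr(href)').extract() returns (the fixed selector).
href_only = '/harga-hp/?maker=OPPO-182'

# Joining the whole element treats the markup as a relative path.
bad_url = urljoin(base, element_html)
# Joining only the href yields the real listing URL.
good_url = urljoin(base, href_only)

# Assumption: approximate the request-time escaping with quote(),
# leaving URL delimiters untouched.
escaped = quote(bad_url, safe=':/?=')

print(bad_url)
print(escaped)
print(good_url)
```

The escaped form of the first URL has the same shape as the URLs in the 404 log lines of the question, which suggests the server was simply being asked for pages that do not exist, so changing the user agent could not help.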