Question

I am building this spider, and I am fairly sure the XPath is correct because I checked it in the Scrapy shell.

I am not sure what is going wrong. Please help.

Code:

import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'

    def start_requests(self):

        yield scrapy.Request('http://www.example.com/search?q=%s' % self.query,callback=self.parse)

    def parse(self,response):
        start_urls=[]
        for i in range(0,10):
            link=str(response.css("div.search_blocks a::attr(href)")[i].extract())
            start_urls.append(link)
        for url in start_urls:
            print(url)
            yield scrapy.Request(url=url, callback=self.parse_product_info)

    def parse_product_info(self, response):
        price=str(response.xpath("//div[@class='price']/span[@class='f_price']/text()").extract_first())
        title=str(response.xpath("//*[@class='prd_mid_info']/h1/text()").extract_first())
        product_rating=str(response.xpath("//div[@class='rr']//span[@itemprop='ratingValue']/text()").extract_first())
        if product_rating=='':
            product_rating='none'
        else:
            product_rating=product_rating[3:]
        product_rating_count=str(response.xpath("//div[@class='rr']//span[@itemprop='ratingCount']/text()").extract_first())
        item_specifics='none'
        seller_name=str(response.xpath("//div[@itemprop='seller']/h3/text()").extract_first())
        shipping_cost=str(response.xpath("//span[@id='shipcharge']/text()").extract_first())
        if shipping_cost=='':
            shipping_cost='none'
        seller_rating=str(response.xpath("//div[@itemprop='seller']//span[@class='val']/text()").extract_first())

        scraped_info = {

            'url' : url,
            'price' : price,
            'discount_price' : discount_price,
            'title' : title,
            'product_rating' : product_rating,
            'product_rating_count' : product_rating_count,
            'item_specifics' : item_specifics,
            'seller_name' : seller_name,
            'shipping_cost' : shipping_cost,
            'seller_rating' : seller_rating,
        }

        yield scraped_info

P.S.: The indentation after 'def parse_product_info(self, response):' may look wrong on StackOverflow, but it looks fine in my IDE.

I ran the following command to start the crawl:

scrapy crawl <spider_name> -a query=toys

The error message is:

2017-12-01 09:53:27 [scrapy.core.engine] INFO: Spider opened
2017-12-01 09:53:27 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-12-01 09:53:27 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-12-01 09:53:28 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.example.com/robots.txt> (referer: None)
2017-12-01 09:53:29 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.example.com/search?q=toys> (referer: None)
2017-12-01 09:53:29 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.example.com/search?q=toys> (referer: None)
Traceback (most recent call last):
  File "/home/svbunndbest/.local/lib/python3.5/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/home/svbunndbest/.local/lib/python3.5/site-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
    for x in result:
  File "/home/svbunndbest/.local/lib/python3.5/site-packages/scrapy/spidermiddlewares/referer.py", line 339, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/home/svbunndbest/.local/lib/python3.5/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/home/svbunndbest/.local/lib/python3.5/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/home/svbunndbest/New Volume E/Zenshanti Internship/tuts/tuts/spiders/shop.py", line 13, in parse
    link=str(response.xpath("//*[@id='product_list']/div[3]/div[1]/a/@href")[i].extract())
  File "/home/svbunndbest/.local/lib/python3.5/site-packages/parsel/selector.py", line 61, in __getitem__
    o = super(SelectorList, self).__getitem__(pos)
IndexError: list index out of range
2017-12-01 09:53:29 [scrapy.core.engine] INFO: Closing spider (finished)
2017-12-01 09:53:29 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 453,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 33468,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 12, 1, 4, 23, 29, 322286),
'log_count/DEBUG': 3,
'log_count/ERROR': 1,
'log_count/INFO': 7,
'memusage/max': 52031488,
'memusage/startup': 52031488,
'response_received_count': 2,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'spider_exceptions/IndexError': 1,
'start_time': datetime.datetime(2017, 12, 1, 4, 23, 27, 836078)}
2017-12-01 09:53:29 [scrapy.core.engine] INFO: Spider closed (finished)

Answer 1

You are getting this error:

    ValueError: Missing scheme in request url: //www.example.com/something-else.html

Try using urljoin

and change your code to:

# Python 3 (the traceback shows python3.5); on Python 2 use: from urlparse import urljoin
from urllib.parse import urljoin
url = urljoin("http://", url)
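
To make the suggestion above concrete, here is a minimal sketch using only the standard library. The helper name `absolute_links`, its `limit` parameter, and the sample hrefs are made up for illustration; they are not from the original spider. It joins each href against the URL of the page it was found on, and it slices the list instead of indexing `[0]` to `[9]`, so a page with fewer than ten links does not raise the `IndexError` shown in the traceback.

```python
from urllib.parse import urljoin


def absolute_links(page_url, hrefs, limit=10):
    """Resolve up to `limit` hrefs against the URL of the page they came from.

    Hypothetical helper for illustration; not part of Scrapy.
    """
    links = []
    # Slicing never raises IndexError, unlike indexing [i] on a short list.
    for href in hrefs[:limit]:
        # urljoin takes the scheme (and the host, for relative paths) from page_url.
        links.append(urljoin(page_url, href))
    return links


# Assumed sample data, shaped like the URL in the error message above.
hrefs = ["//www.example.com/something-else.html", "/item/42"]
links = absolute_links("https://www.example.com/search?q=toys", hrefs)
print(links)
```

Inside a spider callback, `response.urljoin(href)` should give the same result without passing the page URL by hand, since it joins against `response.url`.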

Scrapy: unable to resolve error
