I want to scrape restaurant data from https://wolt.com/ru/kaz/almaty, visiting each restaurant page by its URL (for example https://wolt.com/ru/kaz/almaty/restaurant/la-pizza-2). Here is my code:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from satu.items import QuoteItem
from scrapy.linkextractors import LinkExtractor


class QuotesSpiderSpider(CrawlSpider):
    name = 'wolt'
    allowed_domains = ['wolt.com']
    start_urls = ['https://wolt.com/ru/kaz/almaty/restaurant/']
    handle_httpstatus_list = [404, 302]
    rules = (
        Rule(LinkExtractor(allow=('/restaurant/')), callback='parse_item'))

    def parse_item(self, response):
        try:
            title = response.xpath(
                ".//div[@class='VenueHeroBanner__container___1_lK2']/h1[@class='VenueHeroBanner__title___2EzpN']//text()").get()
        except:
            title = ['']
        try:
            time = response.xpath(
                ".//div[@class='VenueSide__infoLine___jrSHX']/div[@class='VenueSide__hours___122Zm']//text()").get()
        except:
            time = ['']
        item = QuoteItem()
        item["title"] = title
        item["time"] = time
        yield item
However, it doesn't scrape any data, and I can't figure out where the problem is. The output looks like this:
2020-05-22 00:13:40 [scrapy.utils.log] INFO: Scrapy 2.0.1 started (bot: satu)
2020-05-22 00:13:40 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 20.3.0, Python 3.6.0 (v3.6.0:41df79263a11, Dec 23 2016, 08:06:12) [MSC v.1900 64 bit (AMD64)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g 21 Apr 2020), cryptography 2.9.1, Platform Windows-10-10.0.18362-SP0
2020-05-22 00:13:40 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2020-05-22 00:13:40 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'satu',
'CONCURRENT_REQUESTS': 32,
'COOKIES_ENABLED': False,
'DOWNLOAD_DELAY': 3,
'HTTPCACHE_IGNORE_HTTP_CODES': [301, 302],
'NEWSPIDER_MODULE': 'satu.spiders',
'REDIRECT_ENABLED': False,
'RETRY_HTTP_CODES': [500, 503, 504, 400, 403, 404, 408, 429],
'RETRY_TIMES': 1,
'SPIDER_MODULES': ['satu.spiders'],
'USER_AGENT': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64) AppleWebKit/537.36 '
'(KHTML, like Gecko) Chrome/55.0.2919.83 Safari/537.36'}
2020-05-22 00:13:40 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-05-22 00:13:40 [scrapy.core.engine] INFO: Spider opened
2020-05-22 00:13:40 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-05-22 00:13:40 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-05-22 00:13:41 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://wolt.com/ru/kaz/almaty/restaurant/> (failed 1 times): 404 Not Found
2020-05-22 00:13:44 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET https://wolt.com/ru/kaz/almaty/restaurant/> (failed 2 times): 404 Not Found
2020-05-22 00:13:44 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET https://wolt.com/ru/kaz/almaty/restaurant/> (failed 2 times): 404 Not Found
2020-05-22 00:13:44 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://wolt.com/ru/kaz/almaty/restaurant/> (referer: None)
2020-05-22 00:13:44 [scrapy.core.engine] INFO: Closing spider (finished)
2020-05-22 00:13:44 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 622,
...
'start_time': datetime.datetime(2020, 5, 21, 18, 13, 40, 862817)}
2020-05-22 00:13:44 [scrapy.core.engine] INFO: Spider closed (finished)
Answer 0 (score: 0)
The problem is that your spider starts from https://wolt.com/ru/kaz/almaty/restaurant/, which is a 404 (Not Found) page. You should change start_urls to pages that actually contain data, such as https://wolt.com/ru/kaz/almaty/restaurant/la-pizza-2. Also, you have not defined a parse method, which is the default callback when start_requests is not overridden. Finally, there is a bug in your time XPath: you missed a /div step.
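The /div point can be seen in isolation. Below is a minimal sketch using lxml (which Scrapy's selectors are built on): the class names are copied from the question's XPath, but the exact nesting is an assumption for illustration. Without the extra /div, the first text node `//text()` matches is often pure indentation, so .get() returns whitespace instead of the opening hours.

```python
from lxml import html

# Minimal markup mirroring the venue sidebar; class names come from the
# question's XPath, the nesting is an assumption for illustration.
snippet = """
<section>
  <div class="VenueSide__infoLine___jrSHX">
    <div class="VenueSide__hours___122Zm">
      <div>10:00-22:00</div>
    </div>
  </div>
</section>
"""

tree = html.fromstring(snippet)

# The question's selector: //text() also matches whitespace-only text nodes,
# so the first match can be pure indentation.
loose = tree.xpath(
    ".//div[@class='VenueSide__infoLine___jrSHX']"
    "/div[@class='VenueSide__hours___122Zm']//text()")

# With the extra /div step, the inner element's text is selected directly.
fixed = tree.xpath(
    ".//div[@class='VenueSide__infoLine___jrSHX']"
    "/div[@class='VenueSide__hours___122Zm']/div/text()")

print(repr(loose[0]))  # whitespace only
print(fixed[0])        # 10:00-22:00
```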
Try this:
import scrapy
from scrapy.spiders import CrawlSpider
from satu.items import QuoteItem


class QuotesSpiderSpider(CrawlSpider):
    name = 'wolt'
    allowed_domains = ['wolt.com']
    start_urls = ['https://wolt.com/ru/kaz/almaty/restaurant/la-pizza-2']

    def parse(self, response):
        title = response.xpath(".//div[@class='VenueHeroBanner__container___1_lK2']/h1[@class='VenueHeroBanner__title___2EzpN']/text()").get()
        time = response.xpath(".//div[@class='VenueSide__infoLine___jrSHX']/div[@class='VenueSide__hours___122Zm']/div/text()").get()
        yield QuoteItem(title=title, time=time)