I am new to web scraping and have only followed this article so far. It was easy to understand.
https://www.digitalocean.com/community/tutorials/how-to-crawl-a-web-page-with-scrapy-and-python-3
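I have one target website. The goal is to get the list of product prices and names under the ais-Hits-list class.
For example: price (259) and name (XT7 women's trail running shoes dark blue and pink).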
<header id="header">
  <div id="search-suggestions-algolia" style="display:none">
    <div>
      <div class="ais-Hits">
        <ol class="ais-Hits-list">
          <li class="ais-Hits-item"></li>
          <li class="ais-Hits-item"></li>
          <li class="ais-Hits-item"></li>
          <li class="ais-Hits-item"></li>
        </ol>
      </div>
    </div>
  </div>
</header>
<section id="wrapper">
  <div id="#content" class="site-content shop-grid">
    <div id="">
      <div id="js-product-list">
        <div class="products product_content" id="hits">
          <div>
            <div class="ais-Hits">
              <ol class="ais-Hits-list">
                <li class="ais-Hits-item">...</li>
                <li class="ais-Hits-item">...</li>
                <li class="ais-Hits-item">
                  <div class="price-button">
                    <a class="algolia_link" href="/p/8552163_xt7-women-s-trail-running-shoes-dark-blue-and-pink.html">
                      <div class="product_price">
                        <button class="is-skewed" itemprop="price">259</button>
                      </div>
                      <div class="product_name">
                        <h4 class="title-single" title="XT7 women's trail running shoes dark blue and pink"></h4>
                      </div>
                    </a>
                    <a class="product-flags"></a>
                  </div>
                </li>
                <li class="ais-Hits-item">...</li>
              </ol>
            </div>
          </div>
        </div>
      </div>
    </div>
  </div>
</section>
My code is:
# scrapy runspider scraper.py
import scrapy
import json


class xxxx(scrapy.Spider):
    name = 'xxxx_spider'
    start_urls = ['https://www.xxxx.co.id/8548777-running-shoes']
    allowed_domains = ['xxxx.co.id']

    def parse(self, response):
        PRODUCT_SELECTOR = '#js-product-list .product_content .ais-Hits .ais-Hits-list .ais-Hits-item'
        for item in response.css(PRODUCT_SELECTOR):
            NAME_SELECTOR = 'h4 ::attr(title)'
            PRICE_SELECTOR = 'button ::text'
            yield {
                'name': item.css(NAME_SELECTOR).extract_first(),
                'price': item.css(PRICE_SELECTOR).extract_first()
            }
But it never returns anything. Am I missing something?
2019-12-11 18:16:39 [scrapy.core.engine] INFO: Spider opened
2019-12-11 18:16:39 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-12-11 18:16:39 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-12-11 18:16:42 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.xxxx.co.id/8548777-running-shoes> (referer: None)
[]
2019-12-11 18:16:42 [scrapy.core.engine] INFO: Closing spider (finished)
2019-12-11 18:16:42 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 236,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 29492,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'elapsed_time_seconds': 2.876863,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2019, 12, 11, 10, 16, 42, 734103),
'log_count/DEBUG': 1,
'log_count/INFO': 10,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2019, 12, 11, 10, 16, 39, 857240)}
2019-12-11 18:16:42 [scrapy.core.engine] INFO: Spider closed (finished)
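The site uses the ais-Hits-item class in two places (the header and the section).
The reason I did not use PRODUCT_SELECTOR = '.ais-Hits-item' is that it would match the ais-Hits-item elements in the header rather than the ones in the section.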
Answer 0 (score: 0)
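The problem is that this page is rendered by JavaScript, so the product list is not in the HTML that Scrapy downloads. In the Network tab of your browser's developer tools you can see the page making a POST request to retrieve the data.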
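One way to confirm this diagnosis is to count how many elements with the product class appear in the HTML that was actually downloaded (in a spider, response.text). The sketch below uses only the standard library; the two HTML strings are made-up stand-ins, not the real page.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count start tags whose class attribute contains a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def count_class(html, class_name):
    """Return how many elements in html carry class_name."""
    counter = ClassCounter(class_name)
    counter.feed(html)
    counter.close()
    return counter.count


# Hypothetical stand-in for response.text: the markup before any script runs.
RAW_HTML = (
    '<div id="js-product-list">'
    '<div class="products product_content" id="hits"></div>'
    '</div>'
)
# Hypothetical stand-in for the DOM after the browser has run the scripts.
RENDERED_HTML = (
    '<div id="hits"><ol class="ais-Hits-list">'
    '<li class="ais-Hits-item"></li><li class="ais-Hits-item"></li>'
    '</ol></div>'
)

print(count_class(RAW_HTML, "ais-Hits-item"))       # 0
print(count_class(RENDERED_HTML, "ais-Hits-item"))  # 2
```

If the count for the downloaded HTML is zero while the browser's inspector shows the items, the selector is not the problem; the data arrives later through a script.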
You can send that same POST request from Scrapy to retrieve the JSON response, and then parse it.
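A rough sketch of the two halves of the POST approach, building the request body and parsing the JSON that comes back. The payload keys, the top-level "hits" list, and the "name" and "price" fields are all assumptions for illustration; the real ones have to be copied from the request and response shown in the browser's Network tab.

```python
import json


def build_search_payload(category_id, page=0):
    """Build a JSON body for a hypothetical search endpoint."""
    return json.dumps({"category": category_id, "page": page})


def parse_hits(response_text):
    """Yield name/price dicts from JSON with an assumed top-level 'hits' list."""
    data = json.loads(response_text)
    for hit in data.get("hits", []):
        yield {"name": hit.get("name"), "price": hit.get("price")}


# In a spider these could be wired together roughly like this
# (api_url and parse_api are hypothetical names):
#   yield scrapy.Request(api_url, method="POST",
#                        body=build_search_payload("8548777"),
#                        headers={"Content-Type": "application/json"},
#                        callback=self.parse_api)
# with parse_api doing: yield from parse_hits(response.text)

# Made-up response body, shaped after the example in the question.
SAMPLE_RESPONSE = (
    '{"hits": [{"name": "XT7 women\'s trail running shoes dark blue and pink",'
    ' "price": 259}]}'
)
items = list(parse_hits(SAMPLE_RESPONSE))
print(items)
```

Keeping the parsing in a plain function makes it easy to try against a response body saved from the browser before running the spider.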
Another option, since the page also contains a JavaScript variable with the product data:
load the JavaScript variable var thispageproduct, which is present on the page, into a Python dictionary and then parse it.
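A minimal sketch of the JavaScript-variable option, assuming the value assigned to thispageproduct is valid JSON (a JavaScript object literal with unquoted keys or single quotes would need a different parser). The sample script and its "products" structure are invented for illustration; the real shape has to be checked in the page source.

```python
import json
import re


def extract_js_variable(html, name):
    """Return the JSON value assigned to `var <name> = ...` in html, or None."""
    match = re.search(r"var\s+" + re.escape(name) + r"\s*=\s*", html)
    if match is None:
        return None
    # raw_decode parses exactly one JSON value starting at the given index
    # and ignores what follows it (the trailing ';' and the rest of the page).
    value, _end = json.JSONDecoder().raw_decode(html, match.end())
    return value


# Hypothetical page fragment; in a spider this would be response.text.
SAMPLE_HTML = """
<script>
var thispageproduct = {"products": [{"name": "XT7 women's trail running shoes dark blue and pink", "price": 259}]};
</script>
"""

data = extract_js_variable(SAMPLE_HTML, "thispageproduct")
for product in data["products"]:
    print(product["name"], product["price"])
```

Inside parse(), the same function could be called on response.text and the resulting dictionary walked to yield one item per product. If the value is not valid JSON, raw_decode raises json.JSONDecodeError.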