How to get values under <ol> <li> with a Scrapy spider in Python

Time: 2019-12-11 10:32:26

Tags: python web-scraping scrapy

I am new to web scraping and have just been following this article. It is easy to understand.

https://www.digitalocean.com/community/tutorials/how-to-crawl-a-web-page-with-scrapy-and-python-3

I have one target website. The goal is to get the list of product prices and names under the ais-Hits-list class.

For example -> price (259) and name (XT7 women's trail running shoes dark blue and pink).

<header id="header">
    <div id="search-suggestions-algolia" style="display:none">
        <div>
            <div class="ais-Hits">
                <ol class="ais-Hits-list">
                    <li class="ais-Hits-item"></li>
                    <li class="ais-Hits-item"></li>
                    <li class="ais-Hits-item"></li>
                    <li class="ais-Hits-item"></li>
                </ol>
            </div>
        </div>
    </div>
</header>
<section id="wrapper">
    <div id="#content" class="site-content shop-grid">
        <div id="">
            <div id="js-product-list">
                <div class="products product_content" id="hits">
                    <div>
                        <div class="ais-Hits">
                            <ol class="ais-Hits-list">
                                <li class="ais-Hits-item">...</li>
                                <li class="ais-Hits-item">...</li>
                                <li class="ais-Hits-item">
                                    <div class="price-button">
                                        <a class="algolia_link" href="/p/8552163_xt7-women-s-trail-running-shoes-dark-blue-and-pink.html">
                                            <div class="product_price">
                                                <button class="is-skewed" itemprop="price">259</button>
                                            </div>
                                            <div class="product_name">
                                                <h4 class="title-single" title="XT7 women's trail running shoes dark blue and pink"></h4>
                                            </div>
                                        </a>
                                        <a class="product-flags"></a>
                                    </div>
                                </li>
                                <li class="ais-Hits-item">...</li>
                            </ol>
                        </div>
                    </div>
                </div>
            </div>
        </div>
    </div>
</section>

My code is:

# scrapy runspider scraper.py
import scrapy
import json

class xxxx(scrapy.Spider):
    name = 'xxxx_spider'
    start_urls = ['https://www.xxxx.co.id/8548777-running-shoes']
    allowed_domains = ['xxxx.co.id']


    def parse(self, response):

        PRODUCT_SELECTOR = '#js-product-list .product_content .ais-Hits .ais-Hits-list .ais-Hits-item'

        for item in response.css(PRODUCT_SELECTOR):
            NAME_SELECTOR = 'h4 ::attr(title)'
            PRICE_SELECTOR = 'button ::text'

            yield {
                'name': item.css(NAME_SELECTOR).extract_first(),
                'price': item.css(PRICE_SELECTOR).extract_first()
            }

But it always returns nothing. Am I missing something?

2019-12-11 18:16:39 [scrapy.core.engine] INFO: Spider opened
2019-12-11 18:16:39 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-12-11 18:16:39 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-12-11 18:16:42 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.xxxx.co.id/8548777-running-shoes> (referer: None)
[]
2019-12-11 18:16:42 [scrapy.core.engine] INFO: Closing spider (finished)
2019-12-11 18:16:42 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 236,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 29492,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'elapsed_time_seconds': 2.876863,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2019, 12, 11, 10, 16, 42, 734103),
 'log_count/DEBUG': 1,
 'log_count/INFO': 10,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2019, 12, 11, 10, 16, 39, 857240)}
2019-12-11 18:16:42 [scrapy.core.engine] INFO: Spider closed (finished)

The website has the ais-Hits-item class in two places (the header and the section).

The reason I did not just use PRODUCT_SELECTOR = '.ais-Hits-item' is that it would match the ais-Hits-item elements in the header first, not the ones in the section.

1 answer:

Answer 0: (score: 0)

The problem is that this page is rendered by JavaScript. You can see in your browser's Network tab that it makes a POST request to retrieve the data.
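One way to confirm this is to check whether the items exist in the raw HTML that Scrapy downloads, for example from the Scrapy shell (the empty result below matches the [] printed in the question's log):

scrapy shell 'https://www.xxxx.co.id/8548777-running-shoes'
>>> response.css('#js-product-list .ais-Hits .ais-Hits-list .ais-Hits-item')
[]  # nothing matches in the downloaded HTML; the <li> items are injected by JavaScript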

You can send that same POST request from Scrapy to retrieve the JSON response and then parse it.
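A minimal sketch of that approach follows. The API URL, payload, and field names below are placeholders (the ais-* class names suggest the site uses Algolia, whose search endpoint is usually a POST to .../1/indexes/*/queries); copy the exact URL, headers, and body from the request shown in the Network tab.

# A sketch only -- replace the placeholder URL, payload and field names with
# the real values taken from the browser's Network tab.
import json
import scrapy


class ProductApiSpider(scrapy.Spider):
    name = 'product_api_spider'

    def start_requests(self):
        api_url = 'https://EXAMPLE-dsn.algolia.net/1/indexes/*/queries'  # placeholder
        payload = {'requests': [{'indexName': 'products', 'params': 'query='}]}  # placeholder
        yield scrapy.Request(
            api_url,
            method='POST',
            body=json.dumps(payload),
            headers={'Content-Type': 'application/json'},  # plus any API keys the site sends
            callback=self.parse_api,
        )

    def parse_api(self, response):
        data = json.loads(response.text)
        # Algolia-style responses usually keep the items under results[0]['hits'];
        # adjust this to whatever the real JSON response looks like.
        for hit in data.get('results', [{}])[0].get('hits', []):
            yield {
                'name': hit.get('name'),
                'price': hit.get('price'),
            }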

Another option, since the page also contains a JavaScript variable with the product data: load the thispageproduct variable that exists on the page into a Python dictionary and then parse it. In the page source it appears as:

var thispageproduct
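
A minimal sketch of this second option, assuming thispageproduct holds a JSON array of product objects with name and price fields (check the actual page source; the exact format and field names are assumptions):

import json
import re

import scrapy


class ProductJsVarSpider(scrapy.Spider):
    name = 'product_jsvar_spider'
    start_urls = ['https://www.xxxx.co.id/8548777-running-shoes']
    allowed_domains = ['xxxx.co.id']

    def parse(self, response):
        # Grab whatever is assigned to "var thispageproduct = ...;" in the raw HTML.
        match = re.search(r'var\s+thispageproduct\s*=\s*(.*?);', response.text, re.DOTALL)
        if not match:
            return
        # Works only if the assigned value is plain JSON (no JavaScript expressions).
        products = json.loads(match.group(1))
        for product in products:
            yield {
                'name': product.get('name'),
                'price': product.get('price'),
            }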