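I have a Scrapy script, but it doesn't scrape any data and I don't know why

Time: 2019-07-24 06:44:06

Tags: web-scraping scrapy

I ran the script and got nothing back, even though the page at that URL does contain data.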

# -*- coding: utf-8 -*-
import scrapy
from scrapy.selector import Selector

class GetSpider(scrapy.Spider):
    name = 'gets'
    start_urls = ['https://www.retailmenot.com/coupons/insurance?u=ZTF65B5PJZEU3JDF326WY2SXOQ']

    def parse(self, response):
        s = Selector(response)
        code = s.xpath("//button[contains(@class,'CopyCode')][1]/text()").get()

        yield {'code':code}

I expected 52, but got none.

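1 answer:

Answer 0 (score: 0)

The easiest way is probably to load the JSON embedded in the page's JavaScript as a Python dictionary and navigate through it to get the codes. The following code should get you started: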

import scrapy
import json
import logging


class GetSpider(scrapy.Spider):
    name = 'gets'
    start_urls = ['https://www.retailmenot.com/coupons/insurance?u=ZTF65B5PJZEU3JDF326WY2SXOQ']
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36',
    }
    custom_settings = {'ROBOTSTXT_OBEY': False}

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url,
                                 callback=self.parse,
                                 headers=self.headers,
                                 dont_filter=True)

    def parse(self, response):
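        script = response.xpath(
            '//script[contains(text(), "__NEXT_DATA__")]/text()'
        ).extract_first()
        if script is None:
            # The embedded JSON was not in the response (e.g. the request was blocked).
            logging.warning("__NEXT_DATA__ script not found")
            return
        # Slice out the object literal assigned to __NEXT_DATA__.
        dict_start_index = script.index('{')
        dict_end_index = script.index('};') + 1
        data = json.loads(script[dict_start_index:dict_end_index])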
        coupon_data = data['props']['pageProps']['serverState']['apollo']['data']
        for key, value in coupon_data.items():
            try:
                code = value['code']
            except KeyError:
                logging.debug("no code found")
            else:
                yield {'code': code}
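
To experiment with the parsing step without running a spider, the same idea can be tried on a plain string. The following is only a sketch: the helper names `extract_next_data` and `collect_codes` and the sample payload are made up for illustration, and only the `props` / `pageProps` / `serverState` / `apollo` / `data` path is taken from the spider above. It also swaps the `index('};')` slicing for the standard library's `json.JSONDecoder.raw_decode`, which stops at the end of the first complete JSON value and so does not depend on where `};` happens to appear.

```python
import json


def extract_next_data(script_text):
    """Parse the first JSON object found in an inline script's text."""
    start = script_text.index('{')
    # raw_decode returns (object, end_index) and ignores whatever
    # JavaScript follows the object.
    data, _end = json.JSONDecoder().raw_decode(script_text, start)
    return data


def collect_codes(entries):
    """Return the 'code' of every dict entry that has a non-empty one."""
    return [
        value['code']
        for value in entries.values()
        if isinstance(value, dict) and value.get('code')
    ]


# Hypothetical payload, shaped like the path used in the spider above.
sample_payload = {'props': {'pageProps': {'serverState': {'apollo': {'data': {
    'Offer:1': {'code': 'SAVE10'},
    'Offer:2': {'title': 'no code here'},
    'Offer:3': {'code': None},
}}}}}}
sample_script = (
    '__NEXT_DATA__ = ' + json.dumps(sample_payload)
    + ';__NEXT_LOADED_PAGES__=[];'
)

data = extract_next_data(sample_script)
entries = data['props']['pageProps']['serverState']['apollo']['data']
codes = collect_codes(entries)
print(codes)
```

Inside the spider, `extract_next_data(script)` could stand in for the three slicing lines in `parse`, assuming the script text still starts the object with the first `{`.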