Question

I want to extract JSON data from a web page, so I inspected its source. The data I need is stored in the following format:

<script type="application/ld+json">
    {
     'data I want to extract'
    }
    </script>

I tried using:

import scrapy
import json

class OpenriceSpider(scrapy.Spider):
    name = 'openrice'
    allowed_domains = ['www.openrice.com']

    def start_requests(self):
        headers = {
            'accept-encoding': 'gzip, deflate, sdch, br',
            'accept-language': 'en-US,en;q=0.8,zh-CN;q=0.6,zh;q=0.4',
            'upgrade-insecure-requests': '1',
            'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36     (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
            'accept':     'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
            'cache-control': 'max-age=0',
        }
        url = 'https://www.openrice.com/en/hongkong/r-kitchen-one-cafe-sha-tin-western-r483821'
        yield scrapy.Request(url=url, headers=headers, callback=self.parse)

    def parse(self, response):  # response = request url ?
        items = []
        jsonresponse = json.loads(response)

But it doesn't work. What should I change?

Answer 1

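json.loads() expects a JSON string, whereas response is the Response object for the whole HTML page. You need to locate that script element in the HTML source, extract its text, and only then load it with json.loads():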

script = response.xpath("//script[@type='application/ld+json']/text()").extract_first()
json_data = json.loads(script)
print(json_data)

Here I located the script by its less common application/ld+json type, but there are many other options, for example locating the script by some text you know it contains:

//script[contains(., 'Restaurant')]
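One thing to watch with the json.loads() step: extract_first() returns None when the XPath matches nothing, and json.loads() raises on None or on malformed text. A minimal sketch of a guard is shown here; the helper name load_ld_json and the sample string are hypothetical and only stand in for the text the XPath would return.

```python
import json


def load_ld_json(script_text):
    """Parse the text of an ld+json <script> element.

    Returns the decoded object, or None if the text is missing or invalid.
    (Hypothetical helper, not part of Scrapy.)
    """
    # extract_first() yields None when no element matched the XPath
    if script_text is None:
        return None
    try:
        return json.loads(script_text.strip())
    except ValueError:
        # Invalid JSON; in Python 3, json.JSONDecodeError subclasses ValueError
        return None


# Made-up stand-in for the string extracted from the page
sample = '\n    {"@type": "Restaurant", "name": "Example Cafe"}\n    '
data = load_ld_json(sample)
print(data["name"])
```

Inside parse(), this would be called as load_ld_json(script), with script being the value extracted by the XPath above.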
