Extracting a JSON response in Scrapy

Date: 2017-08-18 10:51:01

Tags: python json api web-scraping scrapy

How can I use Scrapy to scrape an API that returns JSON? The JSON looks like this:

  "records": [
    {
      "uri": "https://www.example.com",
      "access": {
        "update": false
      },
      "id": 17059,
      "vid": 37614,
      "name": "MyLibery",
      "claim": null,
      "claimedBy": null,
      "authorUid": "3",
      "lifecycle": "L",
      "companyType": "S",
      "ugcState": 10,
      "companyLogo": {
        "fileName": "mylibery-logo.png",
        "filePath": "sites/default/files/imagecache/company_logo_70/mylibery-logo.png"
      }

I tried this code:

import scrapy
import json


class ApiItem(scrapy.Item):
    url = scrapy.Field()
    Name = scrapy.Field()


class ExampleSpider(scrapy.Spider):
    name = 'API'
    allowed_domains = ["site.com"]
    start_urls = [l.strip() for l in open('pages.txt').readlines()]

    def parse(self, response):
        filename = response.url.split("/")[-2]
        open(filename, 'wb').write(response.body)
        jsonresponse = json.loads(response.body_as_unicode())
        item = ApiItem()
        item["url"] = jsonresponse["uri"]
        item["Name"] = jsonresponse["name"]
        return item

"pages.txt" is a list of the API pages I want to crawl. I only want to extract "uri" and "name" and save them to a CSV file.

But it raises this error:

2017-08-18 13:23:02 [scrapy] ERROR: Spider error processing <GET https://www.investiere.ch/proxy/api2/v1/companies?extra%5Bimagecache%5D=company_logo_70&fields=companyType,lifecycle&page=8&parameters%5Binclude_skipped%5D=yes> (referer: None)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/twisted/internet/defer.py", line 651, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/home/habenn/Projects/inapi/inapi/spiders/example.py", line 22, in parse
    item["url"] = jsonresponse["uri"]
KeyError: 'uri'

1 answer:

Answer 0 (score: 1)

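Going by the example you gave, "uri" and "name" are nested inside the "records" list rather than at the top level, so it should look like this: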

item["url"] = jsonresponse["records"][0]["uri"]
item["Name"] = jsonresponse["records"][0]["name"]
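To see why the original lookup fails, here is a minimal sketch that runs outside Scrapy. The `sample_body` string is an assumption: a trimmed-down payload with the same shape as the JSON in the question, not the real API response.

```python
import json

# Assumed sample payload mirroring the shape shown in the question (fields trimmed).
sample_body = '''
{
  "records": [
    {"uri": "https://www.example.com", "id": 17059, "name": "MyLibery"}
  ]
}
'''

data = json.loads(sample_body)

# The top-level object only has a "records" key, so data["uri"] raises KeyError.
top_level_keys = list(data.keys())

# The fields live on each element of the "records" list.
first = data["records"][0]
url = first["uri"]
name = first["name"]
print(top_level_keys, url, name)
```

Indexing with `[0]` only returns the first record on the page, which is fine for a quick check but drops the rest.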

Edit

To get every uri and name from the response, use the following:

def parse(self, response):
    ...
    for record in jsonresponse["records"]:
        item = ApiItem()
        item["url"] = record["uri"]
        item["Name"] = record["name"]
        yield item

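Note that return has been replaced with yield. This turns parse into a generator, so it can emit one item per record instead of stopping after the first.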
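The per-record loop can also be pulled into a plain generator and tried without running a crawl. This is only a sketch: `iter_items` is a hypothetical helper name, the two-record payload is made up for illustration, and using `.get()` to tolerate missing keys is a design choice of this sketch rather than part of the answer above.

```python
import json


def iter_items(body_text):
    """Yield one dict per entry in the "records" list of a JSON body."""
    data = json.loads(body_text)
    # .get() with a default keeps this from failing on a page without "records".
    for record in data.get("records", []):
        yield {"url": record.get("uri"), "Name": record.get("name")}


# Hypothetical two-record payload for illustration.
sample_body = json.dumps({
    "records": [
        {"uri": "https://www.example.com", "name": "MyLibery"},
        {"uri": "https://www.example.org", "name": "Other"},
    ]
})

items = list(iter_items(sample_body))
print(items)
```

Inside the spider, parse could loop over iter_items(response.text) and yield each result. For the CSV part of the question, Scrapy's feed export should cover it: running `scrapy crawl API -o items.csv` writes the yielded items to a CSV file (the output file name here is just an example).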