How can I use Scrapy to scrape an API that returns JSON? The JSON looks like this:
"records": [
{
"uri": "https://www.example.com",
"access": {
"update": false
},
"id": 17059,
"vid": 37614,
"name": "MyLibery",
"claim": null,
"claimedBy": null,
"authorUid": "3",
"lifecycle": "L",
"companyType": "S",
"ugcState": 10,
"companyLogo": {
"fileName": "mylibery-logo.png",
"filePath": "sites/default/files/imagecache/company_logo_70/mylibery-logo.png"
}
I tried this code:
import scrapy
import json

class ApiItem(scrapy.Item):
    url = scrapy.Field()
    Name = scrapy.Field()

class ExampleSpider(scrapy.Spider):
    name = 'API'
    allowed_domains = ["site.com"]
    start_urls = [l.strip() for l in open('pages.txt').readlines()]

    def parse(self, response):
        filename = response.url.split("/")[-2]
        open(filename, 'wb').write(response.body)
        jsonresponse = json.loads(response.body_as_unicode())
        item = ApiItem()
        item["url"] = jsonresponse["uri"]
        item["Name"] = jsonresponse["name"]
        return item
" Pages.txt"是我想要抓取的API页面列表,我只想提取" uri"和"名称"并将其保存到csv。
But it raises an error saying:
2017-08-18 13:23:02 [scrapy] ERROR: Spider error processing <GET https://www.investiere.ch/proxy/api2/v1/companies?extra%5Bimagecache%5D=company_logo_70&fields=companyType,lifecycle&page=8&parameters%5Binclude_skipped%5D=yes> (referer: None)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/twisted/internet/defer.py", line 651, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/home/habenn/Projects/inapi/inapi/spiders/example.py", line 22, in parse
    item["url"] = jsonresponse["uri"]
KeyError: 'uri'
Answer 0 (score: 1)
Given the example you posted, the fields are nested inside the "records" array, so it should look like this:
item["url"] = jsonresponse["records"][0]["uri"]
item["Name"] = jsonresponse["records"][0]["name"]
EDIT:
To get all "uri" and "name" values from the response, use the following:
def parse(self, response):
    ...
    for record in jsonresponse["records"]:
        item = ApiItem()
        item["url"] = record["uri"]
        item["Name"] = record["name"]
        yield item
Note that return has been replaced with yield, so the spider emits one item per record.
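The extraction loop itself is plain Python, so you can sanity-check it without running Scrapy. This sketch uses a hypothetical response body matching the structure in the question and writes the same columns with the csv module that Scrapy's feed export would produce:

```python
import csv
import json

# Hypothetical response body matching the structure shown in the question
body = '{"records": [{"uri": "https://www.example.com", "name": "MyLibery"}]}'

jsonresponse = json.loads(body)

# Same extraction loop as the spider's parse(), minus the Scrapy machinery
rows = [{"url": r["uri"], "Name": r["name"]} for r in jsonresponse["records"]]

with open("output.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["url", "Name"])
    writer.writeheader()
    writer.writerows(rows)
```

In practice you don't need the manual CSV code at all: since the spider now yields items, Scrapy's built-in feed export will write the CSV for you with `scrapy crawl API -o output.csv`.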