I'm trying to use Scrapy to parse the JSON response from the New York Times API into CSV so that I can summarize all the relevant articles for a particular query. I want to spit it out as a CSV with the link, publication date, summary, and title so that I can run some keyword searches on the summary descriptions. I'm new to Python and Scrapy, but here is my spider (I'm getting an HTTP 400 error). I've xx'd out my API key in the spider:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from nytimesAPIjson.items import NytimesapijsonItem
import json
import urllib2

class MySpider(BaseSpider):
    name = "nytimesapijson"
    allowed_domains = ["http://api.nytimes.com/svc/search/v2/articlesearch"]
    req = urllib2.urlopen('http://api.nytimes.com/svc/search/v2/articlesearch.json?q="financial crime"&facet_field=day_of_week&begin_date=20130101&end_date=20130916&page=2&rank=newest&api-key=xxx')

    def json_parse(self, response):
        jsonresponse = json.loads(response)

        item = NytimesapijsonItem()
        item["pubDate"] = jsonresponse["pub_date"]
        item["description"] = jsonresponse["lead_paragraph"]
        item["title"] = jsonresponse["print_headline"]
        item["link"] = jsonresponse["web_url"]
        items.append(item)

        return items
If anyone has any ideas/suggestions, including anything outside of Scrapy, please let me know. Thanks in advance.
Answer 0 (score: 2)
You should set start_urls and use the parse method:
from scrapy.spider import BaseSpider
import json

class MySpider(BaseSpider):
    name = "nytimesapijson"
    allowed_domains = ["api.nytimes.com"]
    start_urls = ['http://api.nytimes.com/svc/search/v2/articlesearch.json?q="financial crime"&facet_field=day_of_week&begin_date=20130101&end_date=20130916&page=2&rank=newest&api-key=xxx']

    def parse(self, response):
        # json.loads needs the response body, not the Response object itself
        jsonresponse = json.loads(response.body)
        print jsonresponse
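Once the request succeeds, the articles are not at the top level of the JSON: in the articlesearch v2 response they sit in a list under response["docs"], and the headline is itself a nested object. Below is a minimal, hedged sketch (written in Python 3, with a hand-made sample dict standing in for a real API response) of flattening those fields into the CSV columns the question asks for; the exact field names are assumptions based on the v2 response shape, not taken from the original post:

```python
import csv
import io
import json

# Trimmed sample mimicking the assumed articlesearch v2 response shape:
# articles live in a list under response["docs"].
sample = json.loads("""
{
  "response": {
    "docs": [
      {
        "web_url": "http://www.nytimes.com/2013/09/16/example.html",
        "pub_date": "2013-09-16T00:00:00Z",
        "lead_paragraph": "Regulators announced new charges on Monday.",
        "headline": {"print_headline": "Financial Crime Probe Widens"}
      }
    ]
  }
}
""")

def docs_to_csv(data):
    """Flatten the docs list into CSV rows: link, pubDate, title, description."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["link", "pubDate", "title", "description"])
    for doc in data["response"]["docs"]:
        writer.writerow([
            doc.get("web_url", ""),
            doc.get("pub_date", ""),
            # headline is a nested object in the v2 response
            doc.get("headline", {}).get("print_headline", ""),
            doc.get("lead_paragraph", ""),
        ])
    return buf.getvalue()

print(docs_to_csv(sample))
```

Inside the spider's parse method you would run the same loop over jsonresponse["response"]["docs"], yielding one NytimesapijsonItem per doc, and let Scrapy's built-in CSV feed export (-o articles.csv -t csv) do the file writing for you.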