Scrapy: how do I parse a JSON response?

Date: 2015-03-22 11:22:41

Tags: python json scrapy

I have a spider (source shown below) that works great for scraping regular HTML pages. However, I want to add an extra feature: I want to parse a JSON page as well.

Here is what I want to do (done by hand here, without Scrapy):

import requests, json
import datetime

def main():
    user_agent = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.101 Safari/537.36'
    }

    # This is the URL that outputs JSON:
    externalj = 'http://www.thestudentroom.co.uk/externaljson.php?&s='
    # Form the end of the URL; it is based on the time (unix time):

    past = datetime.datetime.now() - datetime.timedelta(minutes=15)
    time = past.strftime('%s')
    # This is the full URL:
    url = externalj + time

    # Make the HTTP get request:
    tsr_data = requests.get(url, headers=user_agent).json()

    # Iterate over the JSON data and form the URLs
    # (there are no URLs at all in the JSON data; they must be formed
    # manually by concatenating the canonical link with a thread id):
    for post in tsr_data['discussions-recent']:
        link = 'http://www.thestudentroom.co.uk/showthread.php?t='
        # threadid may be a number in the JSON, so cast it to str:
        return link + str(post['threadid'])
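A side note on the snippet above: `strftime('%s')` happens to work on Linux/macOS (it is passed through to the platform C library) but is not a documented format code and fails on Windows. A minimal sketch of a portable way to build the same timestamp parameter:

```python
import datetime
import time

# Unix time for "15 minutes ago", without the platform-specific
# '%s' strftime directive:
past = datetime.datetime.now() - datetime.timedelta(minutes=15)
unix_time = str(int(time.mktime(past.timetuple())))

# The full URL, formed the same way as in the snippet above:
url = 'http://www.thestudentroom.co.uk/externaljson.php?&s=' + unix_time
```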

This function returns the correct link to an HTML page (a forum thread) that I want to scrape. It seems I need to create my own Request object to send to parse_link in my spider.

My question is: where do I put this code? I'm confused about how to incorporate it into Scrapy. Do I need to create another spider?

Ideally I would like it to work with the spider that I already have, but I'm not sure whether that's possible.

I'm very confused about how to implement this in Scrapy, and I hope someone can advise!

My current spider looks like this:

import scrapy
from tutorial.items import TsrItem
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

class TsrSpider(CrawlSpider):
    name = 'tsr'
    allowed_domains = ['thestudentroom.co.uk']

    start_urls = ['http://www.thestudentroom.co.uk/forumdisplay.php?f=89']

    download_delay = 2
    user_agent = 'youruseragenthere'

    thread_xpaths = ("//tr[@class='thread  unread    ']",
            "//*[@id='discussions-recent']/li/a",
            "//*[@id='discussions-popular']/li/a")

    rules = [
        Rule(LinkExtractor(allow=(r'showthread\.php\?t=\d+',),
                           restrict_xpaths=thread_xpaths),
             callback='parse_link', follow=True),
    ]

    def parse_link(self, response):
        for sel in response.xpath("//li[@class='post threadpost old   ']"):
            item = TsrItem()
            item['id'] = sel.xpath(
                "div[@class='post-header']//li[@class='post-number museo']/a/span/text()").extract()
            item['rating'] = sel.xpath(
                "div[@class='post-footer']//span[@class='score']/text()").extract()
            item['post'] = sel.xpath(
                "div[@class='post-content']/blockquote[@class='postcontent restore']/text()").extract()
            item['link'] = response.url
            item['topic'] = response.xpath(
                "//div[@class='forum-header section-header']/h1/span/text()").extract()
            yield item

2 Answers:

Answer 0 (score: 1)

It seems I have found a way to make it work. Perhaps my original post was not clear.

I want to parse a JSON response and then hand Scrapy a Request to process further.

I added the following to my spider:

# A Request object is required, and json to decode the response body.
import json
from scrapy.http import Request

def parse_start_url(self, response):
    if  'externaljson.php' in str(response.url):
        return self.make_json_links(response)

parse_start_url seems to do just what it says: it parses the initial URLs (the start URLs). Only the JSON page should be handled here.

So I need to add my special JSON URL alongside my HTML URLs:

start_urls = ['http://tsr.com/externaljson.php', 'http://tsr.com/thread.html']

Now I need to generate the URLs, in the form of Requests, from the JSON page's response:

def make_json_links(self, response):
    ''' Creates requests from JSON page. '''
    data = json.loads(response.body_as_unicode())
    for post in data['discussions-recent']:
        link = 'http://www.tsr.co.uk/showthread.php?t='
        full_link = link + str(post['threadid'])
        json_request = Request(url=full_link)
        return json_request

It seems to work now. However, I'm sure this is a hacky and inelegant way of achieving it; somehow it feels wrong.

It does seem to work, and it follows all the links I build from the JSON page. I'm also not sure whether I should be using yield instead of return somewhere in there...
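For what it's worth, yield is indeed the right choice: return exits make_json_links after building the first Request, so only one thread link is ever produced per JSON page. A minimal sketch of the generator version, with the Request wrapper left out so it runs without Scrapy (URL and payload shape taken from the snippets above; the sample data is made up):

```python
import json

def make_json_links(json_body):
    """Yield one thread URL per post in the JSON payload.
    In the spider, each URL would be wrapped in scrapy.http.Request."""
    data = json.loads(json_body)
    for post in data['discussions-recent']:
        link = 'http://www.thestudentroom.co.uk/showthread.php?t='
        # yield, not return: return would stop after the first post
        yield link + str(post['threadid'])

# A payload in the shape the spider expects (structure assumed):
sample = '{"discussions-recent": [{"threadid": 101}, {"threadid": 202}]}'
links = list(make_json_links(sample))
```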

Answer 1 (score: 0)

Do the links always follow the same format? Would it not be possible to create a new Rule for the JSON link, with a separate parse_json function as the callback?
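One caveat with the Rule approach: a LinkExtractor only discovers links in HTML markup, and the JSON body here contains no URLs at all, so a callback would still have to build them itself. An alternative is a single callback that branches on the response's Content-Type. A dependency-free sketch of that dispatch logic (the helper name and return shape are hypothetical, not Scrapy API):

```python
import json

def route_body(content_type, body):
    """Decode a JSON body, or pass an HTML body through unchanged.
    A real spider would do this branching inside its callback and
    then either build Requests from the data or extract items."""
    if 'json' in content_type.lower():
        return 'json', json.loads(body)
    return 'html', body

kind, payload = route_body('application/json', '{"discussions-recent": []}')
```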