Scrapy: how do I parse a JSON response?

Date: 2015-03-22 11:22:41

Tags: python json scrapy

I have a spider (source shown below) that works great for scraping regular HTML pages. However, I want to add an extra feature: I want to parse a JSON page as well.

Here is what I want to do (done by hand here, without Scrapy):

import requests, json
import datetime

def main():
    user_agent = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.101 Safari/537.36'
    }

    # This is the URL that outputs JSON:
    externalj = 'http://www.thestudentroom.co.uk/externaljson.php?&s='
    # Form the end of the URL; it is based on the time (unix time):

    past = datetime.datetime.now() - datetime.timedelta(minutes=15)
    time = past.strftime('%s')
    # This is the full URL:
    url = externalj + time

    # Make the HTTP get request:
    tsr_data = requests.get(url, headers=user_agent).json()

    # Iterate over the JSON data and form the URLs
    # (there are no URLs at all in the JSON data; they must be formed
    # manually by concatenating the canonical link with a thread id):
    for post in tsr_data['discussions-recent']:
        link = 'http://www.thestudentroom.co.uk/showthread.php?t='
        # threadid may be a number in the JSON, so cast it to str:
        return link + str(post['threadid'])
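A side note on the snippet above: `strftime('%s')` happens to work on Linux/macOS (it is passed through to the platform C library) but is not a documented format code and fails on Windows. A minimal sketch of a portable way to build the same timestamp parameter:

```python
import datetime
import time

# Unix time for "15 minutes ago", without the platform-specific
# '%s' strftime directive:
past = datetime.datetime.now() - datetime.timedelta(minutes=15)
unix_time = str(int(time.mktime(past.timetuple())))

# The full URL, formed the same way as in the snippet above:
url = 'http://www.thestudentroom.co.uk/externaljson.php?&s=' + unix_time
```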

This function returns the correct link to an HTML page (a forum thread) that I want to scrape. It seems I need to create my own Request object to send to parse_link in my spider.

My question is: where do I put this code? I'm confused about how to incorporate it into Scrapy. Do I need to create another spider?

Ideally I would like it to work with the spider that I already have, but I'm not sure whether that's possible.

I'm very confused about how to implement this in Scrapy, and I hope someone can advise!

My current spider looks like this:

import scrapy
from tutorial.items import TsrItem
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

class TsrSpider(CrawlSpider):
    name = 'tsr'
    allowed_domains = ['thestudentroom.co.uk']

    start_urls = ['http://www.thestudentroom.co.uk/forumdisplay.php?f=89']

    download_delay = 2
    user_agent = 'youruseragenthere'

    thread_xpaths = ("//tr[@class='thread  unread    ']",
            "//*[@id='discussions-recent']/li/a",
            "//*[@id='discussions-popular']/li/a")

    rules = [
        Rule(LinkExtractor(allow=(r'showthread\.php\?t=\d+',),
                           restrict_xpaths=thread_xpaths),
             callback='parse_link', follow=True),
    ]

    def parse_link(self, response):
        for sel in response.xpath("//li[@class='post threadpost old   ']"):
            item = TsrItem()
            item['id'] = sel.xpath(
                "div[@class='post-header']//li[@class='post-number museo']/a/span/text()").extract()
            item['rating'] = sel.xpath(
                "div[@class='post-footer']//span[@class='score']/text()").extract()
            item['post'] = sel.xpath(
                "div[@class='post-content']/blockquote[@class='postcontent restore']/text()").extract()
            item['link'] = response.url
            item['topic'] = response.xpath(
                "//div[@class='forum-header section-header']/h1/span/text()").extract()
            yield item

2 Answers:

Answer 0 (score: 1)

It seems I have found a way to make it work. Perhaps my original post was not clear.

I want to parse a JSON response and then hand Scrapy a Request to process further.

I added the following to my spider:

# A Request object is required, and json to decode the response body.
import json
from scrapy.http import Request

def parse_start_url(self, response):
    if  'externaljson.php' in str(response.url):
        return self.make_json_links(response)

parse_start_url seems to do just what it says: it parses the initial URLs (the start URLs). Only the JSON page should be handled here.

So I need to add my special JSON URL alongside my HTML URLs:

start_urls = ['http://tsr.com/externaljson.php', 'http://tsr.com/thread.html']

Now I need to generate the URLs, in the form of Requests, from the JSON page's response:

def make_json_links(self, response):
    ''' Creates requests from JSON page. '''
    data = json.loads(response.body_as_unicode())
    for post in data['discussions-recent']:
        link = 'http://www.tsr.co.uk/showthread.php?t='
        full_link = link + str(post['threadid'])
        json_request = Request(url=full_link)
        return json_request

It seems to work now. However, I'm sure this is a hacky and inelegant way of achieving it; somehow it feels wrong.

It does seem to work, and it follows all the links I build from the JSON page. I'm also not sure whether I should be using yield instead of return somewhere in there...
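For what it's worth, yield is indeed the right choice: return exits make_json_links after building the first Request, so only one thread link is ever produced per JSON page. A minimal sketch of the generator version, with the Request wrapper left out so it runs without Scrapy (URL and payload shape taken from the snippets above; the sample data is made up):

```python
import json

def make_json_links(json_body):
    """Yield one thread URL per post in the JSON payload.
    In the spider, each URL would be wrapped in scrapy.http.Request."""
    data = json.loads(json_body)
    for post in data['discussions-recent']:
        link = 'http://www.thestudentroom.co.uk/showthread.php?t='
        # yield, not return: return would stop after the first post
        yield link + str(post['threadid'])

# A payload in the shape the spider expects (structure assumed):
sample = '{"discussions-recent": [{"threadid": 101}, {"threadid": 202}]}'
links = list(make_json_links(sample))
```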

Answer 1 (score: 0)

Do the links always follow the same format? Would it not be possible to create a new Rule for the JSON link, with a separate parse_json function as the callback?
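One caveat with the Rule approach: a LinkExtractor only discovers links in HTML markup, and the JSON body here contains no URLs at all, so a callback would still have to build them itself. An alternative is a single callback that branches on the response's Content-Type. A dependency-free sketch of that dispatch logic (the helper name and return shape are hypothetical, not Scrapy API):

```python
import json

def route_body(content_type, body):
    """Decode a JSON body, or pass an HTML body through unchanged.
    A real spider would do this branching inside its callback and
    then either build Requests from the data or extract items."""
    if 'json' in content_type.lower():
        return 'json', json.loads(body)
    return 'html', body

kind, payload = route_body('application/json', '{"discussions-recent": []}')
```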