Question

I am using Python 3.5.2 and Scrapy 1.1.

The demo below involves a nested request: the article content page makes an AJAX request that returns the author when the user is logged in.

I wrote the spider below, but I cannot get the author, and I am having trouble understanding what is wrong.

Demo:

# -*- coding: utf-8 -*-
import scrapy
from demo.items import ExampleItem
from scrapy.spiders import CrawlSpider
import re

class ExampleSpider(CrawlSpider):
    name = "example"
    allowed_domains = ["example.com"]

    start_urls = [
        "http://www.example.com/articles-list.php?page=1",
        "http://www.example.com/articles-list.php?page=2",
        "http://www.example.com/articles-list.php?page=3",
        "http://www.example.com/articles-list.php?page=4",
        "http://www.example.com/articles-list.php?page=5",
        "http://www.example.com/articles-list.php?page=6",
    ]

    headers = {
        'Accept':'*/*',
        'Accept-Encoding':'gzip, deflate, sdch',
        'Accept-Language':'zh-CN,zh;q=0.8',
        'Connection':'keep-alive',
        'Cookie':'PHPSESSID=12345370000029b72333333dc999999; QS[uid]=100; QS[username]=example; QS[password]=example.com; QS[pmscount]=1',
        'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2774.3 Safari/537.36',
        'X-Requested-With':'XMLHttpRequest'
    }

    def parse(self, response):
        hrefs = response.xpath('a/@href')
        for href in hrefs:
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_article_contents)


    def parse_article_contents(self, response):
        for sel in response.xpath('/html/body'):
            item = ExampleItem()
            item['articleUrl'] = response.url
            item['title'] = sel.xpath('div[3]/a[2]/@href')[0].extract()
            item['content'] = sel.xpath('div[2]/div[2]/div[1]/text()')[0].extract()

            # On the article content page there is an AJAX request that returns the author when logged in.
            articleId = re.search(u'id=(\d{1,4})&', item['articleUrl']).group(1)
            articleAuthorUrl = 'http://www.example.com/plus/ajax_author.php?id=' + articleId

            # Fetch the author below. Is this correct?
            def request_article_author(self):
                return scrapy.Request(url=articleAuthorUrl,headers=headers,callback=self.parse_article_author)

            def parse_article_author(self, response):
                item['author'] = response.xpath('/html/body/div/div[1]/div[2]/text()').extract()

        # Can item['author'] be included in the "yield item" below?
        yield item
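
A side note on the ID extraction in `parse_article_contents`: the pattern `id=(\d{1,4})&` only matches when the ID has at most four digits and is followed by `&`, and `.group(1)` raises `AttributeError` when nothing matches. Below is a minimal sketch of a more forgiving alternative using only the standard library. It assumes the article URL carries the ID as an `id` query parameter, which the regex suggests but the post does not confirm; `build_author_url` is a hypothetical helper name.

```python
import re
from urllib.parse import parse_qs, urlencode, urlparse

# Endpoint taken from the demo code in the question.
AUTHOR_ENDPOINT = "http://www.example.com/plus/ajax_author.php"


def build_author_url(article_url):
    """Return the author AJAX URL for an article URL, or None if no numeric id is present."""
    # Assumption: the article ID is passed as the `id` query parameter.
    query = parse_qs(urlparse(article_url).query)
    article_id = query.get("id", [""])[0]
    # Accept ASCII digits only; anything else is treated as "no ID".
    if not re.fullmatch(r"[0-9]+", article_id):
        return None
    return AUTHOR_ENDPOINT + "?" + urlencode({"id": article_id})
```

For instance, with a hypothetical article URL such as `http://www.example.com/article.php?id=123&page=1`, the helper would return `http://www.example.com/plus/ajax_author.php?id=123`, and it returns `None` instead of raising when the URL has no `id`.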

Any ideas?

Scrapy nested requests

0 answers: