InitSpider does not crawl or capture any data

Asked: 2014-07-17 21:14:57

Tags: python scrapy

Given the available documentation, I'm fairly unsure which class I should inherit from for a crawling spider.

The example below tries to start from an authentication page and then continue crawling all of the logged-in pages. Judging by the console output it authenticates fine, but it never outputs even the first page to JSON and stops after the first 200-status page:

All I get is this (a newline, then an opening square bracket)

JSON file

[

Console output

DEBUG: Crawled (200) <GET https://www.mydomain.com/users/sign_in> (referer: None)
DEBUG: Redirecting (302) to <GET https://www.mydomain.com/> from <POST https://www.mydomain.com/users/sign_in>
DEBUG: Crawled (200) <GET https://www.mydomain.com/> (referer: https://www.mydomain.com/users/sign_in)
DEBUG: am logged in
INFO: Closing spider (finished)
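The output shows the login round-trip completing, yet the spider shuts down before ever requesting the start_urls. With InitSpider, the start URLs are only scheduled once the initialization callback chain returns the result of `self.initialized()`; calling it without returning it leaves the engine with nothing left to do. A scrapy-free toy model of that handshake (all names hypothetical, not Scrapy's real internals) makes the control flow visible:

```python
# Toy model of the InitSpider init handshake (hypothetical names, not
# real Scrapy internals): each callback may RETURN the next request to
# schedule; a dropped return value ends the crawl.

def crawl(spider):
    """Drive callbacks until no callback returns a further request."""
    queue = [spider.init_request()]
    visited = []
    while queue:
        url, callback = queue.pop(0)
        visited.append(url)
        nxt = callback(url)
        if nxt is not None:
            queue.append(nxt)
    return visited

class BuggySpider:
    """Mirrors the question: initialized() is called but not returned."""
    def init_request(self):
        return ("sign_in", self.check_login)
    def check_login(self, url):
        self.initialized()          # result discarded -> crawl ends here
    def initialized(self):
        return ("start_url", self.parse)
    def parse(self, url):
        return None                 # no further requests

class FixedSpider(BuggySpider):
    def check_login(self, url):
        return self.initialized()   # hand the start URLs back to the engine

print(crawl(BuggySpider()))  # ['sign_in']
print(crawl(FixedSpider()))  # ['sign_in', 'start_url']
```

If this model matches the real behavior, changing `self.initialized()` to `return self.initialized()` in `check_login_response` would be the first thing to try.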

Run with:

scrapy crawl MY_crawler -o items.json

Using this spider:

import scrapy
from scrapy.contrib.spiders.init import InitSpider
from scrapy.contrib.spiders import Rule
from scrapy.http import Request, FormRequest
from scrapy.contrib.linkextractors import LinkExtractor
from cmrcrawler.items import MycrawlerItem

class MyCrawlerSpider(InitSpider):
    name = "MY_crawler"
    allowed_domains = ["mydomain.com"]
    login_page = 'https://www.mydomain.com/users/sign_in'
    start_urls = [
        "https://www.mydomain.com/",
    ]

    rules = (
        # trailing comma required so this stays a one-element tuple
        Rule(LinkExtractor(), callback='parse_item', follow=True),

    )

    def init_request(self):

        return Request(url=self.login_page, callback=self.login)

    def login(self, response):
        auth_token = response.xpath('authxpath').extract()[0]

        return FormRequest.from_response(
            response,
            formdata={'user[email]': '***', 'user[password]': '***', 'authenticity_token': auth_token},
            callback=self.check_login_response)

    def check_login_response(self, response):

        if "Signed in successfully" in response.body:
            self.log("am logged in")
            self.initialized()

        else:
            self.log("couldn't login")
            print response.body

    def parse_item(self, response):

        item = MycrawlerItem()

        item['url'] = response.url
        item['title'] = response.xpath('//title/text()').extract()[0]

        yield item
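One more thing worth noting about the spider above: the `rules` tuple is a CrawlSpider feature, and InitSpider derives from the plain Spider base, so the `Rule(LinkExtractor(), ...)` entry is most likely being silently ignored. A scrapy-free sketch (toy classes, not the real Scrapy API) of why defining `rules` on the wrong base class does nothing:

```python
# Toy sketch (hypothetical classes, not Scrapy's real internals):
# link-following via `rules` lives in CrawlSpider's parsing logic; a
# spider that does not inherit from CrawlSpider never consults `rules`.

class Spider:
    def parse(self, response):
        return []                            # base class follows no links

class CrawlSpider(Spider):
    rules = ()
    def parse(self, response):
        # apply every rule's link extractor to the response
        return [link for rule in self.rules for link in rule(response)]

class InitSpider(Spider):                    # note: NOT a CrawlSpider
    pass

def extract_links(response):                 # stand-in for LinkExtractor
    return ["page2", "page3"]

class WithInit(InitSpider):
    rules = (extract_links,)                 # defined, but never consulted

class WithCrawl(CrawlSpider):
    rules = (extract_links,)

print(WithInit().parse("page1"))   # []
print(WithCrawl().parse("page1"))  # ['page2', 'page3']
```

So even with the login fixed, following links through `rules` would presumably require basing the spider on CrawlSpider (e.g. doing the login from an overridden `start_requests`) rather than InitSpider.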

0 Answers