Recursive crawling with authentication using Scrapy

Date: 2014-02-28 17:59:23

Tags: python forms-authentication web-crawler scrapy

I am trying to crawl web pages that require authentication.

The problem I am facing is that the login works fine the first time and I get the "Successfully logged in" log message, but when the crawler starts crawling pages from the start_url, the CSV output does not capture the pages that require login credentials to view their data.

Am I missing something needed to keep the login session alive for the whole crawl, or to check each URL that requires a login before continuing?

My login form is a POST form, and the output looks like this:

2014-02-28 21:16:53+0000 [myspider] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)

2014-02-28 21:16:53+0000 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023

2014-02-28 21:16:53+0000 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080

2014-02-28 21:16:53+0000 [myspider] DEBUG: Crawled (200) <GET https://someurl.com/login_form> (referer: None)

2014-02-28 21:16:53+0000 [myspider] DEBUG: Crawled (200) <GET https://someurl.com/search> (referer: https://someurl.com/login_form)

2014-02-28 21:16:53+0000 [myspider] DEBUG: Successfully logged in. Start Crawling

On the first hit it automatically goes to the search page instead of the login_form page (the start_url).

Could someone please help me solve this problem?

Below is my code:

from scrapy.spider import BaseSpider
from scrapy.contrib.spiders.init import InitSpider
from scrapy.http import Request, FormRequest
from scrapy.selector import HtmlXPathSelector
from tutorial.items import DmozItem
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
import urlparse
from scrapy import log


class MySpider(CrawlSpider):

        name = 'myspider'
        allowed_domains = ['someurl.com']
        login_page = 'https://someurl.com/login_form'
        start_urls = 'https://someurl.com/'

        rules = [Rule(SgmlLinkExtractor(), follow=True, callback='parse_item')]

        def start_requests(self):

            yield Request(
                url=self.login_page,
                callback=self.login,
                dont_filter=True
            )


        def login(self, response):
            """Generate a login request."""
            return FormRequest.from_response(response,
                    formdata={'__ac_name': 'username', '__ac_password': 'password' },
                    callback=self.check_login_response)


        def check_login_response(self, response):
            if "Sign Out" in response.body:
                self.log("Successfully logged in. Start Crawling")
                return Request(url=self.start_urls)
            else:
                self.log("Not Logged in")


        def parse_item(self, response):

            # Scrape data from page
            items = []
            failed_urls = []
            hxs = HtmlXPathSelector(response)

            urls = hxs.select('//base/@href').extract()
            urls.extend(hxs.select('//link/@href').extract())
            urls.extend(hxs.select('//a/@href').extract())
            urls = list(set(urls))

            for url in urls :

                item = DmozItem()

                if response.status == 404:
                    failed_urls.append(response.url)
                    self.log('failed_url : %s' % failed_urls)
                    item['failed_urls'] = failed_urls
                else :

                    if url.startswith('http') :
                        if url.startswith('https://someurl.com'):
                            item['internal_link'] = url
                            self.log('internal_link : %s ' % url)
                        else :
                            item['external_link'] = url
                            self.log('external_link : %s ' % url)

                items.append(item)

            items = list(set(items))
            return items

2 Answers:

Answer 0 (score: 0)

You need a headless browser, not just a scraper. Try extending Scrapy with scrapyjs (https://github.com/scrapinghub/scrapyjs) or Selenium.
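If the login form is rendered or submitted by JavaScript, a plain FormRequest will not see it. A minimal sketch of the Selenium route (not the asker's exact setup): perform the login in a real browser and hand the resulting session cookies to Scrapy. The field names come from the question's code; the submit-button selector is a placeholder you would adjust for the actual page.

    # Sketch only: log in via Selenium, then reuse the session cookies in Scrapy.
    from selenium import webdriver
    from scrapy.http import Request

    def login_with_selenium(start_url):
        driver = webdriver.Firefox()
        driver.get('https://someurl.com/login_form')
        # Field names taken from the question; adjust for the real form.
        driver.find_element_by_name('__ac_name').send_keys('username')
        driver.find_element_by_name('__ac_password').send_keys('password')
        # Placeholder selector for the submit button.
        driver.find_element_by_css_selector('input[type=submit]').click()
        # Copy the authenticated session cookies out of the browser.
        cookies = dict((c['name'], c['value']) for c in driver.get_cookies())
        driver.quit()
        # Later Scrapy requests carry the authenticated session.
        return Request(start_url, cookies=cookies)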

Answer 1 (score: 0)

You can pass authentication through Scrapy using FormRequest, like this:

scrapy.FormRequest(
    self.start_urls[0],
    formdata={'LoginForm[username]': username_scrapy,
              'LoginForm[password]': password_scrapy,
              'yt0': 'Login'},
    headers=self.headers)

LoginForm[username] and LoginForm[password] are the variables passed through the login form.
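For the recursive crawl in the question, the login can be chained in front of the CrawlSpider rules. Below is a minimal sketch, assuming the same form field names ('__ac_name', '__ac_password') and the "Sign Out" marker used in the question, and relying on Scrapy's built-in cookies middleware to re-send the session cookie on every later request:

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.http import Request, FormRequest


    class LoginCrawlSpider(CrawlSpider):
        name = 'login_crawl'
        allowed_domains = ['someurl.com']
        login_page = 'https://someurl.com/login_form'
        start_urls = ['https://someurl.com/']  # a list, not a plain string

        rules = [Rule(SgmlLinkExtractor(), follow=True, callback='parse_item')]

        def start_requests(self):
            # Log in first; the crawl only starts after the login check succeeds.
            yield Request(self.login_page, callback=self.login, dont_filter=True)

        def login(self, response):
            return FormRequest.from_response(
                response,
                formdata={'__ac_name': 'username', '__ac_password': 'password'},
                callback=self.check_login_response)

        def check_login_response(self, response):
            if "Sign Out" in response.body:
                self.log("Successfully logged in. Start Crawling")
                # The session cookie set during login is re-sent automatically,
                # so these requests are authenticated. Leaving callback unset
                # lets CrawlSpider's default parse() apply the rules above.
                return [Request(url, dont_filter=True) for url in self.start_urls]
            self.log("Login failed")

        def parse_item(self, response):
            # Scrape data from the authenticated page here.
            pass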