I have a website that is login-only. I want to log in at http://145.100.108.148/login2/login.php and then scrape the next page, http://145.100.108.148/login2/index.php. Both pages have to be saved to disk as .html files.
from scrapy.http import Request, FormRequest
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy.selector import HtmlXPathSelector


class TestSpider(CrawlSpider):
    name = 'testspider'
    login_page = 'http://145.100.108.148/login2/login.php'
    start_urls = ['http://145.100.108.148/login2/index.php']

    rules = (
        Rule(LinkExtractor(allow=r'.*'),
             callback='parse_item', follow=True),
    )

    login_user = 'test@hotmail.com'
    login_pass = 'test'

    def start_request(self):
        """This function is called before crawling starts"""
        return [Request(url=self.login_page, callback=self.login)]

    def login(self, response):
        """Generate a login request"""
        return FormRequest.from_response(
            response,
            formdata={
                'email': self.login_user,
                'pass': self.login_pass},
            callback=self.check_login_response)

    def check_login_response(self, response):
        """Check the response returned by a login request to see if we are
        successfully logged in"""
        if b"Dashboard" in response.body:
            self.logger.info("successfully logged in. Let's start crawling!")
            return self.initialized()
        else:
            self.logger.info("NOT LOGGED IN :(")
            # Something went wrong, we couldn't log in, so nothing happens.
            return

    def parse_item(self, response):
        """Save pages to disk"""
        self.logger.info('Hi, this is an item page! %s', response.url)
        page = response.url.split("/")[-2]
        filename = 'scraped-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
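(Side note: a single-file spider like this can be run with scrapy runspider testspider.py; the filename here is hypothetical. Since parse_item opens its output files with relative paths, the saved .html files land in whatever directory the command is run from.)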
Output:
2018-01-16 10:32:14 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-01-16 10:32:14 [scrapy.core.engine] INFO: Spider opened
2018-01-16 10:32:14 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-01-16 10:32:14 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-01-16 10:32:14 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://145.100.108.148/robots.txt> (referer: None)
2018-01-16 10:32:14 [scrapy.downloadermiddlewares.cookies] DEBUG: Received cookies from: <302 http://145.100.108.148/login2/index.php>
Set-Cookie: PHPSESSID=4oeh65l59aeutc2qetvgtpn0c6; path=/
2018-01-16 10:32:14 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET http://145.100.108.148/login2/login.php> from <GET http://145.100.108.148/login2/index.php>
2018-01-16 10:32:14 [scrapy.downloadermiddlewares.cookies] DEBUG: Sending cookies to: <GET http://145.100.108.148/login2/login.php>
Cookie: PHPSESSID=4oeh65l59aeutc2qetvgtpn0c6
2018-01-16 10:32:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://145.100.108.148/login2/login.php> (referer: None)
2018-01-16 10:32:14 [scrapy.downloadermiddlewares.cookies] DEBUG: Sending cookies to: <GET http://145.100.108.148/login2/register.php>
Cookie: PHPSESSID=4oeh65l59aeutc2qetvgtpn0c6
2018-01-16 10:32:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://145.100.108.148/login2/register.php> (referer: http://145.100.108.148/login2/login.php)
2018-01-16 10:32:14 [testspider] INFO: Hi, this is an item page! http://145.100.108.148/login2/register.php
2018-01-16 10:32:14 [testspider] DEBUG: Saved file scraped-login2.html
2018-01-16 10:32:14 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET http://145.100.108.148/login2/register.php> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2018-01-16 10:32:14 [scrapy.downloadermiddlewares.cookies] DEBUG: Sending cookies to: <GET http://145.100.108.148/login2/login.php>
Cookie: PHPSESSID=4oeh65l59aeutc2qetvgtpn0c6
2018-01-16 10:32:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://145.100.108.148/login2/login.php> (referer: http://145.100.108.148/login2/register.php)
2018-01-16 10:32:14 [testspider] INFO: Hi, this is an item page! http://145.100.108.148/login2/login.php
2018-01-16 10:32:14 [testspider] DEBUG: Saved file scraped-login2.html
2018-01-16 10:32:14 [scrapy.core.engine] INFO: Closing spider (finished)
So while crawling, the output gives no indication of whether the spider is logged in or not, even though an if/else statement was created at the start of check_login_response. I am also not sure whether the crawler has an authenticated session. And there is only one saved file, named scraped-login2.html, while I expected at least three files: the register page, the login page, and the index.php page.
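A note on the single file: parse_item derives the name from response.url.split("/")[-2], which evaluates to login2 for every page under /login2/, so each save overwrites the previous one (the log indeed shows scraped-login2.html written twice). A minimal sketch of a collision-free variant, using only the standard library:

from urllib.parse import urlparse

def parse_item(self, response):
    """Save each page under a name derived from its full URL path."""
    # e.g. /login2/register.php -> scraped-login2-register.php.html
    path = urlparse(response.url).path.strip('/').replace('/', '-')
    filename = 'scraped-%s.html' % (path or 'index')
    with open(filename, 'wb') as f:
        f.write(response.body)
    self.logger.info('Saved file %s', filename)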
Answer 0 (score: 0)
CrawlSpider inherits from Spider; init_request only works when you inherit from InitSpider. So you need to change
def init_request(self):
    """This function is called before crawling starts"""
    return Request(url=self.login_page, callback=self.login)
to
def start_requests(self):
    """This function is called before crawling starts"""
    return [Request(url=self.login_page, callback=self.login)]
Next, the response in response.body will be bytes, so you need to change
if "Dashboard" in response.body:
to
if b"Dashboard" in response.body:
Answer 1 (score: 0)
Thanks to @Tarun Lalwani and some trial & error, here is the result:
from scrapy.http import Request, FormRequest
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy.selector import HtmlXPathSelector


class LoginSpider(CrawlSpider):
    name = 'loginspider'
    login_page = 'http://145.100.108.148/login2/login.php'
    start_urls = ['http://145.100.108.148/login2/index.php']
    username = 'test@hotmail.com'
    password = 'test'

    def init_request(self):
        return Request(url=self.login_page, callback=self.start_requests)

    def start_requests(self):
        print("\n start_request is here \n")
        yield Request(
            url=self.login_page,
            callback=self.login,
            dont_filter=True
        )

    def login(self, response):
        print("\n Login is here! \n")
        return FormRequest.from_response(
            response,
            formdata={'email': self.username,
                      'pass': self.password},
            callback=self.check_login_response)

    def check_login_response(self, response):
        print("\n Check_login_response \n")
        if b"Learn" in response.body:
            print("Worked, logged in")
            # return self.parse_item
        else:
            print("Not logged in")
            return
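As posted, this spider stops right after logging in: check_login_response never schedules another request, and the class defines no rules, so nothing is crawled or saved. For completeness, a sketch of how the pieces above could fit together; the b"Learn" marker, credentials, and URLs are taken from the posts, but none of this has been run against the site:

from urllib.parse import urlparse

from scrapy.http import Request, FormRequest
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class LoginSpider(CrawlSpider):
    name = 'loginspider'
    login_page = 'http://145.100.108.148/login2/login.php'
    start_urls = ['http://145.100.108.148/login2/index.php']
    username = 'test@hotmail.com'
    password = 'test'

    rules = (
        Rule(LinkExtractor(allow=r'.*'), callback='parse_item', follow=True),
    )

    def start_requests(self):
        # Log in first; the start URLs are only requested after the check.
        yield Request(self.login_page, callback=self.login, dont_filter=True)

    def login(self, response):
        return FormRequest.from_response(
            response,
            formdata={'email': self.username, 'pass': self.password},
            callback=self.check_login_response)

    def check_login_response(self, response):
        if b"Learn" in response.body:
            self.logger.info("Logged in, starting the crawl")
            # No callback: CrawlSpider's parse() applies the rules.
            for url in self.start_urls:
                yield Request(url, dont_filter=True)
        else:
            self.logger.error("Login failed")

    def parse_start_url(self, response):
        # CrawlSpider hook so index.php itself is saved too,
        # not only the pages its links lead to.
        return self.parse_item(response)

    def parse_item(self, response):
        # Collision-free filename from the URL path (see the note above).
        path = urlparse(response.url).path.strip('/').replace('/', '-')
        filename = 'scraped-%s.html' % (path or 'index')
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.logger.info('Saved file %s', filename)

With that, index.php would be saved via parse_start_url, and every page the rules follow (login.php, register.php, and so on) via parse_item, each under its own filename.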