Scrapy + Selenium: handling a 302 redirect

Time: 2015-12-27 21:19:26

Tags: javascript python html selenium selenium-webdriver

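I'm building a web scraper that logs into my bank account and collects data about my spending. I originally planned to use Scrapy alone, but that didn't work because the First Merit page uses JavaScript to log in, so I layered Selenium on top.

My code logs in (you have to enter the username first and then the password on a separate page, rather than both at once as on most sites) through a chain of yielded requests, each with its own callback that handles the next step.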

import scrapy
from scrapy import Request
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
import selenium
import time


class LoginSpider(scrapy.Spider):
    name = 'www.firstmerit.com'
    # allowed_domains = ['https://www.firstmeritib.com']
    start_urls = ['https://www.firstmeritib.com/AccountHistory.aspx?a=1']

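    def __init__(self, *args, **kwargs):
        # Let scrapy.Spider finish its own setup before opening the browser
        super(LoginSpider, self).__init__(*args, **kwargs)
        self.driver = webdriver.Firefox()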

    def parse(self, response):
        self.driver.get(response.url)

        # Obtaining necessary components to input my own stuff
        username = WebDriverWait(self.driver, 10).until(lambda driver: self.driver.find_element_by_xpath('//*[@id="txtUsername"]'))
        login_button = WebDriverWait(self.driver, 10).until(lambda driver: self.driver.find_element_by_xpath('//*[@id="btnLogin"]'))

        # The actual interaction
        username.send_keys("username")
        login_button.click()

        # Logging in is split across two callbacks because the website asks for the
        # username first, then redirects to a password page where I can finally
        # enter my account (after inputting the password)
        yield Request(url=self.driver.current_url,
                      callback=self.password_handling,
                      meta={'dont_redirect': True,
                            'handle_httpstatus_list': [302],
                            'cookiejar': response})


    def password_handling(self, response):

        print("^^^^^^")
        print(response.url)

        password = WebDriverWait(self.driver, 10).until(lambda driver: self.driver.find_element_by_xpath('//*[@id="MainContent_txtPassword"]'))
        login_button2 = WebDriverWait(self.driver, 10).until(lambda driver: self.driver.find_element_by_xpath('//*[@id="MainContent_btnLogin"]'))

        password.send_keys("password")
        login_button2.click()

        print("*****")
        print(self.driver.current_url)
        print("*****")

        yield Request(url=self.driver.current_url,
                      callback=self.after_login,  # , dont_filter=True,
                      meta={'dont_redirect': True,
                            'handle_httpstatus_list': [302],
                            'cookiejar': response.meta['cookiejar']})

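    def after_login(self, response):
        print("***")
        print(response.url)
        print("***")

        print(response.body)

        # Assumes the account page contains the text "Account Activity",
        # so its absence is treated as a failed login
        if "Account Activity" not in response.body:
            self.logger.error("Login failed")
            return
        else:
            print("you got through!")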

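The problem is that once I finally reach my account page, where all my spending is displayed, I can't actually access the HTML data. I believe I have handled the 302 redirect correctly, but the "meta =" options seem to let Selenium reach the page without letting me scrape it.

Instead of getting all the data from response.body in the after_login function, I get the following: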
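A note on the mechanics described here: a Request yielded from a callback is fetched by Scrapy's own downloader, not by the Firefox window that Selenium drives, and the 'cookiejar' meta key only labels a cookie session inside Scrapy. One possible way to carry the browser's session over, assuming the site tracks the login with cookies (not confirmed in this thread), is to pass Selenium's cookies along with the request. The sketch below is written for Python 3; the helper name and the sample cookie values are invented for illustration.

```python
def selenium_cookies_to_dict(cookies):
    """Reduce Selenium's cookie records to a {name: value} mapping.

    driver.get_cookies() returns a list of dicts that carry at least
    the keys 'name' and 'value'; Scrapy's Request accepts a plain dict
    through its `cookies` argument.
    """
    return {cookie["name"]: cookie["value"] for cookie in cookies}


# Sample data shaped like the output of driver.get_cookies().
# The cookie names and values here are made up for the example.
sample = [
    {"name": "ASP.NET_SessionId", "value": "abc123", "path": "/", "secure": True},
    {"name": "lang", "value": "en", "path": "/", "secure": False},
]

cookies = selenium_cookies_to_dict(sample)
print(cookies)

# Inside the spider this could be used roughly as follows (untested sketch):
#
#     yield Request(url=self.driver.current_url,
#                   cookies=selenium_cookies_to_dict(self.driver.get_cookies()),
#                   callback=self.after_login)
```

Another option that avoids a second download entirely is to read self.driver.page_source, which holds the HTML the browser has already loaded, and parse that instead of response.body.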

<html><head><title>Object moved</title></head><body>
<h2>Object moved to <a href="/Default.aspx?ReturnURL=%2fAccountHistory.aspx%3fa%3d1">here</a>.</h2>
</body></html> 
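The body above is the 302 response itself: with 'dont_redirect' set and 302 listed in 'handle_httpstatus_list', the callback receives the redirect instead of the page it points to. Decoding the link shows where the server wanted to send the request. This is a small Python 3 sketch using only the standard library; the class and function names are made up for the example.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qs, urlsplit

# The response body exactly as quoted in the question
BODY = """<html><head><title>Object moved</title></head><body>
<h2>Object moved to <a href="/Default.aspx?ReturnURL=%2fAccountHistory.aspx%3fa%3d1">here</a>.</h2>
</body></html>"""


class FirstLinkParser(HTMLParser):
    """Remember the href of the first <a> tag encountered."""

    def __init__(self):
        super().__init__()
        self.href = None

    def handle_starttag(self, tag, attrs):
        if tag == "a" and self.href is None:
            self.href = dict(attrs).get("href")


def redirect_target(body):
    """Return the link the 'Object moved' page points to."""
    parser = FirstLinkParser()
    parser.feed(body)
    return parser.href


def return_url(href):
    """Return the percent-decoded ReturnURL query parameter, if present."""
    values = parse_qs(urlsplit(href).query).get("ReturnURL", [])
    return values[0] if values else None


target = redirect_target(BODY)
print(target)              # path of the page the server redirects to
print(return_url(target))  # page that was originally requested
```

The target is Default.aspx with the originally requested AccountHistory.aspx page as its ReturnURL. That pattern usually means the server did not consider the request logged in, though the thread does not confirm this.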

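How can I actually get at this information? Is this redirect a protection put in place by the bank to keep accounts from being scraped? Thanks!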
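On the second question, a redirect like this does not have to be an anti-scraping measure. An ordinary login check produces the same response for any client that arrives without a session. The toy model below (Python 3, standard library only) shows that behaviour. It is not the bank's real logic: the server class, the cookie value, and the response bodies are all invented for illustration.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

SESSION_COOKIE = "session=abc123"  # hypothetical session cookie


class ToyBank(BaseHTTPRequestHandler):
    """Serve the page only when the expected session cookie is sent."""

    def do_GET(self):
        if self.headers.get("Cookie") == SESSION_COOKIE:
            self._reply(200, b"Account Activity")
        else:
            # No session: send the client to the login page
            self._reply(302, b"Object moved", location="/Default.aspx")

    def _reply(self, status, body, location=None):
        self.send_response(status)
        if location:
            self.send_header("Location", location)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example output quiet


def fetch(port, path, headers=None):
    """GET a path without following redirects; return (status, Location, body)."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        conn.request("GET", path, headers=headers or {})
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location"), resp.read()
    finally:
        conn.close()


# Start the toy server on a free local port
server = HTTPServer(("127.0.0.1", 0), ToyBank)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

anonymous = fetch(port, "/AccountHistory.aspx?a=1")
logged_in = fetch(port, "/AccountHistory.aspx?a=1", {"Cookie": SESSION_COOKIE})

server.shutdown()
server.server_close()

print(anonymous)
print(logged_in)
```

The request without the cookie gets the 302 and the "Object moved" body, while the same request with the cookie gets the page. If the real site behaves this way, the fix lies in sharing the browser's session with the scraper rather than in the redirect handling.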

0 answers:

No answers yet