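So I'm building a web scraper that logs into my bank account and collects data about my spending. I originally planned to use only Scrapy, but that didn't work because the First Merit page uses JavaScript to log in, so I layered Selenium on top.
My code logs in (you first enter the username and then the password on a separate page, rather than both together as on most sites) through a series of yielded requests, each with its own callback function to handle the next step.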
import scrapy
from scrapy import Request
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait


class LoginSpider(scrapy.Spider):
    name = 'www.firstmerit.com'
    # allowed_domains = ['https://www.firstmeritib.com']
    start_urls = ['https://www.firstmeritib.com/AccountHistory.aspx?a=1']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.driver = webdriver.Firefox()

    def parse(self, response):
        self.driver.get(response.url)

        # Obtaining necessary components to input my own stuff
        username = WebDriverWait(self.driver, 10).until(
            lambda driver: driver.find_element(By.XPATH, '//*[@id="txtUsername"]'))
        login_button = WebDriverWait(self.driver, 10).until(
            lambda driver: driver.find_element(By.XPATH, '//*[@id="btnLogin"]'))

        # The actual interaction
        username.send_keys("username")
        login_button.click()

        # The process of logging in is broken up in two functions since the website requires me
        # to enter my username first, which redirects me to a password page where I can finally
        # enter my account (after inputting the password)
        yield Request(url=self.driver.current_url,
                      callback=self.password_handling,
                      meta={'dont_redirect': True,
                            'handle_httpstatus_list': [302],
                            'cookiejar': response})

    def password_handling(self, response):
        print("^^^^^^")
        print(response.url)
        password = WebDriverWait(self.driver, 10).until(
            lambda driver: driver.find_element(By.XPATH, '//*[@id="MainContent_txtPassword"]'))
        login_button2 = WebDriverWait(self.driver, 10).until(
            lambda driver: driver.find_element(By.XPATH, '//*[@id="MainContent_btnLogin"]'))
        password.send_keys("password")
        login_button2.click()
        print("*****")
        print(self.driver.current_url)
        print("*****")
        yield Request(url=self.driver.current_url,
                      callback=self.after_login,  # , dont_filter=True,
                      meta={'dont_redirect': True,
                            'handle_httpstatus_list': [302],
                            'cookiejar': response.meta['cookiejar']})

    def after_login(self, response):
        print("***")
        print(response.url)
        print("***")
        print(response.text)
        # The account page should contain "Account Activity"; if it is missing, the login failed
        if "Account Activity" not in response.text:
            self.logger.error("Login failed")
            return
        else:
            print("you got through!")
            print()
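The problem is that once I finally reach my account page, where all of my spending is displayed, I can't actually access the HTML data. As far as I can tell I have handled the 302 redirects properly, but the "meta =" options seem to take me to the page through Selenium without letting me scrape it.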
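One thing I suspect (and this is only a guess) is that the Selenium browser and Scrapy keep separate cookie stores, so the Scrapy request may be arriving without the session that Selenium logged in with. Below is a minimal sketch of how the cookies might be carried over. `selenium_cookies_to_dict` is a hypothetical helper of my own and the sample cookie is made up; it relies on `driver.get_cookies()` returning a list of dicts with `name` and `value` keys, and on `scrapy.Request` accepting a `cookies` dict.

```python
def selenium_cookies_to_dict(selenium_cookies):
    """Flatten Selenium's list of cookie dicts into a {name: value} mapping."""
    return {cookie["name"]: cookie["value"] for cookie in selenium_cookies}


# Made-up sample in the shape returned by driver.get_cookies()
sample = [
    {"name": "ASP.NET_SessionId", "value": "abc123",
     "domain": "www.firstmeritib.com", "path": "/"},
]
cookies = selenium_cookies_to_dict(sample)
print(cookies)

# In the spider this might be used like so (untested against the real site):
# yield Request(url=self.driver.current_url,
#               cookies=selenium_cookies_to_dict(self.driver.get_cookies()),
#               callback=self.after_login)
```

I have not confirmed that this is what the site actually checks, so treat it as something to try rather than a known fix.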
Instead of getting all the data from response.body in the after_login function, I get the following:
<html><head><title>Object moved</title></head><body>
<h2>Object moved to <a href="/Default.aspx?ReturnURL=%2fAccountHistory.aspx%3fa%3d1">here</a>.</h2>
</body></html>
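For reference, here is a small standard-library sketch that decodes the link in that body. The `href` value is copied from the output above; the variable names are only for illustration. It suggests the 302 is sending the request to `/Default.aspx` (presumably the login page, though that is my assumption), with the page I originally asked for carried in `ReturnURL`.

```python
from urllib.parse import urlsplit, parse_qs

# href copied verbatim from the "Object moved" body above
href = "/Default.aspx?ReturnURL=%2fAccountHistory.aspx%3fa%3d1"

parts = urlsplit(href)
# parse_qs percent-decodes the query string values
return_url = parse_qs(parts.query)["ReturnURL"][0]

print(parts.path)   # /Default.aspx
print(return_url)   # /AccountHistory.aspx?a=1
```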
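How can I actually get at this information? Is this redirect something the bank put in place to prevent accounts from being scraped? Thanks!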