I've just started playing with Scrapy, and I'm trying to crawl a website that requires a login. I got it working fine for GitHub: I found the form id, added the required fields, and everything went as planned.
However, when I try the same thing on the Investopedia site, I get stuck. My code is attached.
from scrapy.contrib.spiders.init import InitSpider
from scrapy.http import Request, FormRequest

class Investo_spider(InitSpider):
    name = 'investo_spider'
    allowed_domains = ['investopedia.com']
    login_page = 'http://www.investopedia.com/accounts/login.aspx'
    start_urls = ['http://www.investopedia.com']

    def init_request(self):
        # Fetch the login page before anything else
        return Request(url=self.login_page, callback=self.login)

    def login(self, response):
        return FormRequest.from_response(response,
            formdata={'email': 'mymail', 'password': 'mypass'},
            callback=self.check_login_response)

    def check_login_response(self, response):
        if "myname" in response.body:
            self.log("Successfully logged in. Let's start crawling!")
            self.initialized()
        else:
            self.log("Login was unsuccessful")

    def parse_item(self, response):
        print 'I got in here, finally!!!!'
I tried adding formnumber=0 and clickdata={'nr': 0}, and I tried changing the method (POST or GET), even though the defaults already select the right form and it is clickable.
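When from_response keeps submitting the wrong thing, it can help to list the forms the page actually contains before guessing at formnumber or clickdata. A minimal Python 3 sketch using only the stdlib HTML parser; the HTML snippet here is a made-up stand-in, in a spider you would feed it response.body instead:

```python
from html.parser import HTMLParser

class FormLister(HTMLParser):
    """Collects each <form> with its action and the names of its inputs."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.forms = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'form':
            self.forms.append({'action': attrs.get('action'), 'inputs': []})
        elif tag == 'input' and self.forms:
            self.forms[-1]['inputs'].append(attrs.get('name'))

# Hypothetical login page markup, standing in for the real response body
html = '''
<form action="/accounts/login.aspx">
  <input name="email"><input name="password">
  <input type="hidden" name="form_id">
</form>
'''
lister = FormLister()
lister.feed(html)
print(lister.forms)
```

Dumping this for the real page shows whether the form Scrapy picks by default even contains the fields you are filling in, and which hidden inputs it expects.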
Surprisingly, the same parameters work for me in a mechanize browser, and I can convert the resulting html into an HtmlResponse object that Scrapy can handle:
import mechanize
from scrapy.http import HtmlResponse

br = mechanize.Browser()
br.open("http://www.investopedia.com/accounts/login.aspx")
br.select_form(nr=0)
br.form["email"] = 'mymail'
br.form["password"] = 'mypass'
br.submit()
br.open('http://www.investopedia.com')
response = HtmlResponse(url="some_url", body=br.response().read())
However, this means I have to carry the mechanize browser around, which I don't think is the best solution. I figure I must be missing something. I'd really appreciate your input on this. Thanks!
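One way to avoid carrying the browser around would be to copy mechanize's session cookies into a plain dict once, right after br.submit(), and pass that dict to scrapy.Request(..., cookies=...). A Python 3 sketch of just the conversion step, using a hand-built stdlib jar as a stand-in for the one mechanize fills (mechanize's real hook for attaching such a jar is br.set_cookiejar()):

```python
from http.cookiejar import CookieJar, Cookie

def jar_to_dict(jar):
    """Flatten a cookie jar into the {name: value} mapping Scrapy accepts."""
    return {cookie.name: cookie.value for cookie in jar}

# Stand-in for the jar mechanize would have filled after br.submit();
# the cookie name and value here are illustrative, not Investopedia's.
jar = CookieJar()
jar.set_cookie(Cookie(
    version=0, name='SESSIONID', value='abc123', port=None,
    port_specified=False, domain='www.investopedia.com',
    domain_specified=True, domain_initial_dot=False, path='/',
    path_specified=True, secure=False, expires=None, discard=True,
    comment=None, comment_url=None, rest={}, rfc2109=False))

print(jar_to_dict(jar))
```

With the session reduced to a dict, mechanize is only needed once at login time and the rest of the crawl can stay in Scrapy.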
Answer 0 (score: 0)
You have to handle the redirect. This should work for you:
import scrapy

class Investo_spider(scrapy.Spider):
    name = 'investo_spider'
    allowed_domains = ['investopedia.com']
    login_page = 'http://www.investopedia.com/accounts/login.aspx'
    start_urls = ['http://www.investopedia.com']

    def parse(self, response):
        # Submit the login form, including the site's hidden fields
        return scrapy.FormRequest('http://www.investopedia.com/accounts/login.aspx',
            formdata={'email': 'you_email', 'password': 'your_password',
                      'form_build_id': 'form-v14V92zFkSSVFSerfvWyH1WEUoxrV2khjfhAETJZydk',
                      'form_id': 'account_api_form',
                      'op': 'Sign in'},
            # the login endpoint answers with a 302; keep it instead of following it
            meta={'dont_redirect': True, 'handle_httpstatus_list': [302]},
            callback=self.check_login_response)

    def check_login_response(self, response):
        return scrapy.Request('http://www.investopedia.com/accounts/manageprofile.aspx',
                              self.validate_login)

    def validate_login(self, response):
        if "myname" in response.body:
            self.log("Successfully logged in. Let's start crawling!")
            # hand off to the pages you actually want to scrape
            return scrapy.Request('http://www.investopedia.com',
                                  callback=self.parse_item, dont_filter=True)
        else:
            self.log("Login was unsuccessful")

    def parse_item(self, response):
        print 'I got in here, finally!!!!'