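Guys, I have been trying to scrape some data from my website. I am using Python Scrapy.
After going through the documentation, I tried it on my website with this HTML form, and everything worked fine: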
<form action="http://mywebsite.com/login/process" method="post">
<div class="body bg-gray">
<div class="form-group">
<input type="text" name="userid" class="form-control" placeholder="User ID" autocomplete="off">
</div>
<div class="form-group">
<input type="password" name="password" class="form-control" placeholder="Password">
</div>
</div>
<div class="footer">
<button type="submit" name="tempLoginProcess" value="" class="btn bg-olive btn-block">Sign me in</button>
</div>
</form>
For this I use the following Python Scrapy code:
import scrapy
from scrapy.selector import Selector
from scrapy.selector import HtmlXPathSelector
from scrapy.http import FormRequest, Request

class LoginSpider(scrapy.Spider):
    name = 'mywebsite.com'
    start_urls = ['http://mywebsite.com/login']

    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formdata = {
                'userid': 'admin',
                'password': 'admin',
            },
            callback = self.after_login
        )

    def after_login(self, response): #check login succeed before going on
        dat = self.log(response.body)
        return dat
Now the problem:
I then tried to log in to another account on a different website of mine, whose form looks like this (it is more complex):
<form accept-charset="UTF-8" action="/users/sign_in" html="{:onsubmit=>"if($(this).valid()) $('input[type=\"submit\"]').attr('disabled','disabled');"}" method="post">
<div style="margin:0;padding:0;display:inline">
<input name="utf8" type="hidden" value="✓">
<input name="authenticity_token" type="hidden" value="Luvho/8odzEsVYhteyYtkwUhN0whT6nlFj4W4wth//s=">
</div>
<div align="center" class="alert-alert" style="margin-left: 10px;font-size:12px;color:red;">Email or password is incorrect. Please try again or click on Forgot Password</div>
<div class="col-md-12 signupemail">
<input id="user_email" name="user[email]" placeholder="Email" size="30" type="email">
</div>
<div class="col-md-12 signuppassword">
<input id="user_password" name="user[password]" placeholder="Password" size="30" type="password">
</div>
<div class="col-md-12 signupsubmit">
<button type="submit" class="btn" id="">Submit</button>
</div>
This form sits inside a colorbox / lightbox.
Now I am trying it like this:
import scrapy
from scrapy.selector import Selector
from scrapy.selector import HtmlXPathSelector
from scrapy.http import FormRequest, Request

class LoginSpider(scrapy.Spider):
    name = 'my2website.com'
    start_urls = ['http://www.my2website.com/users/sign_in']

    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formdata = {
                'user': {
                    'email': 'fabdeal@my2website.com',
                    'password': 'my2website@123'
                }
            },
            callback = self.after_login
        )

    def after_login(self, response): #check login succeed before going on
        dat = self.log(response.body)
        return dat
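It does not go on to the next page and still just prints the login page, which surely means the login did not succeed. Could you check and help me understand the error?
I think this is the final output: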
2015-12-04 03:02:21 [scrapy] INFO: Enabled item pipelines:
2015-12-04 03:02:21 [scrapy] INFO: Spider opened
2015-12-04 03:02:21 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-12-04 03:02:21 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-12-04 03:02:23 [scrapy] DEBUG: Crawled (200) <GET http://www.my2website.com/users/sign_in> (referer: None)
2015-12-04 03:02:26 [scrapy] DEBUG: Crawled (200) <GET http://www.my2website.com/search_terms/search_for_user?utf8=%E2%9C%93&term=&commit=&user=password&user=email> (referer: http://www.my2website.com/users/sign_in)
2015-12-04 03:02:26 [scrapy] INFO: Closing spider (finished)
2015-12-04 03:02:26 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 899,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 40537,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 12, 3, 21, 32, 26, 841202),
'log_count/DEBUG': 4,
'log_count/INFO': 7,
'request_depth_max': 1,
'response_received_count': 2,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2015, 12, 3, 21, 32, 21, 846934)}
Let me know if further information is needed.
I am just new to scraping and Scrapy.
**This is the website I am unable to scrape** ORIGINAL WEBSITE LINK
Answer 0 (score: 3)
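You need to pass the value of the authenticity_token field in the login POST; it is a security measure. It is called a synchronizer token and is used to prevent CSRF attacks. Read here for more information on the topic.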
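To make the token concrete, here is a minimal standard-library sketch that collects the hidden inputs from a trimmed copy of the form in the question. The class name `HiddenInputCollector` and the `FORM_HTML` constant are made up for illustration. A token like this typically changes between sessions, so it has to be read from the freshly fetched login page rather than hard-coded.

```python
from html.parser import HTMLParser


class HiddenInputCollector(HTMLParser):
    """Collects name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.hidden = {}

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "input" and attributes.get("type") == "hidden":
            name = attributes.get("name")
            if name:
                self.hidden[name] = attributes.get("value") or ""


# Trimmed copy of the login form from the question (illustrative only).
FORM_HTML = """
<form accept-charset="UTF-8" action="/users/sign_in" method="post">
  <input name="utf8" type="hidden" value="✓">
  <input name="authenticity_token" type="hidden" value="Luvho/8odzEsVYhteyYtkwUhN0whT6nlFj4W4wth//s=">
  <input id="user_email" name="user[email]" type="email">
  <input id="user_password" name="user[password]" type="password">
</form>
"""

collector = HiddenInputCollector()
collector.feed(FORM_HTML)
print(collector.hidden["authenticity_token"])
```

In a spider the same value would come from the response itself, for example with a CSS selector on `input[name=authenticity_token]`.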
So your parse function should be:
def parse(self, response):
    # Parse the security token from the hidden form field.
    token = response.css('input[name=authenticity_token]::attr(value)').extract_first()
    return scrapy.FormRequest.from_response(
        response,
        # formdata is a flat mapping keyed by each input's name attribute,
        # so the bracketed names from the form are used as-is.
        formdata = {
            'user[email]': 'fabdeal@my2website.com',
            'user[password]': 'my2website@123',
            'authenticity_token': token,
        },
        callback = self.after_login
    )
Hope it works.
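As a side note, the `user=password&user=email` fragment in the logged URL is what a nested dict turns into when form data is URL-encoded: the inner dict is iterated like a list, so only its keys are sent and the actual values are lost. The sketch below shows this with the standard library's `urlencode`. It is an assumption that Scrapy's internal encoding behaves the same way, although the log line is consistent with it.

```python
from urllib.parse import urlencode

# Nested mapping, as in the question's spider.
nested = {
    "user": {
        "email": "fabdeal@my2website.com",
        "password": "my2website@123",
    }
}

# Flat mapping keyed by the inputs' name attributes.
flat = {
    "user[email]": "fabdeal@my2website.com",
    "user[password]": "my2website@123",
}

# doseq=True treats non-string values as sequences; iterating a dict yields its keys.
nested_encoded = urlencode(nested, doseq=True)
flat_encoded = urlencode(flat, doseq=True)

print(nested_encoded)  # user=email&user=password
print(flat_encoded)
```

The flat form keeps both the bracketed field names (percent-encoded) and the credentials, which is what the server expects for `user[email]` and `user[password]`.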