I am trying to write a simple Scrapy Python script that scrapes data from a site that uses an ASP.NET web form, like this:
onclick="javascript:WebForm_DoPostBackWithOptions(new WebForm_PostBackOptions("ctl00$ctl00$ctl00$ContentPlaceHolderDefault$DivMainCPH$ctl01$NRANearYouControl_2$imgbtn_Locate", "", true, "", "", false, false))"
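Let's say my site is example.com. Below is my Python spider. It does not work correctly: it gives me back the HTML of the form page, not the actual response to the POST request.
There must be a way to retrieve the __VIEWSTATE that ASP.NET uses and capture it, so that I can resend the POST request with the correct form data. I am using the Chrome debugger to try to see exactly what the request sends.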
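To make concrete what "capturing the __VIEWSTATE" would involve, here is a minimal standalone sketch, separate from the spider. It assumes the value lives in a hidden `<input name="__VIEWSTATE">` element, which is how ASP.NET WebForms pages normally emit it. The names `HiddenFieldParser`, `hidden_fields` and `SAMPLE_HTML` are made up for illustration, and the sample markup is invented, not taken from the real site.

```python
from html.parser import HTMLParser


class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing tags such as <input ... />.
        if tag != "input":
            return
        attr = dict(attrs)
        is_hidden = (attr.get("type") or "").lower() == "hidden"
        if is_hidden and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def hidden_fields(html):
    """Return a dict of every hidden form field found in the HTML text."""
    parser = HiddenFieldParser()
    parser.feed(html)
    return parser.fields


# Invented sample page; a real ASP.NET page carries a much longer value.
SAMPLE_HTML = """
<form method="post" action="/">
  <input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="dDwtMTIzNDU2Nzg5Ozs+" />
  <input type="hidden" name="__EVENTVALIDATION" value="abc123" />
  <input type="text" name="q" value="" />
</form>
"""

fields = hidden_fields(SAMPLE_HTML)
print(fields["__VIEWSTATE"])  # dDwtMTIzNDU2Nzg5Ozs+
```

If that assumption holds, the selector equivalent inside a spider would be something like `//input[@name="__VIEWSTATE"]/@value`, and the collected hidden fields would be sent back together with the visible form values. Scrapy's `FormRequest.from_response` is documented to pre-fill form fields from the response, hidden inputs included, so it may be the in-framework way to do the same thing.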
from scrapy.spider import Spider
from scrapy.http import FormRequest
from scrapy.selector import Selector
from scrapy.http import Request


class ShootSpider(Spider):
    name = "shoot"
    allowed_domains = ["example.com"]
    start_urls = ["http://example.com"]

    def parse(self, response):
        return Request("http://findtest.example.com/",
                       callback=self.parse_2)

    def parse_2(self, response):
        sel = Selector(response)
        self.log('A VIEW_STATE %s' % sel.xpath('//__VIEWSTATE').extract())
        self.log('A HEADER %s' % response.headers)
        self.log('A META %s' % response.meta)
        return [FormRequest(url="http://findtest.example.com/",
                            formdata={'__EVENTTARGET': '', '__EVENTARGUMENT': '', '__VIEWSTATE': '',
                                      'ctl00%24ctl00%24ctl00%24ContentPlaceHolderDefault%24DivMainCPH%24ctl01%24NRANearYouControl_2%24LocationTextBox': 'richmond+VA',
                                      'ctl00%24ctl00%24ctl00%24ContentPlaceHolderDefault%24DivMainCPH%24ctl01%24NRANearYouControl_2%24ddlMiles': '200.1',
                                      'ctl00%24ctl00%24ctl00%24ContentPlaceHolderDefault%24DivMainCPH%24ctl01%24NRANearYouControl_2%24chkGrids%248': 'dg1',
                                      'ctl00%24ctl00%24ctl00%24ContentPlaceHolderDefault%24DivMainCPH%24ctl01%24NRANearYouControl_2%24imgbtn_Locate.x': '43',
                                      'ctl00%24ctl00%24ctl00%24ContentPlaceHolderDefault%24DivMainCPH%24ctl01%24NRANearYouControl_2%24imgbtn_Locate.y': '12'},
                            callback=self.after_post)]

    def after_post(self, response):
        self.log('A response from %s' % response.url)
        #self.log('Body %s' % response.body)
        filename = "Example_HTML_SCRAPE"
        open(filename, 'wb').write(response.body)
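One thing I am unsure about in the formdata above: I copied the field names from the Chrome debugger in their already URL-encoded form (`%24` for `$`, `+` for a space). As far as I understand, FormRequest URL-encodes formdata itself, so these keys may end up encoded twice. The sketch below uses the standard library's `urlencode` as a stand-in to show the effect; it is not Scrapy's own code path, and `ctl00$Box` is a shortened, hypothetical field name.

```python
from urllib.parse import urlencode

# Hypothetical short field name standing in for the long ctl00$... names.
pre_encoded = {"ctl00%24Box": "richmond+VA"}  # as copied from the debugger
raw = {"ctl00$Box": "richmond VA"}            # unencoded name and value

double_encoded_body = urlencode(pre_encoded)
single_encoded_body = urlencode(raw)

print(double_encoded_body)  # ctl00%2524Box=richmond%2BVA
print(single_encoded_body)  # ctl00%24Box=richmond+VA
```

Only the second body matches what the browser sends, which suggests passing the raw `$` names and plain values and leaving the encoding to the request.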
I am getting HTTP status 200 (OK) back after the crawl.
Here are the crawl statistics from the command line:
DEBUG: Crawled (200) <POST http://findtest.example.com/> (referer: http://findtest.example.com/)
DEBUG: A response from http://findtest.example.com/
INFO: Closing spider (finished)
INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1451,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 2,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 100405,
'downloader/response_count': 3,
'downloader/response_status_count/200': 3,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2014, 4, 24, 2, 42, 42, 766547),
'log_count/DEBUG': 9,
'log_count/INFO': 7,
'request_depth_max': 2,
'response_received_count': 3,
'scheduler/dequeued': 3,
'scheduler/dequeued/memory': 3,
'scheduler/enqueued': 3,
'scheduler/enqueued/memory': 3,
'start_time': datetime.datetime(2014, 4, 24, 2, 42, 41, 381395)}