Question

I am trying to write a simple Scrapy Python script that scrapes data from a site that uses a web form like the one shown below:

onclick="javascript:WebForm_DoPostBackWithOptions(new WebForm_PostBackOptions("ctl00$ctl00$ctl00$ContentPlaceHolderDefault$DivMainCPH$ctl01$NRANearYouControl_2$imgbtn_Locate", "", true, "", "", false, false))"

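Let's say my site is example.com. Below is my Python spider, which does not work correctly: it gets HTML back, but not the actual response to the POST request.

There must be a way to retrieve the __VIEWSTATE that ASP.NET uses and send it back along with the correct form data when re-sending the POST request. I am using the Chrome debugger to try to see exactly what the request sends.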

from scrapy.spider import Spider
from scrapy.http import FormRequest
from scrapy.selector import Selector
from scrapy.http import Request

class ShootSpider(Spider):
    name = "shoot"
    allowed_domains = ["example.com"]
    start_urls = ["http://example.com"]

    def parse(self, response):
        return Request("http://findtest.example.com/",
                       callback=self.parse_2)

    def parse_2(self, response):
        sel = Selector(response)
        self.log('A VIEW_STATE %s' % sel.xpath('//__VIEWSTATE').extract())
        self.log('A HEADER %s' % response.headers)
        self.log('A META %s' % response.meta)
        return [FormRequest(url="http://findtest.example.com/",
            formdata={'__EVENTTARGET': '', '__EVENTARGUMENT': '', '__VIEWSTATE': '',
                'ctl00%24ctl00%24ctl00%24ContentPlaceHolderDefault%24DivMainCPH%24ctl01%24NRANearYouControl_2%24LocationTextBox': 'richmond+VA',
                'ctl00%24ctl00%24ctl00%24ContentPlaceHolderDefault%24DivMainCPH%24ctl01%24NRANearYouControl_2%24ddlMiles': '200.1',
                'ctl00%24ctl00%24ctl00%24ContentPlaceHolderDefault%24DivMainCPH%24ctl01%24NRANearYouControl_2%24chkGrids%248': 'dg1',
                'ctl00%24ctl00%24ctl00%24ContentPlaceHolderDefault%24DivMainCPH%24ctl01%24NRANearYouControl_2%24imgbtn_Locate.x': '43',
                'ctl00%24ctl00%24ctl00%24ContentPlaceHolderDefault%24DivMainCPH%24ctl01%24NRANearYouControl_2%24imgbtn_Locate.y': '12'},
            callback=self.after_post)]

    def after_post(self, response):
        self.log('A response from %s' % response.url)
        #self.log('Body %s' % response.body)
        filename = "Example_HTML_SCRAPE"
        open(filename, 'wb').write(response.body)

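I am getting HTTP status 200 (OK) back after the crawl.

Here are the crawl stats from the command line:

DEBUG: Crawled (200) http://findtest.example.com/> (referer: http://findtest.example.com/)

DEBUG: A response from http://findtest.example.com/

INFO: Closing spider (finished)

INFO: Dumping Scrapy stats: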

{'downloader/request_bytes': 1451,
 'downloader/request_count': 3,
 'downloader/request_method_count/GET': 2,
 'downloader/request_method_count/POST': 1,
 'downloader/response_bytes': 100405,
 'downloader/response_count': 3,
 'downloader/response_status_count/200': 3,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2014, 4, 24, 2, 42, 42, 766547),
 'log_count/DEBUG': 9,
 'log_count/INFO': 7,
 'request_depth_max': 2,
 'response_received_count': 3,
 'scheduler/dequeued': 3,
 'scheduler/dequeued/memory': 3,
 'scheduler/enqueued': 3,
 'scheduler/enqueued/memory': 3,
 'start_time': datetime.datetime(2014, 4, 24, 2, 42, 41, 381395)}
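
For illustration, the __VIEWSTATE round trip described above can be sketched with the standard library alone. This is a minimal sketch under stated assumptions, not the site's actual form: `SAMPLE_HTML`, the shortened field name `ctl00$LocationTextBox`, and the helpers `HiddenFieldParser` and `build_postback_body` are all hypothetical. It collects every hidden input from the first response (including `__VIEWSTATE`), merges in the visible fields, and URL-encodes the result exactly once, so field names are written with a literal `$` rather than a pre-encoded `%24`.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        # Self-closing <input ... /> tags also end up here by default.
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def build_postback_body(html, extra_fields):
    """Merge the page's hidden fields with our own and URL-encode once."""
    parser = HiddenFieldParser()
    parser.feed(html)
    data = dict(parser.fields)   # __VIEWSTATE etc. as sent by the server
    data.update(extra_fields)    # visible fields we fill in ourselves
    return urlencode(data)       # '$' becomes '%24' here, not before


# Hypothetical page fragment standing in for the real form.
SAMPLE_HTML = """
<form method="post" action="/">
  <input type="hidden" name="__EVENTTARGET" value="" />
  <input type="hidden" name="__VIEWSTATE" value="/wEPDwUKLTk2Njk=" />
  <input type="text" name="ctl00$LocationTextBox" />
</form>
"""

body = build_postback_body(SAMPLE_HTML, {"ctl00$LocationTextBox": "richmond VA"})
print(body)
```

In Scrapy, `FormRequest.from_response(response, formdata=..., callback=...)` is meant to do the same collection of existing form fields from the response before adding the supplied `formdata`. Whether it selects the right form and submit control on this particular page is something that would still need to be checked against the request seen in the Chrome debugger.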

Scrapy POST request using __VIEWSTATE does not work with JavaScript WebForm_PostBack

0 answers: