Hi everyone, I'm new to programming, so please let me know if I'm leaving out any important information.
I am currently scraping this website with Scrapy, but because the site uses __doPostBack I am running into trouble with the pagination. The pager links look like this:

href="javascript:__doPostBack('ctl00$cpContent$rg_MemberList$ctl00$ctl02$ctl00$ctl05','')"

The trailing ctl05 segment is basically the only thing that changes, and it increases in increments of 2. Right now I can scrape up to page 11, whose target is

ctl00$cpContent$rg_MemberList$ctl00$ctl02$ctl00$ctl25
However, Scrapy cannot seem to get past that to page 12, and when I inspect the element in Chrome I see that the __EVENTTARGET for page 12 is

ctl00$cpContent$rg_MemberList$ctl00$ctl02$ctl00$ctl09
So I tried adding __EVENTTARGET, __EVENTARGUMENT, __VIEWSTATE and __EVENTVALIDATION as form data, but it still does not work.
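To sanity-check which targets the pager actually exposes on a given page, I figure something like this in scrapy shell should list them all (restricting to the __doPostBack links shown above):

    # run inside: scrapy shell http://www.fmm.org.my/Member_List.aspx
    import re

    # every postback target mentioned in the page's __doPostBack links
    hrefs = response.xpath(
        '//a[starts-with(@href, "javascript:__doPostBack")]/@href').extract()
    targets = [re.search(r"__doPostBack\('([^']+)'", h).group(1) for h in hrefs]
    print(targets)

Here is my full spider: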
import scrapy


class FmmSpider(scrapy.Spider):
    name = 'FMM'
    url = 'http://www.fmm.org.my/Member_List.aspx'

    def start_requests(self):
        yield scrapy.Request(url=self.url, callback=self.parse_form)

    def parse_form(self, response):
        # ASP.NET hidden fields, taken once from the first page
        VIEWSTATE = response.xpath('//*[@id="__VIEWSTATE"]/@value').extract_first()
        EVENTVALIDATION = response.xpath('//*[@id="__EVENTVALIDATION"]/@value').extract_first()
        for page_number in range(50):
            formdata = {
                # change pages here
                "__EVENTTARGET": "ctl00$cpContent$rg_MemberList$ctl00$ctl02$ctl00$ctl{page_number:02d}".format(page_number=page_number),
                "__EVENTARGUMENT": "",
                "__VIEWSTATE": VIEWSTATE,
                "__EVENTVALIDATION": EVENTVALIDATION,
            }
            yield scrapy.FormRequest(url=self.url, formdata=formdata, callback=self.parse_list)

    def parse_list(self, response):
        # follow every member profile link in the results table
        urls = response.css('table.memberSearchTable > tr > td > div > a::attr(href)').extract()
        for url in urls:
            yield scrapy.Request(url=response.urljoin(url), callback=self.parse_details)

    def parse_details(self, response):
        yield {
            'company_name': response.css('span#ctl00_cpContent_lbl_CompanyName::text').extract_first(),
            'website': response.css('td > span#ctl00_cpContent_lbl_Website > a::text').extract_first(),
            'email': response.css('td > span#ctl00_cpContent_lbl_Email > a::text').extract_first(),
            'telephone': response.css('span#ctl00_cpContent_lbl_Tel::text').extract_first(),
            'business_enquiry': response.css('span#ctl00_cpContent_lbl_BuisnessEnquiry::text').extract_first(),
            'brand_names': response.css('span#ctl00_cpContent_lbl_Brand::text').extract_first(),
            'products_services': response.css('span#ctl00_cpContent_lbl_Product::text').extract_first(),
        }
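One thing I am wondering about is whether the __VIEWSTATE from the first page goes stale, since every one of my POSTs reuses it. A sketch of what I mean by chaining the pages one at a time instead (untested, and the "next page" selector is only my guess at the pager markup):

    import scrapy


    class FmmChainedSketch(scrapy.Spider):
        # Sketch only: same site, but the pager is followed one page at a
        # time, always reposting the hidden fields the current page served.
        name = 'FMM_chained'
        url = 'http://www.fmm.org.my/Member_List.aspx'

        def start_requests(self):
            yield scrapy.Request(url=self.url, callback=self.parse_list)

        def parse_list(self, response):
            # ...extract the member links here, exactly as in parse_list above...

            # fresh hidden fields from THIS response, in case a reused
            # __VIEWSTATE is what the server is rejecting (my assumption)
            viewstate = response.xpath('//*[@id="__VIEWSTATE"]/@value').extract_first()
            eventvalidation = response.xpath('//*[@id="__EVENTVALIDATION"]/@value').extract_first()

            # guess at the pager markup: the next-page link carries the
            # postback target inside javascript:__doPostBack('...','')
            next_href = response.xpath(
                '//a[contains(@class, "rgPageNext")]/@href').extract_first()
            if next_href and "__doPostBack" in next_href:
                target = next_href.split("'")[1]
                yield scrapy.FormRequest(
                    url=self.url,
                    formdata={
                        "__EVENTTARGET": target,
                        "__EVENTARGUMENT": "",
                        "__VIEWSTATE": viewstate,
                        "__EVENTVALIDATION": eventvalidation,
                    },
                    callback=self.parse_list,
                )

That way each request would only ever need the target and hidden fields that the current page actually serves, instead of guessing the ctlNN counter up front. I have no idea if this is the right approach, though.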
Any advice would be greatly appreciated, thanks!