Scrapy ASP.NET pagination

Date: 2019-03-17 10:44:59

Tags: python web-scraping scrapy

Hi everyone, I am new to programming, so please let me know if I have left out any important information.

I am currently scraping this website with Scrapy, but I am having trouble with pagination because the site uses __doPostBack:
  

href="javascript:__doPostBack('ctl00$cpContent$rg_MemberList$ctl00$ctl02$ctl00$ctl05','')"
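For illustration, the event target and argument can be pulled out of an href like the one above with a regular expression. This is a minimal sketch using only the standard library; `parse_postback` and `POSTBACK_RE` are hypothetical names, not part of Scrapy or of the site.

```python
import re

# Matches __doPostBack('<target>','<argument>') inside an href attribute.
POSTBACK_RE = re.compile(r"__doPostBack\(\s*'([^']*)'\s*,\s*'([^']*)'\s*\)")


def parse_postback(href):
    """Return (event_target, event_argument), or None if href is not a postback link."""
    match = POSTBACK_RE.search(href)
    if match is None:
        return None
    return match.group(1), match.group(2)


href = "javascript:__doPostBack('ctl00$cpContent$rg_MemberList$ctl00$ctl02$ctl00$ctl05','')"
target, argument = parse_postback(href)
print(target)
print(repr(argument))
```

Reading the target from the pager links of each response avoids having to guess the control ID for a given page.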

The final ctl05 part is basically the only thing that changes, increasing in increments of 2, and so far I can scrape up to page 11:

  

ctl00$cpContent$rg_MemberList$ctl00$ctl02$ctl00$ctl25

However, Scrapy does not seem able to reach page 12, and when inspecting the element in Chrome I noticed that the __EVENTTARGET for page 12 goes back to

  

ctl00$cpContent$rg_MemberList$ctl00$ctl02$ctl00$ctl09
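The page 12 target falling back to ctl09 suggests that the suffix identifies a slot in the pager currently on screen rather than an absolute page number. That is an interpretation, not something the site confirms. The sketch below only encodes the pattern described so far, assuming ctl05 belongs to page 1 and each later page adds 2; `first_window_target` is a hypothetical helper.

```python
PREFIX = "ctl00$cpContent$rg_MemberList$ctl00$ctl02$ctl00$ctl"


def first_window_target(page_number):
    """Event target for a page link in the first pager window.

    Assumption: page 1 is ctl05 and every following page adds 2, which gives
    ctl25 for page 11. Pages beyond 11 do not follow this formula.
    """
    if not 1 <= page_number <= 11:
        raise ValueError("only pages 1 to 11 follow the ctl05..ctl25 pattern")
    return "{}{:02d}".format(PREFIX, 2 * page_number + 3)


for page in (1, 2, 11):
    print(page, first_window_target(page))
```

If that reading is right, a fixed formula over all pages cannot work, and the target for page 12 and later has to be read from the pager shown on the page that was just loaded.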

So I tried adding __EVENTTARGET, __EVENTARGUMENT, __VIEWSTATE and __EVENTVALIDATION as form data, but it still does not work.
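One possible cause, offered as an assumption rather than a confirmed diagnosis: ASP.NET Web Forms pages normally issue fresh __VIEWSTATE and __EVENTVALIDATION values with every response, so a postback built from the first page's values may not be accepted for a later pager window. The sketch below collects the hidden fields from whatever HTML was just received and adds the event target to them. It uses only the standard library; `HiddenInputParser`, `build_postback_formdata` and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser


class HiddenInputParser(HTMLParser):
    """Collects name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        is_hidden = (attributes.get("type") or "").lower() == "hidden"
        if is_hidden and attributes.get("name"):
            self.fields[attributes["name"]] = attributes.get("value") or ""


def build_postback_formdata(html, event_target, event_argument=""):
    """Form data for a postback, using the hidden fields of the given page."""
    parser = HiddenInputParser()
    parser.feed(html)
    formdata = dict(parser.fields)
    formdata["__EVENTTARGET"] = event_target
    formdata["__EVENTARGUMENT"] = event_argument
    return formdata


# Made-up stand-in for the body of the most recent response.
sample_html = """
<form id="aspnetForm">
  <input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="state-from-this-page" />
  <input type="hidden" name="__EVENTVALIDATION" id="__EVENTVALIDATION" value="validation-from-this-page" />
  <input type="text" name="search" value="ignored" />
</form>
"""

formdata = build_postback_formdata(
    sample_html, "ctl00$cpContent$rg_MemberList$ctl00$ctl02$ctl00$ctl09"
)
print(sorted(formdata))
```

In a Scrapy spider the same idea would mean requesting pages one after another from inside the list callback, so that each request carries the state of the page before it. scrapy.FormRequest.from_response(response, formdata={...}, dont_click=True) gathers a form's hidden fields in a similar way.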

import scrapy


class FmmSpider(scrapy.Spider):
    name = 'FMM'
    url = 'http://www.fmm.org.my/Member_List.aspx'

    def start_requests(self):
        yield scrapy.Request(url=self.url, callback=self.parse_form)

    def parse_form(self, response):
        # Hidden ASP.NET state fields, read once from the first page
        selector = scrapy.Selector(response=response)
        VIEWSTATE = selector.xpath('//*[@id="__VIEWSTATE"]/@value').extract_first()
        EVENTVALIDATION = selector.xpath('//*[@id="__EVENTVALIDATION"]/@value').extract_first()

        for page_number in range(50):
            formdata = {
                # change pages here
                "__EVENTTARGET": "ctl00$cpContent$rg_MemberList$ctl00$ctl02$ctl00$ctl{page_number:02d}".format(page_number=page_number),
                "__EVENTARGUMENT": "",
                "__VIEWSTATE": VIEWSTATE,
                "__EVENTVALIDATION": EVENTVALIDATION,
            }
            yield scrapy.FormRequest(url=self.url, formdata=formdata, callback=self.parse_list)

    def parse_list(self, response):
        # Follow every member link on the listing page
        urls = response.css('table.memberSearchTable > tr > td > div > a::attr(href)').extract()
        for url in urls:
            url = response.urljoin(url)
            yield scrapy.Request(url=url, callback=self.parse_details)

    def parse_details(self, response):
        # One item per member detail page
        yield {
            'company_name': response.css('span#ctl00_cpContent_lbl_CompanyName::text').extract_first(),
            'website': response.css('td > span#ctl00_cpContent_lbl_Website > a::text').extract_first(),
            'email': response.css('td > span#ctl00_cpContent_lbl_Email > a::text').extract_first(),
            'telephone': response.css('span#ctl00_cpContent_lbl_Tel::text').extract_first(),
            'business_enquiry': response.css('span#ctl00_cpContent_lbl_BuisnessEnquiry::text').extract_first(),
            'brand_names': response.css('span#ctl00_cpContent_lbl_Brand::text').extract_first(),
            'products_services': response.css('span#ctl00_cpContent_lbl_Product::text').extract_first(),
        }

Any suggestions would be greatly appreciated. Thank you!

0 answers:

No answers