Question

I am using Scrapy to scrape a website (link). I scraped all the data from this page using a for loop and yield:

def parse(self, response):
    self.main_cat=response.xpath('//div[@id="products_content"]/div/text()').extract()
    self.sub_cat=response.xpath('//div[@class="accordion"]/div[@class="title"]/text()').extract()
    Onclick=response.xpath('//div[@class="accordion"]/div[@class="no_title subtitle_chck"]/@onclick').extract()
    for index in range(len(Onclick)):
        sub_sub_cat=response.xpath('//div[@class="accordion"]/div[@class="no_title subtitle_chck"]/label/text()').extract_first()
        removeSearchWord=Onclick[index].replace("submitSearch(","")
        numericData=removeSearchWord.replace(");","").split(',')
        absolute_url="https://portal.orio.com/webapp/wcs/stores/servlet/SearchDisplayView?storeId=11901&catalogId=10051&langId=-150&pageView=detailed&beginIndex=0&sType=SimpleSearch&categoryId="+numericData[0]+"&showResultsPage=true&navCat="+numericData[1]+"_"+numericData[2]+"&urlLangId=-150&removeFiltersOg=ALL&sortField=name&orderBy=7"
        yield Request(absolute_url, callback=self.page)

def page(self,response):
    product_page_url=response.xpath('//td[@class="information"]/a/@href').extract()
    for url in product_page_url:
        yield Request(url, callback=self.product)

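After the last yield, which line of code would make the spider continue through all the other pages? I know some AJAX calls are involved, but I don't know how to implement them. Could you add that line of code? I have searched a lot for a solution. My previous question was about the same problem and got a good answer, but I could not follow it.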

Answer 1

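Actually, the next page URL is right there in the page. It sits on the <a> node that contains an <img> node showing the image paging_next.png.

If you look at that node, you can see that its onclick JavaScript changes the browser URL to the next page URL. You can extract it with an XPath selector and a regular expression: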

url = response.xpath('//a[contains(img/@src,"paging_next")]/@onclick').re(r"setPage\('(.+?)'")[0]
Out[1]: 'https://portal.orio.com/webapp/wcs/stores/servlet/AjaxCatalogSearchResultView?pageView=detailed&searchTermScope=&orderBy=7&categoryId=146003&beginIndex=25&pageSize=25&maxPrice=&searchType=1002&sortField=name&resultCatEntryType=&searchTerm=&sType=SimpleSearch&filterTerm=&manufacturer=&catalogId=10051&langId=-150&showResultsPage=true&storeId=11901&metaData=YnV5YWJsZToxPE1UQFNQPi1zdXBlcnNlc3Npb246KDEgMyA3KSBBTkQgcHJpY2VfU0VLXzIxOlsqIFRPICpdIEFORCAtcHJpY2VfU0VLXzIxOlsqIFRPIDBdPE1UQFNQPnB1Ymxpc2hlZDox&minPrice='

It's an ugly URL, but it works just fine in Scrapy :)
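
The extraction step can also be tried outside Scrapy. Below is a minimal sketch using only the standard library, assuming the onclick attribute has the form setPage('<url>', ...). The helper name next_page_url and the shortened sample string are made up for illustration; only the regular expression comes from the answer above.

```python
import re

# Same pattern as in the answer: capture the first quoted argument of setPage(...).
NEXT_PAGE_RE = re.compile(r"setPage\('(.+?)'")


def next_page_url(onclick):
    """Return the URL passed to setPage(), or None when there is no match."""
    if not onclick:
        return None
    match = NEXT_PAGE_RE.search(onclick)
    return match.group(1) if match else None


# Hypothetical, shortened onclick value (the real URL is much longer).
sample = (
    "setPage('https://portal.orio.com/webapp/wcs/stores/servlet/"
    "AjaxCatalogSearchResultView?beginIndex=25&pageSize=25'); return false;"
)

print(next_page_url(sample))
print(next_page_url(None))
```

Returning None instead of indexing with [0] avoids an IndexError on the last page, where the paging_next link is presumably missing. Inside a spider, the selector's re_first() method behaves the same way: it returns None when nothing matches.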

The general pagination logic looks like this:

import logging

from scrapy import Request

def parse(self, response):
    # follow every product link on the current page
    product_urls = ...
    for url in product_urls:
        yield Request(url, self.parse_product)
    # next page
    next_page = ...
    if next_page:
        yield Request(next_page, self.parse)
    else:
        self.log('oh no, last page was: {}'.format(response.url), level=logging.INFO)
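
To see why this pattern stops on its own, here is a small simulation that does not use Scrapy at all. It is only a sketch: FAKE_SITE, parse and run are hypothetical stand-ins, with run playing the role of Scrapy's scheduler and each ("request", url) tuple playing the role of a yielded Request.

```python
from collections import deque

# Hypothetical stand-in for the site: each page lists products and may
# name a next page. None marks the last page.
FAKE_SITE = {
    "page1": {"products": ["p1", "p2"], "next": "page2"},
    "page2": {"products": ["p3"], "next": "page3"},
    "page3": {"products": ["p4", "p5"], "next": None},
}


def parse(url):
    """Mimic the callback: yield the products first, then the next-page request."""
    page = FAKE_SITE[url]
    for product in page["products"]:
        yield ("product", product)
    if page["next"]:
        yield ("request", page["next"])


def run(start_url):
    """Tiny driver: keep calling parse() until no new requests are queued."""
    queue = deque([start_url])
    products = []
    while queue:
        for kind, value in parse(queue.popleft()):
            if kind == "request":
                queue.append(value)
            else:
                products.append(value)
    return products


print(run("page1"))  # ['p1', 'p2', 'p3', 'p4', 'p5']
```

The crawl ends because the last page yields no further request, which is the same reason the else branch above is reached exactly once.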
