I am trying to do web crawling using Python, but I can't figure out how to change pages automatically. I found the URL pattern, but I don't know how to go to the next page automatically until the last page is reached.
The pattern is:
'http//.../sortBy=helpful&pageNumber=0'
'http//.../sortBy=helpful&pageNumber=1'
'http//.../sortBy=helpful&pageNumber=2'
'http//.../sortBy=helpful&pageNumber=3'
and so on ...
import re
from urllib.parse import urljoin

def review_next_page(page=1):
    list_url = 'https://www.amazon.com/Quest-Nutrition-Protein-Apple-2-12oz/product-reviews/B00U3RGAMW/ref=cm_cr_arp_d_paging_btm_2?ie=UTF8&showViewpoints=1&sortBy=recent&pageNumber={0}'.format(page)
    list_url = [urljoin(list_url, review_link) for review_link in ???]
    return list_url
I want the last number to increase by 1 until it reaches the end. Should I use a for loop?
Thanks in advance!
Answer 0 (score: 0)
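This does not answer the question directly, but Scrapy's CrawlSpider class and link extractors handle this easily and conveniently. You can configure a pattern that each href is matched against to decide which links to follow. In your case, it would look something like this: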
Rule(LinkExtractor(allow=r'sortBy=helpful&pageNumber=\d+$'), callback='parse_page', follow=True)
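
If you would rather stay with a plain loop, as the question suggests, here is a minimal sketch of that idea. It counts the page number upward and stops at the first page that contains no reviews. The names `URL_TEMPLATE`, `iter_review_pages`, `fetch` and `has_reviews` are hypothetical and not from the question, and the example.com URL is a placeholder: you supply your own function to download a page and your own test for "this page still has reviews", since that depends on the site's HTML.

```python
from itertools import count

# Hypothetical URL template modelled on the pattern in the question;
# replace it with the real review URL.
URL_TEMPLATE = 'https://www.example.com/product-reviews/?sortBy=helpful&pageNumber={0}'


def iter_review_pages(fetch, has_reviews, start=0, max_pages=1000):
    """Yield (url, html) for consecutive pages until one has no reviews.

    fetch(url) -> str        : assumed helper that downloads a page
    has_reviews(html) -> bool: assumed helper that detects review content
    """
    for page in count(start):
        if page - start >= max_pages:
            break  # safety cap so a misbehaving site cannot loop forever
        url = URL_TEMPLATE.format(page)
        html = fetch(url)
        if not has_reviews(html):
            break  # the first empty page marks the end
        yield url, html
```

For example, `fetch` could wrap `urllib.request.urlopen(url).read().decode()`, and `has_reviews` could check for whatever marker the review blocks carry in the page source. Passing both in as arguments keeps the loop itself independent of how pages are downloaded or parsed.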