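I am trying to scrape this website (which has multiple pages) with Scrapy. The problem is that I can't find the next-page URL. Do you have any ideas on how to scrape a website with multiple pages using Scrapy, or on how to fix the error in my code?
I tried the following code, but it doesn't work: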
import datetime
import string

import scrapy


class AbcdspiderSpider(scrapy.Spider):
    """
    Class docstring
    """
    name = 'abcdspider'
    allowed_domains = ['abcd-terroir.smartrezo.com']
    alphabet = list(string.ascii_lowercase)
    url = "https://abcd-terroir.smartrezo.com/n31-france/annuaireABCD.html?page=1&spe=1&anIDS=31&search="
    start_urls = [url + letter for letter in alphabet]
    main_url = "https://abcd-terroir.smartrezo.com/n31-france/"
    crawl_datetime = str(datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"))
    start_time = datetime.datetime.now()

    def parse(self, response):
        self.crawler.stats.set_value("start_time", self.start_time)
        try:
            page = response.xpath('//div[@class="pageStuff"]/span/text()').get()
            # get_num_page is a helper that is not shown in the question
            page_max = get_num_page(page)
            for index in range(page_max):
                producer_list = response.xpath('//div[@class="clearfix encart_ann"]/@onclick').getall()
                for producer in producer_list:
                    link_producer = self.main_url + producer
                    yield scrapy.Request(url=link_producer, callback=self.parse_details)
                next_page_url = "/annuaireABCD.html?page={}&spe=1&anIDS=31&search=".format(index)
                if next_page_url is not None:
                    yield scrapy.Request(response.urljoin(self.main_url + next_page_url))
        except Exception as e:
            self.crawler.stats.set_value("error", e.args)
I get this error:
'error': ('range() integer end argument expected, got unicode.',)
Answer 0 (score: 2)
The error is here:
page = response.xpath('//div[@class="pageStuff"]/span/text()').get()
page_max = get_num_page(page)
The range function expects an integer (1, 2, 3, 4, etc.), not a Unicode string such as "Page 1 / 403".
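To make the failure concrete, here is a minimal sketch. It assumes Python 3, where the same mistake raises a TypeError with different wording than the Python 2 message in the question. The value of `page_text` and the helper `try_range` are illustrative only; the text is not output captured from the site.

```python
page_text = "Page 1 / 403"  # assumed label text, for illustration only


def try_range(value):
    """Return True if range() accepts the value, False if it raises TypeError."""
    try:
        range(value)
        return True
    except TypeError:
        return False


string_ok = try_range(page_text)  # the raw label is a string, so range() rejects it
integer_ok = try_range(403)       # an integer is accepted
print(string_ok, integer_ok)
```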
My suggestion for fixing the range error is:
page = response.xpath('//div[@class="pageStuff"]/span/text()').get().split('/ ')[1]
for index in range(int(page)):
    # your actions
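Building on that suggestion, the sketch below shows one possible way to make the page-count parsing more tolerant and to derive the URLs of the remaining pages from the current one. It is written for Python 3 and uses only the standard library. The helper names `parse_page_count` and `page_urls` are hypothetical, and the code assumes that the label looks like "Page 1 / 403" and that page numbering starts at 1; neither assumption has been checked against the site.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def parse_page_count(label):
    """Return the total page count from a label such as 'Page 1 / 403'."""
    match = re.search(r"/\s*(\d+)", label or "")
    # Fall back to a single page when the label is missing or has another shape.
    return int(match.group(1)) if match else 1


def page_urls(current_url, page_max):
    """Build the URLs for pages 2..page_max from the URL of page 1."""
    parts = urlsplit(current_url)
    # keep_blank_values preserves parameters such as an empty 'search='.
    query = dict(parse_qsl(parts.query, keep_blank_values=True))
    urls = []
    for page in range(2, page_max + 1):
        query["page"] = str(page)
        urls.append(urlunsplit(parts._replace(query=urlencode(query))))
    return urls
```

Inside the spider, each URL returned by `page_urls(response.url, page_max)` could then be passed to `scrapy.Request`. Using a separate callback for those follow-up pages (one that only extracts producers) would avoid re-reading the page count and re-queuing every page from each response.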