I also want to scrape content from the next page, but the spider never moves on to it. My code is:

    import scrapy


    class AggregatorSpider(scrapy.Spider):
        name = 'aggregator'
        allowed_domains = ['startech.com.bd/component/processor']
        start_urls = ['https://startech.com.bd/component/processor']

        def parse(self, response):
            processor_details = response.xpath('//*[@class="col-xs-12 col-md-4 product-layout grid"]')
            for processor in processor_details:
                name = processor.xpath('.//h4/a/text()').extract_first()
                price = processor.xpath('.//*[@class="price space-between"]/span/text()').extract_first()
                print('\n')
                print(name)
                print(price)
                print('\n')
            next_page_url = response.xpath('//*[@class="pagination"]/li/a/@href').extract_first()
            # absolute_next_page_url = response.urljoin(next_page_url)
            yield scrapy.Request(next_page_url)

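I did not use urljoin because next_page_url already gives me the full URL. I also tried passing dont_filter=True to the Request I yield, which left me in an infinite loop on the first page. The message I get in the terminal is:

    [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.startech.com.bd': <GET https://www.startech.com.bd/component/processor?page=2>
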
Answer 0 (score: 2)
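This happens because your allowed_domains value is wrong: it must contain domain names only, not a URL path. Use allowed_domains = ['www.startech.com.bd'] instead (see the doc). With the path included, the entry never matches the host www.startech.com.bd, so the offsite middleware drops the request for page 2.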
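To make the failure concrete, here is a minimal sketch of a hostname check in the spirit of the offsite filter. It is a simplified illustration written for this answer, not Scrapy's actual implementation, and the function name is_onsite is hypothetical. It assumes a request counts as on-site when its host equals an allowed domain or is a subdomain of one.

```python
from urllib.parse import urlparse


def is_onsite(url, allowed_domains):
    """Return True if the URL's host equals an allowed domain or is a subdomain of it.

    Hypothetical helper for illustration only; not part of Scrapy.
    """
    host = (urlparse(url).hostname or "").lower()
    return any(
        host == domain.lower() or host.endswith("." + domain.lower())
        for domain in allowed_domains
    )


next_page = "https://www.startech.com.bd/component/processor?page=2"

# The value from the question contains a path, so it can never match a hostname.
print(is_onsite(next_page, ["startech.com.bd/component/processor"]))
# A bare host name matches.
print(is_onsite(next_page, ["www.startech.com.bd"]))
# Under the subdomain assumption above, the registered domain matches too.
print(is_onsite(next_page, ["startech.com.bd"]))
```

Under these assumptions only the first call reports the request as off-site, which is consistent with the "Filtered offsite request" line in your log.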
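You can also change your next-page selector so that it does not send you back to the first page. Your XPath takes the first link in the pagination block, which is why dont_filter=True looped on page 1; the version below takes the last link instead: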

    import scrapy


    class AggregatorSpider(scrapy.Spider):
        name = 'aggregator'
        # Domain names only: no scheme and no path.
        allowed_domains = ['www.startech.com.bd']
        start_urls = ['https://startech.com.bd/component/processor']

        def parse(self, response):
            processor_details = response.xpath('//*[@class="col-xs-12 col-md-4 product-layout grid"]')
            for processor in processor_details:
                name = processor.xpath('.//h4/a/text()').extract_first()
                price = processor.xpath('.//*[@class="price space-between"]/span/text()').extract_first()
                # Yield an item instead of printing so Scrapy can export it.
                yield {'name': name, 'price': price}

            # Take the last pagination link rather than the first one.
            next_page_url = response.css('.pagination li:last-child a::attr(href)').extract_first()
            if next_page_url:
                yield scrapy.Request(next_page_url)
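
On the urljoin point from the question: joining is harmless even when the href is already absolute, so leaving response.urljoin(next_page_url) in place would also cover a site that switches to relative links. The sketch below uses the standard library's urllib.parse.urljoin to show this; the helper name absolute_next_url and the relative href '?page=2' are made up for illustration.

```python
from urllib.parse import urljoin


def absolute_next_url(page_url, href):
    """Resolve a pagination href against the URL of the page it was found on.

    Hypothetical helper for illustration only.
    """
    return urljoin(page_url, href)


page_url = "https://www.startech.com.bd/component/processor"
full_href = "https://www.startech.com.bd/component/processor?page=2"

# An absolute href comes back unchanged.
print(absolute_next_url(page_url, full_href))
# A relative href (assumed here) resolves to the same absolute URL.
print(absolute_next_url(page_url, "?page=2"))
```

Both calls print the same page 2 URL, so keeping the join costs nothing when the site already returns full URLs.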