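I'm trying to extract the link to each RV unit's detail page from these search results, along with the link to the next page of results, so that I can collect a link for every RV unit listed on the site.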
import scrapy


class cwscrape(scrapy.Spider):
    name = 'rvlinks'
    start_urls = ['https://rv.campingworld.com/searchresults?condition=new_used&custompricerange=true&custompaymentrange=true&sort=featured_asc&zipsearch=true&search_mode=advanced&locations=nationwide']

    def parse(self, response):
        # Yield the full name of each RV listed on the results page.
        for rvname in response.xpath("//div[@class='title']"):
            yield {'rv_full_name': rvname.xpath(".//span[@itemprop='name']/text()").extract_first()}

        # Follow the first link in the pagination block, if there is one.
        next_page = response.xpath(".//div[@class='pagination-wrap']/a/@href").extract_first()
        if next_page is not None:
            next_page_link = response.urljoin(next_page)
            yield scrapy.Request(url=next_page_link, callback=self.parse)
An example URL for a unit's detail page:
https://rv.campingworld.com/rvdetails/new-class-c-rvs/2019-thor-freedom-elite-26he-front-living-60k-BKY1571461
Answer 0 (score: 0)
I tried your code in the scrapy shell and everything looks fine:
In [5]: response.xpath("//div[@class='title']//span[@itemprop='name']/text()").extract()
Out[5]:
[u'2019 THOR FREEDOM ELITE 22HEC',
u'2018 THOR GEMINI 23TR',
u'2018 THOR GEMINI 23TK',
u'2019 THOR FREEDOM ELITE 24HE',
u'2019 WINNEBAGO MINNIE WINNIE 22R',
u'2019 WINNEBAGO MINNIE WINNIE 22M',
u'2019 WINNEBAGO OUTLOOK 27D',
u'2019 THOR FREEDOM ELITE 28FE',
u'2019 WINNEBAGO MINNIE WINNIE 25B',
u'2019 THOR FREEDOM ELITE 28FE',
u'2019 WINNEBAGO OUTLOOK 31N',
u'2019 THOR QUANTUM RC25',
u'2018 THOR SYNERGY JR24',
u'2019 WINNEBAGO MINNIE WINNIE 26A',
u'2019 THOR QUANTUM KM24',
u'2019 WINNEBAGO MINNIE WINNIE 31G',
u'2019 THOR SYNERGY 24SJ',
u'2019 WINNEBAGO VIEW 24G',
u'2019 WINNEBAGO VIEW 24V',
u'2019 WINNEBAGO OUTLOOK 22E']
In [6]: response.xpath(".//div[@class='pagination-wrap']/a/@href").get()
Out[6]: u'https://rv.campingworld.com/searchresults?condition=new_used&custompricerange=true&custompaymentrange=true&sort=featured_asc&zipsearch=true&search_mode=advanced&locations=nationwide&scpc=&make=&landingMake=0&page=1'
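One thing the output above hints at: the first href in the pagination block is a `page=1` link, so `extract_first()` may keep returning the same URL on every page, and Scrapy's default duplicate filter would then drop the repeated request. A minimal sketch of an alternative, assuming Python 3 and assuming the site pages through results with the `page` query parameter seen in Out[6], is to build the URL for a given page number explicitly. The helper name `with_page` is hypothetical and not part of Scrapy.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_page(url, page):
    """Return url with its 'page' query parameter set to the given number."""
    parts = urlsplit(url)
    # keep_blank_values=True preserves empty parameters such as 'scpc=' and 'make='.
    query = [(key, value)
             for key, value in parse_qsl(parts.query, keep_blank_values=True)
             if key != 'page']
    # Assumption: the site reads the page number from a parameter named 'page'.
    query.append(('page', str(page)))
    return urlunsplit(parts._replace(query=urlencode(query)))
```

Inside `parse`, a request such as `scrapy.Request(with_page(response.url, n), callback=self.parse)` could then walk the pages and stop once a page yields no results; how the page number `n` is tracked is left open here.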
What kind of problem are you running into?
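If the goal is the detail-page links rather than the names, a minimal sketch, assuming every detail URL contains `/rvdetails/` as the example URL in the question does, is to filter the page's hrefs on that path segment. The helper name `detail_links` is hypothetical; in the spider its input would come from something like `response.xpath('//a/@href').getall()`, with `response.url` as the base.

```python
from urllib.parse import urljoin


def detail_links(hrefs, base_url):
    """Return absolute, de-duplicated detail-page URLs in first-seen order."""
    seen = set()
    links = []
    for href in hrefs:
        # Assumption: detail pages are the only links containing '/rvdetails/'.
        if '/rvdetails/' not in href:
            continue
        url = urljoin(base_url, href)  # resolve relative hrefs against the page URL
        if url not in seen:
            seen.add(url)
            links.append(url)
    return links
```

Each returned URL could then be yielded as an item, for example `yield {'rv_link': url}`, or passed to a second callback that parses the detail page itself.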