For some reason my spider won't move on to the next page. It doesn't give me any errors, but it only scrapes one page. I have code very similar to this for another site and there it works fine.
from scrapy.spiders import CrawlSpider
from scrapy import Request


class JobsSpider(CrawlSpider):
    name = 'jobs'
    allowed_domains = ['https://newyork.craigslist.org/search/egr']
    start_urls = ['https://newyork.craigslist.org/search/egr/']

    def parse(self, response):
        jobs = response.css(".result-info")
        for job in jobs:
            Dates = response.css(".result-date").extract_first()
            Titles = job.css('.hdrlnk::text').extract_first()
            address = job.css(".result-hood::text").extract_first()
            relative_url = job.css(".hdrlnk::attr('href')").extract_first()

            yield {
                "Date": Dates,
                "Title": Titles,
                "Address": address,
                "Link": relative_url
            }

        url = response.xpath('//*[@id="searchform"]/div[5]/div[3]/span[2]/a[3]/@href').extract_first()
        absurl = response.urljoin(url)
        if url:
            yield Request(url=absurl, callback=self.parse)
        else:
            print("No next page found")
Answer 0 (score: 0):
You have set allowed_domains too restrictively: it contains a full URL rather than a domain, so the next-page request is not considered allowed for that "domain" and gets filtered.

So simply change

allowed_domains = ['https://newyork.craigslist.org/search/egr']

to

allowed_domains = ['craigslist.org']
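For reference, a minimal sketch of the spider with only that change applied (plus the date selector scoped to each result with ::text, which I'm assuming was the intent); the CSS and XPath selectors are otherwise the asker's own:

from scrapy.spiders import CrawlSpider
from scrapy import Request


class JobsSpider(CrawlSpider):
    name = 'jobs'
    # Bare domain only: OffsiteMiddleware compares the host of each request
    # against these entries, so a full URL here makes every followed
    # request get filtered as "offsite".
    allowed_domains = ['craigslist.org']
    start_urls = ['https://newyork.craigslist.org/search/egr/']

    def parse(self, response):
        for job in response.css(".result-info"):
            yield {
                # Scoped to the job node; the question read the date off
                # `response`, which repeats the first date on the page.
                "Date": job.css(".result-date::text").extract_first(),
                "Title": job.css(".hdrlnk::text").extract_first(),
                "Address": job.css(".result-hood::text").extract_first(),
                "Link": job.css(".hdrlnk::attr(href)").extract_first(),
            }

        # Same "next page" XPath as in the question; with the domain fixed,
        # this request is no longer dropped by the offsite filter.
        next_page = response.xpath(
            '//*[@id="searchform"]/div[5]/div[3]/span[2]/a[3]/@href'
        ).extract_first()
        if next_page:
            yield Request(response.urljoin(next_page), callback=self.parse)

Note that OffsiteMiddleware only logs filtered requests at DEBUG level, which is why the spider stopped after one page without reporting any error.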