I have a problem. How do I scrape data after moving to the next page? It only scrapes the first page. Here is my code:
# -*- coding: utf-8 -*-
from scrapy import Spider
from scrapy.http import Request


class PronobelSpider(Spider):
    name = 'pronobel'
    allowed_domains = ['pronobel.pl']
    start_urls = ['http://pronobel.pl/praca-opieka-niemcy/']

    def parse(self, response):
        jobs = response.xpath('//*[@class="offer offer-immediate"]')
        for job in jobs:
            title = job.xpath('.//*[@class="offer-title"]/text()').extract_first()
            start_date = job.xpath('.//*[@class="offer-attr offer-departure"]/text()').extract_first()
            place = job.xpath('.//*[@class="offer-attr offer-localization"]/text()').extract_first()
            language = job.xpath('.//*[@class="offer-attr offer-salary"]/text()').extract()[1]
            print(title)
            print(start_date)
            print(place)
            print(language)

        next_page_url = response.xpath('//*[@class="page-nav nav-next"]/a/@href').extract_first()
        absolute_next_page_url = response.urljoin(next_page_url)
        yield Request(absolute_next_page_url)
I only get data from the first page.

Answer 0 (score: 1)
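Your problem is not with crawling the next page; it is with your selectors.
First, when selecting elements by class, it is recommended to use CSS selectors.
What is happening is that the other pages have no elements with the class offer-immediate, so an XPath that compares the whole class attribute against "offer offer-immediate" matches nothing there.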
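To illustrate the difference between comparing the whole class attribute and matching a single class, here is a minimal sketch using only the standard library. The markup is hypothetical (the real pages on pronobel.pl may look different), and `has_class` is a helper invented for this example that roughly mimics what a CSS selector such as `div.offer` does.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: an assumption for illustration, not the real page source.
PAGE = """
<div class="offers-list">
  <div class="offer offer-immediate"><a class="offer-title">Job A</a></div>
  <div class="offer"><a class="offer-title">Job B</a></div>
  <div class="offer offer-standard"><a class="offer-title">Job C</a></div>
</div>
"""

root = ET.fromstring(PAGE)

# Exact comparison of the whole class attribute, as in the question's XPath.
exact = root.findall(".//*[@class='offer offer-immediate']")


def has_class(element, name):
    """Return True if `name` is one of the space-separated classes of the element."""
    return name in element.get("class", "").split()


# Token-based match: any div that carries the class "offer", whatever else it has.
by_token = [el for el in root.iter("div") if has_class(el, "offer")]

exact_titles = [el.find("a").text for el in exact]
token_titles = [el.find("a").text for el in by_token]
print(exact_titles)
print(token_titles)
```

With this sample markup, the exact comparison finds only the first offer, while the token-based match finds all three.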
I made some changes to your code; see below:

from scrapy import Spider
from scrapy.http import Request


class PronobelSpider(Spider):
    name = 'pronobel'
    allowed_domains = ['pronobel.pl']
    start_urls = ['http://pronobel.pl/praca-opieka-niemcy/']

    def parse(self, response):
        # Match every offer in the list, not only those marked offer-immediate.
        jobs = response.css('div.offers-list div.offer')
        for job in jobs:
            title = job.css('a.offer-title::text').extract_first()
            start_date = job.css('div.offer-attr.offer-departure::text').extract_first()
            place = job.css('div.offer-attr.offer-localization::text').extract_first()
            language = job.css('div.offer-attr.offer-salary::text').extract()[1]
            yield {'title': title,
                   'start_date': start_date,
                   'place': place,
                   'language': language,
                   'url': response.url}

        # Follow the link to the next page of results.
        next_page_url = response.css('li.page-nav.nav-next a::attr(href)').extract_first()
        absolute_next_page_url = response.urljoin(next_page_url)
        yield Request(absolute_next_page_url)
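One optional refinement: on the last page the next-page selector presumably matches nothing, so `extract_first()` returns `None` and the spider still builds a request from it. As far as I can tell, that request just points back at an already visited page and is dropped by the duplicate filter, but an explicit check makes the stopping condition clearer. A minimal sketch of that check, using the standard-library `urljoin`; the helper name `build_next_page_url` and the sample `href` value are assumptions for illustration:

```python
from urllib.parse import urljoin


def build_next_page_url(current_url, href):
    """Return the absolute URL of the next page, or None if there is no link."""
    if not href:
        # extract_first() returns None when the selector matches nothing,
        # which is the expected situation on the last page.
        return None
    return urljoin(current_url, href)


# Hypothetical values modelled on the URLs that appear in a crawl of this site.
page = "https://pronobel.pl/praca-opieka-niemcy"
following = build_next_page_url(page, "?page_nr=2")
last = build_next_page_url(page, None)
print(following)
print(last)
```

In the spider itself, the same idea amounts to wrapping the final `yield Request(...)` in `if next_page_url:`.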
Answer 1 (score: 0)

I also tried:
# -*- coding: utf-8 -*-
from scrapy import Spider
from scrapy.http import Request


class PronobelSpider(Spider):
    name = 'pronobel'
    allowed_domains = ['pronobel.pl']
    start_urls = ['http://pronobel.pl/praca-opieka-niemcy']

    def parse(self, response):
        jobs = response.xpath('//*[@class="offer offer-immediate"]')
        for job in jobs:
            title = job.xpath('.//*[@class="offer-title"]/text()').extract_first()
            start_date = job.xpath('.//*[@class="offer-attr offer-departure"]/text()').extract_first()
            place = job.xpath('.//*[@class="offer-attr offer-localization"]/text()').extract_first()
            language = job.xpath('.//*[@class="offer-attr offer-salary"]/text()').extract()[1]
            yield {'place': place}

        next_page_url = response.xpath('//*[@class="page-nav nav-next"]/a/@href').extract_first()
        absolute_next_page_url = response.urljoin(next_page_url)
        yield Request(absolute_next_page_url)
Output:
2019-03-20 17:58:28 [scrapy.core.engine] INFO: Spider opened
2019-03-20 17:58:28 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-03-20 17:58:28 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6025
2019-03-20 17:58:28 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://pronobel.pl/praca-opieka-niemcy> from <GET http://pronobel.pl/praca-opieka-niemcy>
2019-03-20 17:58:29 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://pronobel.pl/praca-opieka-niemcy> (referer: None)
2019-03-20 17:58:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://pronobel.pl/praca-opieka-niemcy>
{'place': u'Ratingen'}
2019-03-20 17:58:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://pronobel.pl/praca-opieka-niemcy>
{'place': u'Burg Stargard'}
2019-03-20 17:58:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://pronobel.pl/praca-opieka-niemcy>
{'place': u'Fahrenzhausen'}
2019-03-20 17:58:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://pronobel.pl/praca-opieka-niemcy>
{'place': u'Meerbusch'}
2019-03-20 17:58:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://pronobel.pl/praca-opieka-niemcy>
{'place': u'Geislingen an der Steige T\xfcrkheim/Deutschland'}
2019-03-20 17:58:29 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://pronobel.pl/praca-opieka-niemcy?page_nr=2> (referer: https://pronobel.pl/praca-opieka-niemcy)
2019-03-20 17:58:29 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://pronobel.pl/praca-opieka-niemcy?page_nr=3> (referer: https://pronobel.pl/praca-opieka-niemcy?page_nr=2)
2019-03-20 17:58:29 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET https://pronobel.pl/praca-opieka-niemcy?page_nr=3> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2019-03-20 17:58:29 [scrapy.core.engine] INFO: Closing spider (finished)