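Yesterday I started learning Scrapy to extract some information, but I can't seem to get pagination right. I followed a tutorial, but I think this site uses a different pagination system.
Most pagination markup has an element with class="next", but this one doesn't. It only has a list in which the current page is marked up as a span with the class current: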
<div class="pagination">
<ul class="page-numbers">
<li><span class='page-numbers current'>1</span></li>
<li><a class='page-numbers' href='https://www.musicfestivalwizard.com/all-festivals/page/2/'>2</a></li>
<li><a class='page-numbers' href='https://www.musicfestivalwizard.com/all-festivals/page/3/'>3</a></li>
<li><a class='page-numbers' href='https://www.musicfestivalwizard.com/all-festivals/page/4/'>4</a></li>
<li><a class='page-numbers' href='https://www.musicfestivalwizard.com/all-festivals/page/5/'>5</a></li>
</ul>
</div>
Here is my spider:
import scrapy


class MfwspiderSpider(scrapy.Spider):
    name = 'mfwspider'
    allowed_domains = ['www.musicfestivalwizard.com']
    start_urls = ['https://www.musicfestivalwizard.com/all-festivals/',]

    def parse(self, response):
        pagenumber = 1
        for festival in response.css("span.festivalleft"):
            print("-------")
            yield {
                'date' : festival.css(".festivaldate::text").extract(),
                'location' : festival.css(".festivallocation::text").extract_first(),
                'title' : festival.css(".festivaltitle > a::text").extract_first(),
            }
        next_page = start_urls[0] + str(pagenumber) + "/"
        print(next_page)
        print("^^^^^^^^^^^^^^^^^^")
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse,)
As you can see, I added some print() statements for debugging. Here is my console output:
scrapy crawl mfwspider
2018-05-06 00:21:45 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: lineups)
2018-05-06 00:21:45 [scrapy.utils.log] INFO: Versions: lxml 4.2.1.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 18.4.0, Python 3.6.4 (v3.6.4:d48ecebad5, Dec 18 2017, 21:07:28) - [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)], pyOpenSSL 17.5.0 (OpenSSL 1.1.0h 27 Mar 2018), cryptography 2.2.2, Platform Darwin-17.5.0-x86_64-i386-64bit
2018-05-06 00:21:45 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'lineups', 'NEWSPIDER_MODULE': 'lineups.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['lineups.spiders']}
2018-05-06 00:21:45 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats']
2018-05-06 00:21:46 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-05-06 00:21:46 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-05-06 00:21:46 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-05-06 00:21:46 [scrapy.core.engine] INFO: Spider opened
2018-05-06 00:21:46 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-05-06 00:21:46 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6024
2018-05-06 00:21:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.musicfestivalwizard.com/robots.txt> (referer: None)
2018-05-06 00:21:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.musicfestivalwizard.com/all-festivals/> (referer: None)
-------
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/>
{'date': ['May 3-6, 2018'], 'location': 'Numero Uno, Malta', 'title': 'Lost And Found Malta 2018'}
-------
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/>
{'date': ['April 27-May 6, 2018'], 'location': 'New Orleans, LA', 'title': 'New Orleans Jazz Festival 2018'}
-------
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/>
{'date': ['May 2-May 6, 2018'], 'location': 'West Palm Beach, FL', 'title': 'Sunfest 2018'}
-------
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/>
{'date': ['May 4-6, 2018'], 'location': 'Memphis, TN', 'title': 'Beale Street Music Festival 2018'}
-------
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/>
{'date': ['May 5-6, 2018'], 'location': 'Liverpool, UK', 'title': 'Liverpool Sound City 2018'}
-------
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/>
{'date': ['May 4–6, 2018'], 'location': 'Atlanta, GA', 'title': 'Shaky Knees Festival 2018'}
-------
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/>
{'date': ['May 4-6, 2018'], 'location': 'Concord, NC', 'title': 'Carolina Rebellion 2018'}
-------
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/>
{'date': ['May 4-6, 2018'], 'location': 'Winooski, VT', 'title': 'Waking Windows 2018'}
-------
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/>
{'date': ['May 4-6, 2018'], 'location': 'Texas Tour', 'title': 'JMBLYA 2018'}
-------
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/>
{'date': ['May 3-6, 2018'], 'location': 'San Diego, CA', 'title': 'West Coast Weekender 2018'}
-------
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/>
{'date': ['April 27-May 12, 2017'], 'location': 'Australia Tour', 'title': 'Groovin’ The Moo 2018'}
-------
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/>
{'date': ['May 7-13. 2018'], 'location': 'Toronto, ON', 'title': 'Canadian Music Week 2018'}
-------
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/>
{'date': ['May 11-13, 2018'], 'location': 'London, UK', 'title': 'Peckham Rye 2018'}
-------
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/>
{'date': ['May 12-13, 2018'], 'location': 'Somerset, WI', 'title': 'Northern Invasion 2018'}
-------
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/>
{'date': ['May 6-13, 2018'], 'location': 'Lyon, France', 'title': 'Nuits Sonores 2018'}
https://www.musicfestivalwizard.com/all-festivals/page/2/
^^^^^^^^^^^^^^^^^^
2018-05-06 00:21:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.musicfestivalwizard.com/all-festivals/page/2/> (referer: https://www.musicfestivalwizard.com/all-festivals/)
-------
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/page/2/>
{'date': ['May 12-13, 2018'], 'location': 'Chiba, Japan', 'title': 'Electric Daisy Carnival Japan 2018'}
-------
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/page/2/>
{'date': ['May 11-13, 2018'], 'location': 'Arcosanti, AZ', 'title': 'FORM Arcosanti Festival 2018'}
-------
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/page/2/>
{'date': ['May 11-13, 2018'], 'location': 'Atlanta, GA', 'title': 'Shaky Beats Festival 2018'}
-------
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/page/2/>
{'date': ['May 11-13, 2018'], 'location': 'Miami, FL', 'title': 'Rolling Loud Festival 2018'}
-------
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/page/2/>
{'date': ['May 17-19, 2018'], 'location': 'Brighton, UK', 'title': 'The Great Escape 2018'}
-------
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/page/2/>
{'date': ['May 18-20, 2018'], 'location': 'Gulf Shores, AL', 'title': 'Hangout Fest 2018'}
-------
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/page/2/>
{'date': ['May 18-20, 2018'], 'location': 'Saint-Laurent-de-Cuves, France', 'title': 'Papillons De Nuit 2018'}
-------
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/page/2/>
{'date': ['June 19-20, 2018'], 'location': 'Margny-lès-Compiègne, France', 'title': 'Imaginarium Festival 2018'}
-------
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/page/2/>
{'date': [' May 18-20, 2018'], 'location': 'Columbus, OH', 'title': 'Rock on the Range 2018'}
-------
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/page/2/>
{'date': ['May 17-20, 2018'], 'location': 'Durham, NC', 'title': 'Moogfest 2018'}
-------
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/page/2/>
{'date': ['May 19-20, 2018'], 'location': 'Paris, France', 'title': 'Marvellous Island Festival 2018'}
-------
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/page/2/>
{'date': ['May 18-20, 2018'], 'location': 'Montreal, QC', 'title': 'Pouzza Fest 2018'}
-------
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/page/2/>
{'date': ['May 18-20, 2018'], 'location': 'Houthalen-Helchteren, Belgium', 'title': 'Extrema Outdoor Belgium 2018'}
-------
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/page/2/>
{'date': ['May 17-20, 2018'], 'location': 'Joshua Tree, CA', 'title': 'Joshua Tree Festival Spring 2018'}
-------
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/page/2/>
{'date': ['May 18-21, 2018'], 'location': 'Las Vegas, NV', 'title': 'Electric Daisy Carnival Vegas 2018'}
https://www.musicfestivalwizard.com/all-festivals/
^^^^^^^^^^^^^^^^^^
2018-05-06 00:21:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.musicfestivalwizard.com/all-festivals/> (referer: https://www.musicfestivalwizard.com/all-festivals/page/2/)
-------
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/>
{'date': ['May 3-6, 2018'], 'location': 'Numero Uno, Malta', 'title': 'Lost And Found Malta 2018'}
-------
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/>
{'date': ['April 27-May 6, 2018'], 'location': 'New Orleans, LA', 'title': 'New Orleans Jazz Festival 2018'}
-------
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/>
{'date': ['May 2-May 6, 2018'], 'location': 'West Palm Beach, FL', 'title': 'Sunfest 2018'}
-------
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/>
{'date': ['May 4-6, 2018'], 'location': 'Memphis, TN', 'title': 'Beale Street Music Festival 2018'}
-------
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/>
{'date': ['May 5-6, 2018'], 'location': 'Liverpool, UK', 'title': 'Liverpool Sound City 2018'}
-------
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/>
{'date': ['May 4–6, 2018'], 'location': 'Atlanta, GA', 'title': 'Shaky Knees Festival 2018'}
-------
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/>
{'date': ['May 4-6, 2018'], 'location': 'Concord, NC', 'title': 'Carolina Rebellion 2018'}
-------
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/>
{'date': ['May 4-6, 2018'], 'location': 'Winooski, VT', 'title': 'Waking Windows 2018'}
-------
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/>
{'date': ['May 4-6, 2018'], 'location': 'Texas Tour', 'title': 'JMBLYA 2018'}
-------
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/>
{'date': ['May 3-6, 2018'], 'location': 'San Diego, CA', 'title': 'West Coast Weekender 2018'}
-------
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/>
{'date': ['April 27-May 12, 2017'], 'location': 'Australia Tour', 'title': 'Groovin’ The Moo 2018'}
-------
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/>
{'date': ['May 7-13. 2018'], 'location': 'Toronto, ON', 'title': 'Canadian Music Week 2018'}
-------
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/>
{'date': ['May 11-13, 2018'], 'location': 'London, UK', 'title': 'Peckham Rye 2018'}
-------
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/>
{'date': ['May 12-13, 2018'], 'location': 'Somerset, WI', 'title': 'Northern Invasion 2018'}
-------
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/>
{'date': ['May 6-13, 2018'], 'location': 'Lyon, France', 'title': 'Nuits Sonores 2018'}
https://www.musicfestivalwizard.com/all-festivals/page/2/
^^^^^^^^^^^^^^^^^^
2018-05-06 00:21:47 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET https://www.musicfestivalwizard.com/all-festivals/page/2/> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2018-05-06 00:21:47 [scrapy.core.engine] INFO: Closing spider (finished)
2018-05-06 00:21:47 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1092,
'downloader/request_count': 4,
'downloader/request_method_count/GET': 4,
'downloader/response_bytes': 48590,
'downloader/response_count': 4,
'downloader/response_status_count/200': 4,
'dupefilter/filtered': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2018, 5, 5, 22, 21, 47, 746610),
'item_scraped_count': 45,
'log_count/DEBUG': 51,
'log_count/INFO': 7,
'memusage/max': 66899968,
'memusage/startup': 66899968,
'request_depth_max': 3,
'response_received_count': 4,
'scheduler/dequeued': 3,
'scheduler/dequeued/memory': 3,
'scheduler/enqueued': 3,
'scheduler/enqueued/memory': 3,
'start_time': datetime.datetime(2018, 5, 5, 22, 21, 46, 20038)}
2018-05-06 00:21:47 [scrapy.core.engine] INFO: Spider closed (finished)
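I think I need to select the li that comes after the one holding the current-page span. How can I do that in Scrapy? Is there a better way to do this?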
Answer 0 (score: 2)
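You can use an XPath expression to extract the next page.
The following XPath finds the li element of the current page, which is identified by the class of the element it contains. It then takes the href from the link in the next li element.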
xpath_next_page = './/li/*[@class="page-numbers current"]/parent::li/following-sibling::li[1]/a/@href'
next_page = response.xpath(xpath_next_page).extract_first()
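I tested this XPath on the site and it appears to work well. However, I had to add a DOWNLOAD_DELAY so that my requests were not rejected while paging through all the pages.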
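One way the delay could be configured is in the project's settings.py. The values below are assumptions for illustration only; the answer does not say which delay was used, so they would need tuning against the site.

```python
# settings.py (project-level Scrapy settings)

# Seconds to wait between consecutive requests to the same site.
# 1.0 is an assumed starting value, not the one used in the answer.
DOWNLOAD_DELAY = 1.0

# Optionally let Scrapy adapt the delay to the server's response times.
AUTOTHROTTLE_ENABLED = True
```

The same keys can also be placed in a spider's custom_settings dictionary if the delay should apply to this spider only.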