I have a very basic spider:
import scrapy


class CdubotSpider(scrapy.Spider):
    name = 'cdubot'
    start_urls = ['https://union.fsu.edu/up/upcoming-events/']
    custom_settings = {
        'FEED_URI': 'output/cduoutput.json'
    }

    def parse(self, response):
        for href in response.xpath('//a[@class="cover"]/@href').extract():
            yield response.follow(href, self.parse_concert)

    def parse_concert(self, response):
        yield {
            "website": response.request.url,
        }
This returns nothing. If I remove [@class="cover"], it does return a list of links, but not the ones I want. Here is a sample of the markup around one of the links I'm trying to follow:
<li class="event">
<article class="event-card">
<div class="event-overview">
<header>
<h2 class="event-title">Adulting 101</h2>
<time class="event-short-date" datetime="2018-07-06"> 06 <abbr title="July"> Jul </abbr> </time> <img alt="Adulting 101" class="event-img" height="225" src="https://d3e1o4bcbhmj8g.cloudfront.net/photos/678822/square_300/24625c20b3ddbd8771591e95150b15e06e6a24a2.jpg" width="225">
</header>
<div class="content">
<p>New to college? Club Downunder is here to help! Come out to Adulting 101 in the SLC 101s to learn about topics from healthy eating to stress management. There will be a DIY...</p>
</div>
</div>
<div class="event-details">
<strong class="event-detail-title">Adulting 101</strong>
<dl class="event-specs">
<dt class="event-date">
Date
<div class="clock"></div>
</dt>
<dd class="event-date"> <time datetime="2018-07-06"> Friday, July 6 </time> </dd>
<dt class="event-location">
Location
<div class="pin"></div>
</dt>
<dd class="event-location"> Askew Student Life Building (SLB) </dd>
</dl>
</div>
<a class="cover 0" href="https://calendar.fsu.edu/event/adulting101_cdu?utm_campaign=widget&utm_medium=widget&utm_source=Florida+State+University+Calendar" target="_blank" rel="nofollow">Adulting 101</a> <span class="start-time location"> 07:00 pm - Askew Student Life Building (SLB) </span>
</article>
</li>
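As a sanity check on the selectors themselves (separate from the live page), this is roughly what happens when I run them against a trimmed copy of the snippet above with parsel, the selection library Scrapy uses. The snippet variable below is just that copy, not anything fetched from the site:

from parsel import Selector

# Trimmed copy of the markup above; rel="nofollow" is kept to show it has
# no effect on selection.
snippet = '''
<li class="event">
  <article class="event-card">
    <a class="cover 0" href="https://calendar.fsu.edu/event/adulting101_cdu" target="_blank" rel="nofollow">Adulting 101</a>
  </article>
</li>
'''

sel = Selector(text=snippet)
# @class="cover" compares the whole attribute string, and here the value is "cover 0".
print(sel.xpath('//a[@class="cover"]/@href').extract())   # []
# CSS class selectors match individual class tokens instead.
print(sel.css('a.cover::attr(href)').extract())           # ['https://calendar.fsu.edu/event/adulting101_cdu']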
Does it have anything to do with the rel="nofollow" on the tags I'm after? All the Stack Overflow questions I can find are about making Scrapy respect it rather than ignore it, so I assume it ignores it by default.
This should be a quick site to scrape, but I can't get the links I need. What am I doing wrong? I've tried being more verbose with the path and using CSS instead of XPath, but nothing works. What gives?
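For what it's worth, here is a rough debugging sketch of what I can run in scrapy shell to check whether the markup is even present in the HTML Scrapy downloads (i.e. whether the event cards might be injected by JavaScript):

# Start with: scrapy shell https://union.fsu.edu/up/upcoming-events/
# `response` and `view()` are provided by the shell itself.
'class="cover' in response.text                                    # is the markup in the raw HTML at all?
response.xpath('//a[contains(@class, "cover")]/@href').extract()   # looser class match than @class="cover"
response.css('a.cover::attr(href)').extract()
view(response)   # open the downloaded, JavaScript-free HTML in a browser to compare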
Adding the crawl log:
2018-07-01 13:38:04 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: thescraper)
2018-07-01 13:38:04 [scrapy.utils.log] INFO: Versions: lxml 4.1.1.0, libxml2 2.9.5, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 18.4.0, Python 2.7.14 (v2.7.14:84471935ed, Sep 16 2017, 20:19:30) [MSC v.1500 32 bit (Intel)], pyOpenSSL 18.0.0 (OpenSSL 1.1.0h 27 Mar 2018), cryptography 2.2.2, Platform Windows-10-10.0.17134
2018-07-01 13:38:04 [scrapy.crawler] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'thescraper.spiders', 'FEED_URI': 'output/cduoutput.json', 'SPIDER_MODULES': ['thescraper.spiders'], 'BOT_NAME': 'thescraper', 'ROBOTSTXT_OBEY': True, 'USER_AGENT': 'TallyMusicWebScraper (http://www.tallymusic.net)', 'FEED_FORMAT': 'json'}
2018-07-01 13:38:05 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2018-07-01 13:38:05 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-07-01 13:38:05 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-07-01 13:38:05 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-07-01 13:38:05 [scrapy.core.engine] INFO: Spider opened
2018-07-01 13:38:05 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-07-01 13:38:05 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-07-01 13:38:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://union.fsu.edu/robots.txt> (referer: None)
2018-07-01 13:38:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://union.fsu.edu/up/upcoming-events/> (referer: None)
2018-07-01 13:38:05 [scrapy.core.engine] INFO: Closing spider (finished)
2018-07-01 13:38:05 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 481,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 15518,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2018, 7, 1, 17, 38, 5, 855000),
'log_count/DEBUG': 3,
'log_count/INFO': 7,
'response_received_count': 2,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2018, 7, 1, 17, 38, 5, 404000)}
2018-07-01 13:38:05 [scrapy.core.engine] INFO: Spider closed (finished)