I'm trying to figure out how Scrapy works and use it to find information on a forum.
items.py:

    import scrapy

    class BodybuildingItem(scrapy.Item):
        # define the fields for your item here like:
        title = scrapy.Field()
        pass
spider.py:

    from scrapy.spider import BaseSpider
    from scrapy.selector import Selector
    from bodybuilding.items import BodybuildingItem

    class BodyBuildingSpider(BaseSpider):
        name = "bodybuilding"
        allowed_domains = ["forum.bodybuilding.nl"]
        start_urls = [
            "https://forum.bodybuilding.nl/fora/supplementen.22/"
        ]

        def parse(self, response):
            responseSelector = Selector(response)
            for sel in responseSelector.css('li.past.line.event-item'):
                item = BodybuildingItem()
                item['title'] = sel.css('a.data-previewUrl::text').extract()
                yield item
The forum I'm trying to get post titles from in this example is: https://forum.bodybuilding.nl/fora/supplementen.22/

But I keep getting no results:
    2017-10-07 00:42:28 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: bodybuilding)
    2017-10-07 00:42:28 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'bodybuilding.spiders', 'SPIDER_MODULES': ['bodybuilding.spiders'], 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'bodybuilding'}
    2017-10-07 00:42:28 [scrapy.middleware] INFO: Enabled extensions:
    ['scrapy.extensions.memusage.MemoryUsage',
     'scrapy.extensions.logstats.LogStats',
     'scrapy.extensions.corestats.CoreStats']
    2017-10-07 00:42:28 [scrapy.middleware] INFO: Enabled downloader middlewares:
    ['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
     'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
     'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
     'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
     'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
     'scrapy.downloadermiddlewares.retry.RetryMiddleware',
     'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
     'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
     'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
     'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
     'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
     'scrapy.downloadermiddlewares.stats.DownloaderStats']
    2017-10-07 00:42:28 [scrapy.middleware] INFO: Enabled spider middlewares:
    ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
     'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
     'scrapy.spidermiddlewares.referer.RefererMiddleware',
     'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
     'scrapy.spidermiddlewares.depth.DepthMiddleware']
    2017-10-07 00:42:28 [scrapy.middleware] INFO: Enabled item pipelines: []
    2017-10-07 00:42:28 [scrapy.core.engine] INFO: Spider opened
    2017-10-07 00:42:28 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2017-10-07 00:42:28 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://forum.bodybuilding.nl/robots.txt> (referer: None)
    2017-10-07 00:42:29 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://forum.bodybuilding.nl/fora/supplementen.22/> (referer: None)
    2017-10-07 00:42:29 [scrapy.core.engine] INFO: Closing spider (finished)
    2017-10-07 00:42:29 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 469,
     'downloader/request_count': 2,
     'downloader/request_method_count/GET': 2,
     'downloader/response_bytes': 22878,
     'downloader/response_count': 2,
     'downloader/response_status_count/200': 1,
     'downloader/response_status_count/404': 1,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2017, 10, 6, 22, 42, 29, 223305),
     'log_count/DEBUG': 2,
     'log_count/INFO': 7,
     'memusage/max': 31735808,
     'memusage/startup': 31735808,
     'response_received_count': 2,
     'scheduler/dequeued': 1,
     'scheduler/dequeued/memory': 1,
     'scheduler/enqueued': 1,
     'scheduler/enqueued/memory': 1,
     'start_time': datetime.datetime(2017, 10, 6, 22, 42, 28, 816043)}
    2017-10-07 00:42:29 [scrapy.core.engine] INFO: Spider closed (finished)
I have been following the guide here: http://blog.florian-hopf.de/2014/07/scrapy-and-elasticsearch.html

Update 1:

Someone told me I needed to update my code to the new standard, but it didn't change the results:
    from scrapy.spider import BaseSpider
    from scrapy.selector import Selector
    from bodybuilding.items import BodybuildingItem

    class BodyBuildingSpider(BaseSpider):
        name = "bodybuilding"
        allowed_domains = ["forum.bodybuilding.nl"]
        start_urls = [
            "https://forum.bodybuilding.nl/fora/supplementen.22/"
        ]

        def parse(self, response):
            for sel in response.css('li.past.line.event-item'):
                item = BodybuildingItem()
                yield {'title': title.css('a.data-previewUrl::text').extract_first()}
                yield item
Final update and fix:

After some good help, I finally got it working with this spider:
    import scrapy

    class BlogSpider(scrapy.Spider):
        name = 'bodybuilding'
        start_urls = ['https://forum.bodybuilding.nl/fora/supplementen.22/']

        def parse(self, response):
            for title in response.css('h3.title'):
                yield {'title': title.css('a::text').extract_first()}

            next_page_url = response.xpath("//a[text()='Volgende >']/@href").extract_first()
            if next_page_url:
                yield response.follow(next_page_url, callback=self.parse)
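The `h3.title` extraction can be sanity-checked offline without running the full crawl; here is a minimal stdlib sketch of the same "anchor text inside a title heading" logic (the HTML snippet and thread title are made up, not copied from the live forum):

```python
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collect the text of <a> tags nested inside <h3 class="title"> elements."""
    def __init__(self):
        super().__init__()
        self.in_title_h3 = False
        self.in_link = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'h3' and 'title' in attrs.get('class', '').split():
            self.in_title_h3 = True
        elif tag == 'a' and self.in_title_h3:
            self.in_link = True

    def handle_endtag(self, tag):
        if tag == 'h3':
            self.in_title_h3 = False
        if tag == 'a':
            self.in_link = False

    def handle_data(self, data):
        if self.in_link:
            self.titles.append(data.strip())

# hypothetical thread-list markup for a quick offline check
html = '<h3 class="title"><a href="/threads/1/">Creatine basics</a></h3>'
parser = TitleParser()
parser.feed(html)
print(parser.titles)  # → ['Creatine basics']
```

This is only a rough stand-in for quick tests against saved HTML; the Scrapy selectors in the spider above remain the real extraction path.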
Answer (score: 1):
You should use response.css('li.past.line.event-item') directly; there is no need for responseSelector = Selector(response). Also, the li.past.line.event-item CSS selector you are using is no longer valid, so you first need to update it based on the latest version of the page.
To get the next page URL, you can use:

    >>> response.css("a.text::attr(href)").extract_first()
    'fora/supplementen.22/page-2'

and then use response.follow to follow this relative URL.
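response.follow resolves a relative href against the URL of the current response (and, unlike a plain string join, also honors any `<base>` tag in the page). The basic resolution can be sketched with the stdlib's urljoin; the `page-2` hrefs below are illustrative:

```python
from urllib.parse import urljoin

page_url = 'https://forum.bodybuilding.nl/fora/supplementen.22/'

# a path-relative href resolves against the current "directory"
print(urljoin(page_url, 'page-2'))
# → https://forum.bodybuilding.nl/fora/supplementen.22/page-2

# a root-relative href resolves against the site root
print(urljoin(page_url, '/fora/supplementen.22/page-2'))
# → https://forum.bodybuilding.nl/fora/supplementen.22/page-2
```

This is why passing the relative href straight to response.follow works without manually rebuilding the absolute URL.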
Edit 2: corrected next-page handling

The previous edit didn't work, because on the next page it matched the previous page's URL, so you need to use the following instead:
    next_page_url = response.xpath("//a[text()='Volgende >']/@href").extract_first()
    if next_page_url:
        yield response.follow(next_page_url, callback=self.parse)
Edit 1: next-page handling
    next_page_url = response.css("a.text::attr(href)").extract_first()
    if next_page_url:
        yield response.follow(next_page_url, callback=self.parse)