Scrapy yields no results (crawled 0 pages)

Date: 2017-10-06 22:51:57

Tags: scrapy

I'm trying to figure out how Scrapy works and use it to find information on a forum.

items.py

import scrapy


class BodybuildingItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()

spider.py

from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from bodybuilding.items import BodybuildingItem

class BodyBuildingSpider(BaseSpider):
    name = "bodybuilding"
    allowed_domains = ["forum.bodybuilding.nl"]
    start_urls = [
        "https://forum.bodybuilding.nl/fora/supplementen.22/"
    ]

    def parse(self, response):
        responseSelector = Selector(response)
        for sel in responseSelector.css('li.past.line.event-item'):
            item = BodybuildingItem()
            item['title'] = sel.css('a.data-previewUrl::text').extract()
            yield item

The forum I'm trying to get post titles from in this example is: https://forum.bodybuilding.nl/fora/supplementen.22/

But I keep getting no results:

  

2017-10-07 00:42:28 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: bodybuilding)
2017-10-07 00:42:28 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'bodybuilding.spiders', 'SPIDER_MODULES': ['bodybuilding.spiders'], 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'bodybuilding'}
2017-10-07 00:42:28 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.corestats.CoreStats']
2017-10-07 00:42:28 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-10-07 00:42:28 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-10-07 00:42:28 [scrapy.middleware] INFO: Enabled item pipelines: []
2017-10-07 00:42:28 [scrapy.core.engine] INFO: Spider opened
2017-10-07 00:42:28 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-10-07 00:42:28 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://forum.bodybuilding.nl/robots.txt> (referer: None)
2017-10-07 00:42:29 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://forum.bodybuilding.nl/fora/supplementen.22/> (referer: None)
2017-10-07 00:42:29 [scrapy.core.engine] INFO: Closing spider (finished)
2017-10-07 00:42:29 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 469,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 22878,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 1,
 'downloader/response_status_count/404': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 10, 6, 22, 42, 29, 223305),
 'log_count/DEBUG': 2,
 'log_count/INFO': 7,
 'memusage/max': 31735808,
 'memusage/startup': 31735808,
 'response_received_count': 2,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2017, 10, 6, 22, 42, 28, 816043)}
2017-10-07 00:42:29 [scrapy.core.engine] INFO: Spider closed (finished)

I have been following the guide here: http://blog.florian-hopf.de/2014/07/scrapy-and-elasticsearch.html

Update 1:

I was told I needed to update my code to the current standard, but it didn't change the results:

from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from bodybuilding.items import BodybuildingItem

class BodyBuildingSpider(BaseSpider):
    name = "bodybuilding"
    allowed_domains = ["forum.bodybuilding.nl"]
    start_urls = [
        "https://forum.bodybuilding.nl/fora/supplementen.22/"
    ]

    def parse(self, response):
        for sel in response.css('li.past.line.event-item'):
            item = BodybuildingItem()
            item['title'] = sel.css('a.data-previewUrl::text').extract_first()
            yield item

Final update and fix

After some good help I finally got it working with this spider:

import scrapy

class BlogSpider(scrapy.Spider):
    name = 'bodybuilding'
    start_urls = ['https://forum.bodybuilding.nl/fora/supplementen.22/']

    def parse(self, response):
        for title in response.css('h3.title'):
            yield {'title': title.css('a::text').extract_first()}

        next_page_url = response.xpath("//a[text()='Volgende >']/@href").extract_first()
        if next_page_url:
            yield response.follow(next_page_url, callback=self.parse)
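
For reference, the spider can also be run from a plain Python script instead of the scrapy CLI. A minimal sketch, assuming Scrapy 1.4-era feed settings (titles.json is an arbitrary output path, not something from the original post):

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess(settings={
    'FEED_URI': 'titles.json',   # hypothetical output file for the scraped items
    'FEED_FORMAT': 'json',
})
process.crawl(BlogSpider)  # the spider class defined above
process.start()            # blocks until the crawl finishes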

1 Answer:

Answer 0 (score: 1)

You should use response.css('li.past.line.event-item') directly; there is no need for responseSelector = Selector(response).
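
A minimal sketch of the difference (same parse method, with the wrapper removed):

def parse(self, response):
    # Response objects expose .css() and .xpath() directly,
    # so Selector(response) is redundant
    for sel in response.css('li.past.line.event-item'):
        ...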

The li.past.line.event-item CSS selector you are using is also no longer valid, so you first need to update it to match the current page.
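
A convenient way to test candidate selectors against the live page is scrapy shell; the h3.title selector below is the one the final spider ended up using, shown here as an example (the actual return value depends on the page at crawl time):

scrapy shell "https://forum.bodybuilding.nl/fora/supplementen.22/"
>>> response.css('h3.title a::text').extract_first()
# returns the first thread title on the page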

To get the next page URL, you can use:

>>> response.css("a.text::attr(href)").extract_first()
'fora/supplementen.22/page-2'

Then use response.follow to follow this relative URL.
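
response.follow was added in Scrapy 1.4 and resolves relative URLs against the current response itself; with a plain scrapy.Request you would have to make the URL absolute first. A sketch of the two equivalent forms:

# response.follow accepts the relative URL as-is:
yield response.follow(next_page_url, callback=self.parse)

# with scrapy.Request the URL must be joined manually:
yield scrapy.Request(response.urljoin(next_page_url), callback=self.parse)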

Edit 2: Next-page handling correction

The previous edit didn't work, because on the next page it matched the previous page's URL, so you need to use the following instead:

next_page_url = response.xpath("//a[text()='Volgende >']/@href").extract_first()
if next_page_url:
   yield response.follow(next_page_url, callback=self.parse)

Edit 1: Next-page handling

next_page_url = response.css("a.text::attr(href)").extract_first()
if next_page_url:
   yield response.follow(next_page_url, callback=self.parse)