Error: Spider must return Request, BaseItem, dict or None, got 'set' in GET

Date: 2019-08-13 10:54:13

Tags: python web-scraping scrapy

I am trying to crawl every page on gogoanime1.com whose URL contains "watch/". The spider below worked fine on a different site, but for some reason my log now fills with errors like this one:

[scrapy.core.scraper] ERROR: Spider must return Request, BaseItem, dict or None, got 'set' in <GET https://www.gogoanime1.com/watch/cardfight-vanguard-g-next/episode/episode-48/1>

On top of that, none of the data makes it into my output JSON.

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class GogocrawlerSpider(CrawlSpider):
    name = 'gogocrawler'
    allowed_domains = ['gogoanime1.com']
    start_urls = ['http://gogoanime1.com/']

    rules = (
        Rule(LinkExtractor(allow=r'watch/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        yield {response.url}  # braces around a single value make a set, not a dict

Part of my log:

[scrapy.core.scraper] ERROR: Spider must return Request, BaseItem, dict or None, got 'set' in <GET https://www.gogoanime1.com/watch/cardfight-vanguard-link-joker-hen/episode/episode-1>
2019-08-13 16:26:16 [scrapy.core.scraper] ERROR: Spider must return Request, BaseItem, dict or None, got 'set' in <GET https://www.gogoanime1.com/watch/cardfight-vanguard-link-joker-hen/episode/episode-2>
2019-08-13 16:26:16 [scrapy.core.scraper] ERROR: Spider must return Request, BaseItem, dict or None, got 'set' in <GET https://www.gogoanime1.com/watch/cardfight-vanguard-g-next/episode/episode-43/1>
2019-08-13 16:26:16 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.gogoanime1.com/watch/cardfight-vanguard-g-next/episode/episode-44/1> (referer: https://www.gogoanime1.com/watch/cardfight-vanguard-g-next/episode/episode-44)
2019-08-13 16:26:16 [scrapy.core.scraper] ERROR: Spider must return Request, BaseItem, dict or None, got 'set' in <GET https://www.gogoanime1.com/watch/cardfight-vanguard-link-joker-hen/episode/episode-4>
2019-08-13 16:26:16 [scrapy.core.scraper] ERROR: Spider must return Request, BaseItem, dict or None, got 'set' in <GET https://www.gogoanime1.com/watch/cardfight-vanguard-link-joker-hen/episode/episode-5>

1 Answer:

Answer 0 (score: 1)

As the error log says, your callback needs to return a Request, a BaseItem, a dict, or None.
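The underlying issue is a Python syntax detail: braces around a bare value create a set, while braces around a key-value pair create a dict. A quick check in the REPL (the URL here is just a placeholder) shows the difference:

>>> url = "https://www.gogoanime1.com/watch/some-show"
>>> type({url})          # braces around a bare value -> a one-element set
<class 'set'>
>>> type({'url': url})   # braces around a key: value pair -> a dict
<class 'dict'>

Since a set is none of the accepted return types, Scrapy rejects the item on every page, which is why output.json stays empty.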

So this should work for you:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class GogocrawlerSpider(CrawlSpider):
    name = 'gogocrawler'
    allowed_domains = ['gogoanime1.com']
    start_urls = ['http://gogoanime1.com/']

    rules = (
        Rule(LinkExtractor(allow=r'watch/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        yield {'url': response.url}  # a dict with an explicit key, not a set

You should now see the data in output.json.
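If you later want more structure than a plain dict, Scrapy callbacks can also yield Item objects. A minimal sketch, assuming a single url field (the class and field names here are illustrative, not part of the original project):

import scrapy

class PageItem(scrapy.Item):
    # declare the fields the spider is allowed to populate
    url = scrapy.Field()

With that in place, parse_item could yield PageItem(url=response.url) instead of the dict, and the export works the same way, e.g. scrapy crawl gogocrawler -o output.json.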