(Python 3) Spider must return Request, BaseItem, dict or None, got 'generator'

Asked: 2017-09-11 17:54:47

Tags: python-3.x web-scraping scrapy

I am writing a Scrapy script to pull the latest blog posts from Paul Krugman's NYT blog. The project has been going smoothly, but when I get to the point of actually trying to extract the data, I keep hitting the same error:

ERROR: Spider must return Request, BaseItem, dict or None, got 'generator' in <GET https://krugman.blogs.nytimes.com/more_posts_jsons/page/1/?homepage=1&apagenum=1>

The code I am using is below:

import json

from scrapy import http
from scrapy.selector import Selector
from scrapy.spiders import CrawlSpider
import scrapy
from tutorial.items import BlogPost


class krugSpider(CrawlSpider):
    name = 'krugbot'
    start_urls = ['https://krugman.blogs.nytimes.com']

    def __init__(self):
        self.url = 'https://krugman.blogs.nytimes.com/more_posts_jsons/page/{0}/?homepage=1&apagenum={0}'

    def start_requests(self):
        yield http.Request(self.url.format('1'), callback = self.parse_page)

    def parse_page(self, response):
        data = json.loads(response.body)
        for block in range(len(data['posts'])):
            yield self.parse_block(data['posts'][block])

        page = data['args']['paged'] + 1
        url = self.url.format(str(page))
        yield http.Request(url, callback = self.parse_page)


    def parse_block(self, block):
        for content in block:
            article = BlogPost(author = 'Paul Krugman', source = 'Blog')

            paragraphs = Selector(text = content['html'])

            article['paragraphs'] = paragraphs.xpath('article/p').extract()
            article['datetime'] = content['post_date']
            article['post_id'] = content['post_id']
            article['url'] = content['permalink']
            article['title'] = content['headline']

            yield article

For reference, the items.py file is:

from scrapy import Item, Field

class BlogPost(Item):
    author = Field()
    source = Field()
    datetime = Field()
    url = Field()
    post_id = Field()
    title = Field()
    paragraph = Field()

The program should be returning Scrapy 'Item' class objects, not a generator, so I am not sure why it is yielding a generator. Any suggestions?

2 Answers:

Answer 0 (score: 2)

This happens because you are yielding a generator inside parse_page. Look at this line:

yield self.parse_block(data['posts'][block])

It yields the output of parse_block, but parse_block returns a generator (since it, too, yields multiple objects).

It should work if you change it to:
for block in range(len(data['posts'])):
    for article in self.parse_block(data['posts'][block]):
        yield article
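The fix works because the inner loop consumes parse_block's generator and yields each item individually, instead of yielding the generator object itself. A minimal standalone sketch of the difference (not Scrapy-specific; the function names are illustrative):

```python
def inner():
    # Any function containing `yield` returns a generator when called.
    yield 1
    yield 2

def outer_wrong():
    # Yields the generator object itself: one item of type 'generator',
    # which is exactly what triggers the Scrapy error in the question.
    yield inner()

def outer_right():
    # Iterates the inner generator and yields each of its items in turn.
    for item in inner():
        yield item

print([type(x).__name__ for x in outer_wrong()])  # ['generator']
print(list(outer_right()))                        # [1, 2]
```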

Answer 1 (score: 2)

I believe you can also use yield from instead of iterating over self.parse_block(data['posts'][block]) and yielding each item as in the accepted answer:

yield from self.parse_block(data['posts'][block])
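As a sketch of the equivalence (standalone, illustrative names), yield from delegates to the inner generator and yields each of its items, exactly like the explicit for-loop in the accepted answer:

```python
def inner():
    yield 'a'
    yield 'b'

def outer():
    # Delegates to the inner generator: each of its items is yielded
    # in turn, so the caller never sees a raw generator object.
    yield from inner()

print(list(outer()))  # ['a', 'b']
```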