在scrapy python中获得yield的错误

时间:2012-12-12 05:05:19

标签: python scrapy

我有这个代码。当我使用yield来请求更多链接时,我得到这个错误

Spider must return Request, BaseItem or None, got 'dict' 

我已经尝试了一切,但我无法摆脱错误

代码在这里

def parse_items(self, response):
    hxs = HtmlXPathSelector(response)
    sites = hxs.select("//li[contains(concat(' ', @class, ' '), ' mod-searchresult-entry ')]")
    items = []


    for site in sites[:2]:

        item = SeekItem()
        item['title'] = myfilter(site.select('dl/dd/h2/a').select("string()").extract())
        item['link_url'] = myfilter(site.select('dl/dd/h2/em').select("string()").extract())
        item['description'] = myfilter(site.select('dl/dd/p').select("string()").extract())
        if  item['link_url']:
                      yield Request(urljoin('http://www.seek.com.au/', item['link_url']),
                      meta = item,
                      callback = self.parseItemDescription)

        yield item

def parseItemDescription(self, response):

    item = response.meta
    hxs = HtmlXPathSelector(response)
    sites = hxs.select("//li[contains(concat(' ', @class, ' '), ' mod-searchresult-entry ')]")
    item['description'] = "mytest"

    return item

2 个答案:

答案 0 :(得分:4)

您使用的是哪种版本的scrapy? 0.16.2的文档使用passing items to another callback的方法。

def parse_items(self, response):
    hxs = HtmlXPathSelector(response)
    sites = hxs.select("//li[contains(concat(' ', @class, ' '), ' mod-searchresult-entry ')]")
    items = []

    for site in sites[:2]:
        item = SeekItem()
        item['title'] = myfilter(site.select('dl/dd/h2/a').select("string()").extract())
        item['link_url'] = myfilter(site.select('dl/dd/h2/em').select("string()").extract())
        item['description'] = myfilter(site.select('dl/dd/p').select("string()").extract())
        if item['link_url']:
            request = Request("http://www.example.com/some_page.html", callback=self.parseItemDescription)
            request.meta['item'] = item
            return request

def parseItemDescription(self, response):

    item = response.meta['item']
    hxs = HtmlXPathSelector(response)
    sites = hxs.select("//li[contains(concat(' ', @class, ' '), ' mod-searchresult-entry ')]")
    item['description'] = "mytest"

    return item

注意:这是未经测试的,因为其余代码(spider,items.py等)丢失了,我不确定这是如何运行的

答案 1 :(得分:3)

两次收益

你屈服两次 - 第一次是请求;第二个是 dic 。 ( yield Request(...) yield item

我猜第二次是不必要的,应该删除。试试并在下面评论。 (删除显示产量项的行)