Scrapy: how to yield a list built with append()

Date: 2017-02-24 18:12:25

Tags: python python-2.7 scrapy scrapy-spider

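I am trying to yield or return a list that I build with append(), but I keep running into errors.

Is there a way around this? I have noted the errors I get as comments in the code.

Apologies, I am new to Python.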

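from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
# myItem and anotherItem are assumed to be Item classes defined elsewhere in the project

class mySpider(CrawlSpider):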
    name = "testspider"
    allowed_domains = ["example.com"]
    start_urls = (
        'http://www.example.com/test-page.html',
    )

    rules = (
        Rule(LinkExtractor(allow=('')), callback="parse_items", follow=True),
    )

    def parse_items(self, response):
        item = myItem()

        #Extract some items
        item['status'] = response.status
        yield item

        inlinks = []
        links = LinkExtractor(canonicalize=False, unique=True).extract_links(response)
        for link in links:
            is_allowed = False
            for allowed_domain in self.allowed_domains:
                if allowed_domain in link.url:
                    is_allowed = True
            if is_allowed:
                inlink = anotherItem()
                inlink['url_from'] = response.url
                inlink['url_to'] = link.url
                inlinks.append(inlink)
        yield inlinks #ERROR: Spider must return Request, BaseItem, dict or None, got 'list' in <GET http://www.example.com/test-page.html>
        #if using yield inlink of course I get just the first element, in my case I get only the first URL for every unique page 
        #using return inlinks I get 

2 Answers:

Answer 0 (score: -1)

Yield one item at a time. There is no need to build a list first and return it all at once.

    for link in links:
        is_allowed = False
        for allowed_domain in self.allowed_domains:
            if allowed_domain in link.url:
                is_allowed = True
        if is_allowed:
            inlink = anotherItem()
            inlink['url_from'] = response.url
            inlink['url_to'] = link.url
            yield inlink
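
As a rough illustration of why yielding inside the loop works, here is a minimal sketch that runs without Scrapy. The names `FakeLink` and `parse_links` and the sample URLs are hypothetical stand-ins, and plain dicts take the place of `anotherItem`. A function containing `yield` is a generator, so each `yield` inside the loop hands over one more item instead of ending the function after the first one.

```python
from collections import namedtuple

# Hypothetical stand-in for the link objects that a link extractor returns.
FakeLink = namedtuple("FakeLink", ["url"])


def parse_links(response_url, links, allowed_domains):
    """Yield one dict per link whose URL contains an allowed domain."""
    for link in links:
        # Same substring check as in the question, written with any().
        if any(domain in link.url for domain in allowed_domains):
            # A plain dict stands in for anotherItem() here (assumption).
            yield {"url_from": response_url, "url_to": link.url}


links = [
    FakeLink("http://www.example.com/a.html"),
    FakeLink("http://other.org/b.html"),
    FakeLink("http://www.example.com/c.html"),
]
items = list(
    parse_links("http://www.example.com/test-page.html", links, ["example.com"])
)
print(len(items))
```

If only the first link shows up in your output, it may be worth checking that the `yield` is indented inside the `if` block and that `return` was not used in its place.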

Answer 1 (score: -1)

The error message is clear: a spider must return a Request, BaseItem, dict or None.

But you are yielding a list (what PHP calls an array).

    for link in links:
        is_allowed = False
        for allowed_domain in self.allowed_domains:
            if allowed_domain in link.url:
                is_allowed = True
        if is_allowed:
            inlink = anotherItem()
            inlink['url_from'] = response.url
            inlink['url_to'] = link.url
            yield inlink 

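You can use this code to avoid the error: simply yield one item at a time.

Alternatively, if you want to return/yield all the items at once, you can do this: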

    inlinks = []  # collect matching links before yielding them together
    for link in links:
        is_allowed = False
        for allowed_domain in self.allowed_domains:
            if allowed_domain in link.url:
                is_allowed = True
        if is_allowed:
            inlink = anotherItem()
            inlink['url_from'] = response.url
            inlink['url_to'] = link.url
            inlinks.append(inlink)
    yield {'all_links': inlinks}
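
To make the difference between the two approaches concrete, here is a small sketch that runs without Scrapy. The function names `yield_each` and `yield_wrapped` and the sample data are hypothetical, chosen only for illustration. It compares emitting every collected item separately with wrapping the whole list in a single dict.

```python
def yield_each(inlinks):
    """Emit every collected item separately: one output record per link."""
    for inlink in inlinks:
        yield inlink


def yield_wrapped(inlinks):
    """Emit a single dict that carries the whole list under one key."""
    yield {"all_links": inlinks}


# Sample data standing in for the items collected with append() (assumption).
inlinks = [
    {"url_from": "http://www.example.com/test-page.html",
     "url_to": "http://www.example.com/a.html"},
    {"url_from": "http://www.example.com/test-page.html",
     "url_to": "http://www.example.com/c.html"},
]

separate = list(yield_each(inlinks))
wrapped = list(yield_wrapped(inlinks))
print(len(separate), len(wrapped))
```

The wrapped form produces one record per page with all links nested inside it, which changes the shape of the exported data. Which form is preferable depends on how the output will be consumed.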