Question

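I am trying to build a list with append() and then yield or return it, but I am getting some errors.

Is there a way around this? I have noted the errors I get in the code comments.

Sorry, I am new to Python.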

class mySpider(CrawlSpider):
    name = "testspider"
    allowed_domains = ["example.com"]
    start_urls = (
        'http://www.example.com/test-page.html',
    )

    rules = (
        Rule(LinkExtractor(allow=('')), callback="parse_items", follow=True),
    )

    def parse_items(self, response):
        item = myItem()

        #Extract some items
        item['status'] = response.status
        yield item

        inlinks = []
        links = LinkExtractor(canonicalize=False, unique=True).extract_links(response)
        for link in links:
            is_allowed = False
            for allowed_domain in self.allowed_domains:
                if allowed_domain in link.url:
                    is_allowed = True
            if is_allowed:
                inlink = anotherItem()
                inlink['url_from'] = response.url
                inlink['url_to'] = link.url
                inlinks.append(inlink)
        yield inlinks #ERROR: Spider must return Request, BaseItem, dict or None, got 'list' in <GET http://www.example.com/test-page.html>
        #if using yield inlink of course I get just the first element, in my case I get only the first URL for every unique page 
        #using return inlinks I get

Answer 1

Yield one item at a time. There is no need to build a list and return it at the end.

    for link in links:
        is_allowed = False
        for allowed_domain in self.allowed_domains:
            if allowed_domain in link.url:
                is_allowed = True
        if is_allowed:
            inlink = anotherItem()
            inlink['url_from'] = response.url
            inlink['url_to'] = link.url
            yield inlink
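
The idea can be tried outside Scrapy. The following is a minimal sketch, not the asker's spider: plain dicts stand in for `anotherItem`, and `ALLOWED_DOMAINS`, `parse_items` and `parse_items_from_list` are hypothetical names chosen for illustration. It also shows that, if a list already exists, `yield from` emits its elements one by one.

```python
# Plain-Python stand-in for the callback, so it runs without Scrapy installed.
ALLOWED_DOMAINS = ["example.com"]  # mirrors allowed_domains in the question


def parse_items(page_url, link_urls):
    """Yield one dict per allowed link instead of collecting them in a list."""
    for url in link_urls:
        # Same substring test as the original loop, written with any().
        if any(domain in url for domain in ALLOWED_DOMAINS):
            yield {"url_from": page_url, "url_to": url}


def parse_items_from_list(page_url, link_urls):
    """Build the list first, then emit each element with `yield from`."""
    inlinks = [
        {"url_from": page_url, "url_to": url}
        for url in link_urls
        if any(domain in url for domain in ALLOWED_DOMAINS)
    ]
    yield from inlinks


links = [
    "http://www.example.com/a.html",
    "http://other.org/b.html",
    "http://www.example.com/c.html",
]
page = "http://www.example.com/test-page.html"
results = list(parse_items(page, links))
results_from_list = list(parse_items_from_list(page, links))
```

Both generators produce the same sequence of dicts, one per allowed link, so whoever consumes the generator never sees a list object.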

Answer 2

The error message is clear: a spider must return Request, BaseItem, dict or None.

But you are yielding a list (what PHP calls an array), which is none of those types.

    for link in links:
        is_allowed = False
        for allowed_domain in self.allowed_domains:
            if allowed_domain in link.url:
                is_allowed = True
        if is_allowed:
            inlink = anotherItem()
            inlink['url_from'] = response.url
            inlink['url_to'] = link.url
            yield inlink

You can use this code to avoid the error: it simply yields one item at a time.
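
To see why the list is rejected while each of its elements is fine, here is a rough illustration of the rule stated in the error message. It is not Scrapy's source code: `Request` and `BaseItem` below are empty stand-in classes, and `is_accepted` is a hypothetical helper written only for this sketch.

```python
class Request:
    """Stand-in for scrapy.Request, only here so the sketch runs without Scrapy."""


class BaseItem:
    """Stand-in for Scrapy's item base class."""


# The types named in the error message (None is handled separately below).
ACCEPTED_TYPES = (Request, BaseItem, dict)


def is_accepted(value):
    """Return True if value is one of the types named in the error message."""
    return value is None or isinstance(value, ACCEPTED_TYPES)


inlinks = [
    {
        "url_from": "http://www.example.com/test-page.html",
        "url_to": "http://www.example.com/a.html",
    },
]

list_ok = is_accepted(inlinks)  # the list as a whole
items_ok = all(is_accepted(item) for item in inlinks)  # each element on its own
```

Under these assumptions `list_ok` is false and `items_ok` is true, which matches what the question reports: the list is refused, but its elements are not.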

Or, if you really do want to return/yield all the items at once, you can do this:

    inlinks = []  # collect every allowed link found on this page
    for link in links:
        is_allowed = False
        for allowed_domain in self.allowed_domains:
            if allowed_domain in link.url:
                is_allowed = True
        if is_allowed:
            inlink = anotherItem()
            inlink['url_from'] = response.url
            inlink['url_to'] = link.url
            inlinks.append(inlink)
    # Wrap the list in a dict: a dict is an accepted type, a bare list is not.
    yield {'all_links': inlinks}
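
The trade-off is that wrapping the list produces a single record per page rather than one record per link. A small sketch without Scrapy makes this visible; dicts stand in for `anotherItem`, and `parse_items_wrapped` is a hypothetical name used only here.

```python
def parse_items_wrapped(page_url, link_urls, allowed_domain="example.com"):
    """Collect all allowed links, then yield them inside one dict."""
    inlinks = []
    for url in link_urls:
        if allowed_domain in url:
            inlinks.append({"url_from": page_url, "url_to": url})
    # Only one object leaves the generator, whatever the number of links.
    yield {"all_links": inlinks}


links = [
    "http://www.example.com/a.html",
    "http://other.org/b.html",
    "http://www.example.com/c.html",
]
records = list(parse_items_wrapped("http://www.example.com/test-page.html", links))
```

Here `records` holds exactly one dict whose `all_links` value contains the two allowed links. If each link should end up as its own row in the output, yielding them one at a time is the better fit.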

Scrapy: how to yield a list built with append()
