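I am trying to build and return a list using append(), but I am getting errors.
Is there a way to fix this? I have noted the errors I get as comments in the code.
Sorry, I am new to Python.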
class mySpider(CrawlSpider):
    name = "testspider"
    allowed_domains = ["example.com"]
    start_urls = (
        'http://www.example.com/test-page.html',
    )
    rules = (
        Rule(LinkExtractor(allow=('')), callback="parse_items", follow=True),
    )

    def parse_items(self, response):
        item = myItem()
        #Extract some items
        item['status'] = response.status
        yield item
        inlinks = []
        links = LinkExtractor(canonicalize=False, unique=True).extract_links(response)
        for link in links:
            is_allowed = False
            for allowed_domain in self.allowed_domains:
                if allowed_domain in link.url:
                    is_allowed = True
            if is_allowed:
                inlink = anotherItem()
                inlink['url_from'] = response.url
                inlink['url_to'] = link.url
                inlinks.append(inlink)
        yield inlinks #ERROR: Spider must return Request, BaseItem, dict or None, got 'list' in <GET http://www.example.com/test-page.html>
        #if using yield inlink of course I get just the first element, in my case I get only the first URL for every unique page
        #using return inlinks I get
Answer 0 (score: -1)
yield one item at a time. There is no need to build a list and return it at the end.
for link in links:
    is_allowed = False
    for allowed_domain in self.allowed_domains:
        if allowed_domain in link.url:
            is_allowed = True
    if is_allowed:
        inlink = anotherItem()
        inlink['url_from'] = response.url
        inlink['url_to'] = link.url
        yield inlink
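To see why yielding inside the loop returns every link rather than only the first, here is a minimal sketch in plain Python, with no Scrapy involved. The function name `parse_items_sketch`, its parameters, and the use of plain dicts in place of `myItem` and `anotherItem` are assumptions made for illustration only.

```python
def parse_items_sketch(page_url, status, link_urls, allowed_domains):
    """Yield one item for the page, then one item per allowed link."""
    # First yield the item that describes the page itself.
    yield {"url": page_url, "status": status}

    for url in link_urls:
        # Same substring check as in the question, written with any().
        if any(domain in url for domain in allowed_domains):
            # The yield sits inside the loop, so the generator resumes
            # here and produces one item for each matching link.
            yield {"url_from": page_url, "url_to": url}


# Hypothetical input data, standing in for a crawled response.
items = list(parse_items_sketch(
    "http://www.example.com/test-page.html",
    200,
    [
        "http://www.example.com/a.html",
        "http://other.org/b.html",
        "http://www.example.com/c.html",
    ],
    ["example.com"],
))

for entry in items:
    print(entry)
```

If only the first link shows up in practice, a likely cause is that the yield ended up outside the for loop or was replaced by return, which ends the function at the first match.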
Answer 1 (score: -1)
The error message is clear: a Spider must return Request, BaseItem, dict or None.
But you are returning a list (called an array in PHP).
for link in links:
    is_allowed = False
    for allowed_domain in self.allowed_domains:
        if allowed_domain in link.url:
            is_allowed = True
    if is_allowed:
        inlink = anotherItem()
        inlink['url_from'] = response.url
        inlink['url_to'] = link.url
        yield inlink
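You can use this code to avoid the error; it simply yields one item at a time.
Or, if you do want to return/yield all the items at once, you can do this: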
inlinks = []  # must be initialised before the loop
for link in links:
    is_allowed = False
    for allowed_domain in self.allowed_domains:
        if allowed_domain in link.url:
            is_allowed = True
    if is_allowed:
        inlink = anotherItem()
        inlink['url_from'] = response.url
        inlink['url_to'] = link.url
        inlinks.append(inlink)
# A dict is an accepted return type, so wrap the list in one
yield {'all_links': inlinks}
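The difference between the two approaches can be checked without Scrapy. The following is a rough sketch under the same assumptions as the answer's code; `collect_links_sketch` and the sample URLs are hypothetical, and plain dicts stand in for `anotherItem`.

```python
def collect_links_sketch(page_url, link_urls, allowed_domains):
    """Yield a single dict that carries every allowed link for the page."""
    inlinks = []  # the list has to exist before append() is called
    for url in link_urls:
        if any(domain in url for domain in allowed_domains):
            inlinks.append({"url_from": page_url, "url_to": url})
    # One yield after the loop: the whole list travels inside one dict.
    yield {"all_links": inlinks}


# Hypothetical input data, standing in for a crawled response.
results = list(collect_links_sketch(
    "http://www.example.com/test-page.html",
    ["http://www.example.com/a.html", "http://other.org/b.html"],
    ["example.com"],
))

print(len(results))
print(results[0]["all_links"])
```

With this variant each page produces exactly one item whose `all_links` field holds a nested list, so an exported feed would contain one record per page rather than one per link. Which shape is preferable depends on how the output is consumed.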