I'm parsing a website and want to extract some data about the authors. Since some authors write many articles, I save the author information to a database. Only when I have no information about a new author yet do I want to crawl the author's sub-page, extract the data, and save it to the database. I've read a lot of documentation about how to extract data from other pages, but I can't get it to work. I'm fairly new to Python and don't fully understand the difference between return and yield.
Here is part of my code:
```python
def start_requests(self):
    for url in self.getUrlsToCrawl():
        yield self.buildRequest(url[1], url[0])

def buildRequest(self, url, dbid):
    return SplashRequest(url, self.parse,
                         endpoint='execute',
                         cache_args=['lua_source'],
                         args={'lua_source': script, 'image': 0},
                         meta={'dbId': dbid, 'originalUrl': url},
                         errback=self.errback_httpbin, dont_filter=True)

def parse(self, response):
    [...]
    articleLoader.add_value('authors', self.getAuthors(response))
    [...]
    return articleLoader.load_item()

def getAuthors(self, response):
    authorsArray = []
    authorName = remove_tags(response.xpath(myxpath).extract_first())
    authorUrl = response.xpath(myxpath).extract_first()
    authorInfos = self.executeSQL('SELECT name, twitter, email FROM author WHERE LOWER(name) = LOWER(%s) and domain = %s', (authorName, self.domain, ))
    authorItem = AuthorItem()
    if len(authorInfos) != 0:
        authorItem['name'] = authorInfos[0][0]
        authorItem['twitter'] = authorInfos[0][1]
        authorItem['email'] = authorInfos[0][2]
    elif authorUrl:
        self.fetchAuthorInfos(authorUrl, authorItem)
    else:
        authorItem['name'] = authorName
    authorsArray.append(dict(authorItem))
    return authorsArray

def fetchAuthorInfos(self, url, authorItem):
    return SplashRequest(url, callback=self.parseAuthorInfos, meta={'item': authorItem})

def parseAuthorInfos(self, response):
    authorItem = response.meta['item']
    authorItem['name'] = 'toto'
    authorItem['twitter'] = 'titi'
    authorItem['email'] = 'maimai'
    return authorItem
```
I've (somewhat randomly) tried different combinations of return and yield. Sometimes I reach fetchAuthorInfos, sometimes not, but I never reach parseAuthorInfos. Everything works fine when the author's info is already in my database, or when I have no individual author URL to fetch.
Thanks for your help!
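For reference, the return/yield distinction the question hinges on, in plain Python (a minimal sketch, not Scrapy-specific): `return` produces a single value and ends the function, while `yield` turns the function into a generator that can hand out many values. Scrapy iterates over whatever a callback yields, so a Request that is merely built by a helper, and never yielded or returned by the callback itself, is silently discarded:

```python
def with_return():
    # Runs once and produces exactly one value.
    return 1

def with_yield():
    # A generator: each yield hands out one value; the caller
    # must iterate to receive them all.
    yield 1
    yield 2

print(with_return())        # 1
print(list(with_yield()))   # [1, 2]

def helper():
    # Stand-in for a function that builds a Request.
    return "a request"

def callback_wrong():
    helper()        # return value discarded -- nothing gets scheduled
    yield "item"

def callback_right():
    yield helper()  # the "request" actually reaches the caller
    yield "item"

print(list(callback_wrong()))  # ['item']
print(list(callback_right()))  # ['a request', 'item']
```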
EDIT
Thanks to Granitosaurus I found a workaround, but I'm still not 100% satisfied: an article can sometimes have several authors, and I'd like to be able to fetch the info for each of them.
```python
def parse(self, response):
    [...]
    return self.get_authors(response, articleLoader)

def get_authors(self, response, articleLoader):
    [...]
    elif authorUrl:
        return Request(authorUrl, callback=self.parse_author_infos, meta={'item': articleLoader})
    else:
        authorLoader.add_value('name', authorName)
        authorsArray.append(dict(authorLoader.load_item()))
    articleLoader.add_value('authors', authorsArray)
    return articleLoader.load_item()

def parse_author_infos(self, response):
    [...]
    return articleLoader.load_item()
```
Answer 0 (score: 0)
This line does almost nothing:
self.fetchAuthorInfos(author_url, author_item)
The function returns a SplashRequest, but it is never assigned, returned, or used in any way.
You're missing the chaining logic here. What you want to do is continue the request chain on to the author's page, e.g.:
```python
def parse(self, response):
    """This parses the article's page"""
    article_loader = ArticleLoader(response)
    # add some stuff to article loader
    <...>
    author_url = 'some_url'
    if author_url:
        # if there's a url to the author's page, scrape that url for the author's info
        return SplashRequest(author_url, callback=self.parse_author,
                             meta={'loader': article_loader})
    # otherwise just scrape the author info presented on the article page
    author_item = dict()
    author_item['name'] = 'foobar'
    article_loader.add_value('author', author_item)
    return article_loader.load_item()

def parse_author(self, response):
    """This parses the author's page"""
    article_loader = response.meta['loader']
    author_item = dict()
    author_item['name'] = 'toto'
    author_item['twitter'] = 'titi'
    author_item['email'] = 'maimai'
    article_loader.add_value('author', author_item)
    return article_loader.load_item()
```
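Regarding the follow-up in the edit (several authors per article): a common Scrapy pattern is to carry the list of remaining author URLs in meta and chain one request per author, only producing the item once the list is empty. Here is a minimal sketch of that control flow with Scrapy's scheduler simulated by plain function calls, so it runs standalone; the names (`parse_next_author`, `pending_urls`, `scrape_author`) are hypothetical, not from the original code:

```python
def scrape_author(url):
    # Stand-in for the extraction logic run on each author's page.
    return {'name': 'author of ' + url}

def parse_article(author_urls):
    """Kick off the chain: the article item plus the author URLs still to visit."""
    item = {'title': 'some article', 'authors': []}
    return parse_next_author(item, list(author_urls))

def parse_next_author(item, pending_urls):
    """In real Scrapy this would be a callback, with item and pending_urls
    carried in response.meta; each call handles one author page, then
    'requests' the next one."""
    if not pending_urls:
        return item                         # chain finished: yield the item
    url = pending_urls.pop(0)
    item['authors'].append(scrape_author(url))
    # In Scrapy: return Request(next_url, callback=..., meta=...)
    return parse_next_author(item, pending_urls)

result = parse_article(['/a1', '/a2'])
print(result['authors'])
# [{'name': 'author of /a1'}, {'name': 'author of /a2'}]
```

The same idea transfers directly to the answer's code: instead of returning one SplashRequest per article, pop the next author URL from `meta` in the callback and keep yielding requests until none are left.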