I want to create an ExampleSpider which implements scrapy's CrawlSpider. My ExampleSpider should be able to process pages containing only artist info, pages containing only album info, and some other pages which contain both artist and album info.

I was able to handle the first two scenarios, but the problem occurs in the third. I am using a parse_artist(response) method to process artist data and a parse_album(response) method to process album data.

My question is: if a page contains both artist and album data, how should I define my rules? Or is there another (more proper) way to do this?
class ExampleSpider(CrawlSpider):
    name = 'example'
    start_urls = ['http://www.example.com']
    rules = [
        Rule(SgmlLinkExtractor(allow=[r'same regex_rule']), callback='parse_artist', follow=True),
        Rule(SgmlLinkExtractor(allow=[r'same regex_rule']), callback='parse_album', follow=True),
        # more rules .....
    ]

    def parse_artist(self, response):
        artist_item = ArtistItem()
        try:
            pass  # do the scrape and assign to ArtistItem
        except Exception:
            pass  # ignore for now
        return artist_item

    def parse_album(self, response):
        album_item = AlbumItem()
        try:
            pass  # do the scrape and assign to AlbumItem
        except Exception:
            pass  # ignore for now
        return album_item
Answer 0 (score: 8):

CrawlSpider calls the _requests_to_follow() method to extract URLs and generate the requests to follow:
def _requests_to_follow(self, response):
    if not isinstance(response, HtmlResponse):
        return
    seen = set()
    for n, rule in enumerate(self._rules):
        links = [l for l in rule.link_extractor.extract_links(response) if l not in seen]
        if links and rule.process_links:
            links = rule.process_links(links)
        seen = seen.union(links)
        for link in links:
            r = Request(url=link.url, callback=self._response_downloaded)
            r.meta.update(rule=n, link_text=link.text)
            yield rule.process_request(r)
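To see what this means for the question's two identical rules, here is a toy reduction of the seen-set logic above; it is plain Python with made-up callback names and links, not Scrapy itself:

# Toy reduction of the dedup in _requests_to_follow: two rules matching
# the same links, dispatched in rule order. All names are illustrative.
rules = ['parse_artist', 'parse_album']  # callbacks, in rule order
links_per_rule = {
    'parse_artist': ['/page1', '/page2'],
    'parse_album': ['/page1', '/page2'],  # same regex, so the same links
}

seen = set()
for callback in rules:
    links = [l for l in links_per_rule[callback] if l not in seen]
    seen.update(links)
    for link in links:
        print(callback, 'handles', link)
# Prints only "parse_artist handles ..." lines; parse_album never gets a link.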
As you can see, seen stores the already-extracted URLs, so each URL is parsed by at most one callback: with the question's two rules sharing the same regex, every matched link goes to parse_artist and parse_album is never called. Instead, you can define a single parse_item() that calls both parse_artist() and parse_album():
rules = [
    Rule(SgmlLinkExtractor(allow=[r'same regex_rule']), callback='parse_item', follow=True),
    # more rules .....
]

def parse_item(self, response):
    yield self.parse_artist(response)
    yield self.parse_album(response)
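For completeness, here is a minimal runnable sketch of the combined spider. The ArtistItem/AlbumItem fields and CSS selectors are hypothetical placeholders, and it uses scrapy.linkextractors.LinkExtractor, which replaced SgmlLinkExtractor in modern Scrapy:

# A minimal sketch, not the answer author's exact code. Item fields and
# selectors are made up; adapt them to the real pages being scraped.
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class ArtistItem(scrapy.Item):
    name = scrapy.Field()  # hypothetical field


class AlbumItem(scrapy.Item):
    title = scrapy.Field()  # hypothetical field


class ExampleSpider(CrawlSpider):
    name = 'example'
    start_urls = ['http://www.example.com']

    rules = [
        # One rule, one callback: every matched page goes through parse_item.
        Rule(LinkExtractor(allow=[r'same regex_rule']), callback='parse_item', follow=True),
    ]

    def parse_item(self, response):
        # Yield whichever items the page actually contains; a page with
        # both artist and album data produces two items.
        artist = self.parse_artist(response)
        if artist:
            yield artist
        album = self.parse_album(response)
        if album:
            yield album

    def parse_artist(self, response):
        name = response.css('h1.artist::text').get()  # hypothetical selector
        return ArtistItem(name=name) if name else None

    def parse_album(self, response):
        title = response.css('h2.album::text').get()  # hypothetical selector
        return AlbumItem(title=title) if title else None

Returning None from a sub-parser and filtering it out in parse_item keeps pages with only one kind of data from emitting empty items.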