I have been wondering what the best way is to scrape multi-level data with Scrapy. I will describe the situation in four stages:
Current architecture
First page: list of artists
Second page: list of albums for each artist
Third page: list of songs for each album
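For context, the nested containers used in the pseudocode below (Discography, Artist, Album, Song) are never defined in the question; here is a minimal sketch of how they might look as Scrapy items, with the field names taken from the code and everything else assumed:

import scrapy

# Hypothetical item definitions matching the fields used in the spider below.
class Song(scrapy.Item):
    song_name = scrapy.Field()

class Album(scrapy.Item):
    album_name = scrapy.Field()
    songs = scrapy.Field()        # list of Song items

class Artist(scrapy.Item):
    artist_name = scrapy.Field()
    albums = scrapy.Field()       # list of Album items

class Discography(scrapy.Item):
    artists = scrapy.Field()      # list of Artist items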
from scrapy import Spider, Request
# Discography, Artist, Album, Song are the nested item containers described above.

class MusicLibrary(Spider):
    name = 'MusicLibrary'

    def parse(self, response):
        items = Discography()
        items['artists'] = []
        artists = "some path selector"
        for artist in artists:
            item = Artist()
            item['albums'] = []
            item['artist_name'] = "name"
            items['artists'].append(item)
            album_page_url = "extract link to album and yield that page"
            yield Request(album_page_url,
                          callback=self.parse_album,
                          meta={'item': items,
                                'artist_name': item['artist_name']})

    def parse_album(self, response):
        base_item = response.meta['item']
        artist_name = response.meta['artist_name']
        # Find the artist added in parse() and append each album under that artist.
        artist_index = self.get_artist_index(base_item['artists'], artist_name)
        albums = "some path selector"
        for album in albums:
            item = Album()
            item['songs'] = []
            item['album_name'] = "name"
            base_item['artists'][artist_index]['albums'].append(item)
            song_page_url = "extract link to song and yield that page"
            yield Request(song_page_url,
                          callback=self.parse_song_name,
                          meta={'item': base_item,
                                'key': item['album_name'],
                                'artist_index': artist_index})

    def parse_song_name(self, response):
        base_item = response.meta['item']
        album_name = response.meta['key']
        artist_index = response.meta['artist_index']
        album_index = self.search(base_item['artists'][artist_index]['albums'], album_name)
        songs = "some path selector"
        for song in songs:
            item = Song()
            item['song_name'] = "song name"
            base_item['artists'][artist_index]['albums'][album_index]['songs'].append(item)
            # total_count (total songs to parse): the main artist page lists the total
            # number of songs for each artist.
            # current_count (currently parsed): walk each artist -> album -> songs list
            # and count its length.
            # base_item is yielded only when the scraped count matches the expected total.
            if current_count == total_count:
                yield base_item
Difficulties, and why I think there must be a better option
The format I tried to store the data in, which failed at first and then eventually worked once deployed
Can anyone suggest what I am doing wrong here, or how to make this more efficient and keep it working when some pages return a non-200 status code?
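On the non-200 part of the question, Scrapy requests also accept an errback that is called for non-2xx responses and network failures, so a failed branch can at least be logged rather than silently dropped. A minimal, illustrative sketch (the class name, URL and log messages are assumptions, not taken from the original spider):

from scrapy import Request, Spider
from scrapy.spidermiddlewares.httperror import HttpError

class MusicLibraryWithErrback(Spider):
    # Hypothetical variant of the spider that only shows the errback wiring.
    name = 'MusicLibraryWithErrback'
    start_urls = ['http://example.com/artists']   # placeholder URL

    def parse(self, response):
        album_page_url = "extract link to album and yield that page"
        yield Request(album_page_url,
                      callback=self.parse_album,
                      errback=self.handle_error)

    def parse_album(self, response):
        pass  # album parsing as in the original spider

    def handle_error(self, failure):
        # failure is a twisted Failure; HttpError wraps responses with
        # unexpected status codes.
        if failure.check(HttpError):
            response = failure.value.response
            self.logger.warning("Non-200 status %s for %s", response.status, response.url)
        else:
            self.logger.error("Request failed: %r", failure)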
Answer 0 (score: 0):
The problem with the code above is that the same mutable base_item is passed by reference through response.meta, so all of the concurrent callbacks keep mutating one shared object.
The solution is a simple copy.deepcopy: in each callback method, create a new object from the one carried in response.meta instead of modifying the shared base_item object.
I will try to explain the full answer when I have time.
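A minimal sketch of that idea, reusing parse_album from the question; the only change from the original pseudocode is the copy.deepcopy line:

import copy

# Inside the spider class:
def parse_album(self, response):
    # Work on a private deep copy so this callback never mutates the object
    # that other in-flight requests also carry in their meta.
    base_item = copy.deepcopy(response.meta['item'])
    artist_name = response.meta['artist_name']
    artist_index = self.get_artist_index(base_item['artists'], artist_name)
    albums = "some path selector"
    for album in albums:
        item = Album()
        item['songs'] = []
        item['album_name'] = "name"
        base_item['artists'][artist_index]['albums'].append(item)
        song_page_url = "extract link to song and yield that page"
        yield Request(song_page_url,
                      callback=self.parse_song_name,
                      meta={'item': base_item,
                            'key': item['album_name'],
                            'artist_index': artist_index})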