Question

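I have been wondering what the best way is to scrape multiple levels of data with Scrapy. I will describe the situation in four parts:

- The current architecture I follow to scrape this data
- The basic code structure
- The difficulties, and why I think there must be a better option
- The format in which I tried to store the data, failed, and then succeeded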

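Current architecture

Data structure

First page: list of artists

Second page: list of albums for each artist

Third page: list of songs for each album

Basic code structure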

from scrapy import Request, Spider

# Discography, Artist, Album and Song are item classes defined elsewhere.


class MusicLibrary(Spider):
    name = 'MusicLibrary'

    def parse(self, response):
        items = Discography()
        items['artists'] = []
        artists = "some path selector"
        for artist in artists:
            item = Artist()
            item['albums'] = []
            item['artist_name'] = "name"
            items['artists'].append(item)
            album_page_url = "extract link to album and yield that page"
            yield Request(album_page_url,
                          callback=self.parse_album,
                          meta={'item': items,
                                'artist_name': item['artist_name']})

    def parse_album(self, response):
        base_item = response.meta['item']
        artist_name = response.meta['artist_name']
        # Find the artist added in the previous method so the album can be appended under it
        artist_index = self.get_artist_index(base_item['artists'], artist_name)
        albums = "some path selector"
        for album in albums:
            item = Album()
            item['songs'] = []
            item['album_name'] = "name"
            base_item['artists'][artist_index]['albums'].append(item)
            song_page_url = "extract link to song and yield that page"
            yield Request(song_page_url,
                          callback=self.parse_song_name,
                          meta={'item': base_item,
                                'key': item['album_name'],
                                'artist_index': artist_index})

    def parse_song_name(self, response):
        base_item = response.meta['item']
        album_name = response.meta['key']
        artist_index = response.meta['artist_index']
        album_index = self.search(base_item['artists'][artist_index]['albums'], album_name)
        songs = "some path selector"

        for song in songs:
            item = Song()
            song_name = "song name"
            item['song_name'] = song_name
            base_item['artists'][artist_index]['albums'][album_index]['songs'].append(item)
            # total_count (total songs to parse): the main artist page lists the total number of songs for each artist
            # current_count (songs parsed so far): walk each artist -> album -> songs list and sum the lengths

            # Yield base_item only when the scraped song count matches the number of songs to scrape
            if current_count == total_count:
                yield base_item

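The difficulties, and why I think there must be a better option
- Currently I yield the item object only once all pages and sub-pages have been scraped, on the condition that the number of songs to scrape matches the number of songs scraped.
- But given the nature and volume of the scraping, some pages return a status code other than 200 (OK). Those songs are not scraped, so the item counts never match.
- So in the end, even if 90% of the pages are scraped successfully, nothing is yielded when the counts do not match, and all the CPU power is wasted.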
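
The failure mode described in these points comes from yielding a single aggregate item at the very end. A minimal sketch of one alternative, assuming each level only needs the names collected above it: carry the artist and album names downward and yield one flat row per song, so a failed page costs only its own songs. `SITE`, `fetch` and `crawl` are hypothetical stand-ins for the real site and for Scrapy's request/callback chain; none of them is Scrapy API.

```python
# Hypothetical in-memory "site": page URL -> (status, links or song names).
SITE = {
    "/artists": (200, {"Artist A": "/a", "Artist B": "/b"}),
    "/a": (200, {"Album aa": "/a/aa"}),
    "/b": (200, {"Album bb": "/b/bb"}),
    "/a/aa": (200, ["Song 1", "Song 2"]),
    "/b/bb": (500, []),  # this page fails, as some pages do in a real crawl
}


def fetch(url):
    """Stand-in for a download: return the payload, or None on a non-200 status."""
    status, payload = SITE[url]
    return payload if status == 200 else None


def crawl(start_url):
    """Yield one flat row per song as soon as its page has been parsed."""
    artists = fetch(start_url) or {}
    for artist_name, artist_url in artists.items():
        albums = fetch(artist_url) or {}
        for album_name, album_url in albums.items():
            songs = fetch(album_url)
            if songs is None:
                continue  # only this album's songs are lost
            for song_name in songs:
                yield {"artist": artist_name, "album": album_name, "song": song_name}


rows = list(crawl("/artists"))
```

Here the failing `/b/bb` page removes only Artist B's songs from `rows`; the rows for Artist A are still produced, and no count comparison is needed before yielding.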
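The format in which I tried to store the data, failed, and then succeeded
- I want the data for each item object in a single-row format, i.e. artistName-albumName-songName. So if artist A has 1 album (aa) with 8 songs, there would be 8 items, one entry (item) per song.
- But when I tried yielding in the last function, "parse_song_name", with the current format, it yielded this complex nested structure every time, and the object grew each time.
- Then I thought that appending everything, first Discography -> Artist, then Artist -> Album, then Album -> Songs, was the problem. But when I removed the appends and tried without them, only one object was yielded, the last one, not all of them.
- So finally I developed the workaround described above, but it does not work every time (namely when a page returns a non-200 status code).
- When it does work, after yielding I wrote a pipeline in which I parse this JSON again and store it in the data format I originally wanted (one row per song, a flat structure).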
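
For illustration, the flattening step that such a pipeline performs could look roughly like the sketch below. It assumes the nested item behaves like a dict with the `artists`, `albums`, `songs`, `artist_name` and `album_name` keys used in the spider; the `song_name` key and the `flatten` helper are assumptions made for this example.

```python
def flatten(discography):
    """Turn the nested artists -> albums -> songs item into one row per song."""
    for artist in discography["artists"]:
        for album in artist["albums"]:
            for song in album["songs"]:
                yield (artist["artist_name"], album["album_name"], song["song_name"])


# Artist A with one album (aa) and eight songs, as in the example above.
nested = {
    "artists": [
        {
            "artist_name": "A",
            "albums": [
                {
                    "album_name": "aa",
                    "songs": [{"song_name": f"song {i}"} for i in range(1, 9)],
                },
            ],
        }
    ]
}

# One "artistName-albumName-songName" string per song.
rows = ["-".join(row) for row in flatten(nested)]
```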

Can anyone suggest what I am doing wrong here, or how to make this more efficient and make it work even when some pages return a non-200 code?

Answer 1

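The problem with the code above is:

Mutable objects (list, dict): all the callbacks were mutating the same object in every loop, so the first- and second-level data was being overwritten in the last (third) loop (mp3_son_url). That is why my attempt failed.

The solution is to use a simple copy.deepcopy: in each callback, create a new object from the one in response.meta instead of mutating the shared base_item object.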
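
A minimal sketch of the difference, using plain dicts rather than Scrapy objects. The `meta` dict here stands in for `response.meta`, and `callback_shared` and `callback_copied` are hypothetical names for a callback written without and with the copy.

```python
import copy


def callback_shared(meta, album_name):
    """Buggy variant: mutates the object that every request shares."""
    item = meta["item"]
    item["album_name"] = album_name
    return item


def callback_copied(meta, album_name):
    """Fixed variant: works on a private deep copy of the object in meta."""
    item = copy.deepcopy(meta["item"])
    item["album_name"] = album_name
    return item


meta = {"item": {"artist_name": "A", "albums": []}}
shared = [callback_shared(meta, name) for name in ("aa", "bb")]
# Both entries are the same dict, so the first album name was overwritten.

meta = {"item": {"artist_name": "A", "albums": []}}
copied = [callback_copied(meta, name) for name in ("aa", "bb")]
# Each entry is independent, so both album names survive.
```

A deep copy is used rather than `copy.copy` because a shallow copy would still share the nested lists (such as `albums`) between callbacks.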

I will try to explain the full answer.
