Fetching all comments and replies with the YouTube Data API

Date: 2020-10-09 07:00:18

Tags: python dataframe youtube web-crawler youtube-data-api

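I have been searching desperately for a way to retrieve all comments and their corresponding replies for my research. Building a data frame that holds the comment data in the correct, corresponding order has proven very difficult.

I am sharing my code here so that you professionals can take a look and give me some insight.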

def get_video_comments(service, **kwargs):
    comments = []
    results = service.commentThreads().list(**kwargs).execute()

    while results:
        for item in results['items']:
            comment = item['snippet']['topLevelComment']['snippet']['textDisplay']
            comment2 = item['snippet']['topLevelComment']['snippet']['publishedAt']
            comment3 = item['snippet']['topLevelComment']['snippet']['authorDisplayName']
            comment4 = item['snippet']['topLevelComment']['snippet']['likeCount']
            if 'replies' in item.keys():
                for reply in item['replies']['comments']:
                    rauthor = reply['snippet']['authorDisplayName']
                    rtext = reply['snippet']['textDisplay']
                    rtime = reply['snippet']['publishedAt']
                    rlike = reply['snippet']['likeCount']
                    data = {'Reply ID': [rauthor], 'Reply Time': [rtime], 'Reply Comments': [rtext], 'Reply Likes': [rlike]}
                    print(rauthor)
                    print(rtext)
            data = {'Comment':[comment],'Date':[comment2],'ID':[comment3], 'Likes':[comment4]}
            result = pd.DataFrame(data)
            result.to_csv('youtube.csv', mode='a',header=False)
            print(comment)
            print(comment2)
            print(comment3)
            print(comment4)
            print('==============================')
            comments.append(comment)
                
        # Check if another page exists
        if 'nextPageToken' in results:
            kwargs['pageToken'] = results['nextPageToken']
            results = service.commentThreads().list(**kwargs).execute()
        else:
            break

    return comments

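When I run this, my crawler collects the comments, but it does not collect some of the replies under certain comments.

How can I make it collect the comments together with their corresponding replies and put them in a single data frame?

Update

I somehow managed to get the information I need printed in the output section of Jupyter Notebook. All I have to do now is append the results to a data frame.

Here is my updated code: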

def get_video_comments(service, **kwargs):
    comments = []
    results = service.commentThreads().list(**kwargs).execute()

    while results:
        for item in results['items']:
            comment = item['snippet']['topLevelComment']['snippet']['textDisplay']
            comment2 = item['snippet']['topLevelComment']['snippet']['publishedAt']
            comment3 = item['snippet']['topLevelComment']['snippet']['authorDisplayName']
            comment4 = item['snippet']['topLevelComment']['snippet']['likeCount']
            if 'replies' in item.keys():
                for reply in item['replies']['comments']:
                    rauthor = reply['snippet']['authorDisplayName']
                    rtext = reply['snippet']['textDisplay']
                    rtime = reply['snippet']['publishedAt']
                    rlike = reply['snippet']['likeCount']
                    print(rtext)
                    print(rtime)
                    print(rauthor)
                    print('Likes: ', rlike)
                    
            print(comment)
            print(comment2)
            print(comment3)
            print("Likes: ", comment4)

            print('==============================')
            comments.append(comment)
                
        # Check if another page exists
        if 'nextPageToken' in results:
            kwargs['pageToken'] = results['nextPageToken']
            results = service.commentThreads().list(**kwargs).execute()
        else:
            break

    return comments

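The result is here (link to the output in the original post).

As you can see, each group delimited by a ======== line consists of a comment and its corresponding replies.

What would be a good way to append the results to a data frame?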

1 answer:

Answer 0: (score: 1)

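According to the official documentation, the property replies.comments[] of the CommentThreads resource has the following specification:

replies.comments[] (list)
A list of one or more replies to the top-level comment. Each item in the list is a comment resource.

The list contains a limited number of replies, and unless the number of items in the list equals the value of the snippet.totalReplyCount property, the list of replies is only a subset of the total number of replies available for the top-level comment. To retrieve all of the replies for the top-level comment, you need to call the Comments.list method and use the parentId request parameter to identify the comment for which you want to retrieve replies.

Consequently, to obtain all reply entries associated with a given top-level comment, you have to use the Comments.list API endpoint, queried appropriately.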
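The check described in the quoted documentation can be sketched as a small helper. This is only an illustration: the function name has_all_replies and the hand-written sample dictionaries are assumptions made here, shaped like the CommentThreads resource fields named above, and no API call is involved.

```python
def has_all_replies(thread):
    """Return True if the thread's embedded reply list is complete.

    `thread` is assumed to be a dict shaped like a CommentThreads
    resource that was requested with the 'snippet' and 'replies' parts.
    """
    total = thread['snippet']['totalReplyCount']
    # The 'replies' key is missing when no replies are embedded.
    embedded = thread.get('replies', {}).get('comments', [])
    return len(embedded) == total


# Hand-written sample data (hypothetical, not real API output).
complete = {
    'snippet': {'totalReplyCount': 1},
    'replies': {'comments': [{'id': 'r1'}]},
}
partial = {
    'snippet': {'totalReplyCount': 7},
    'replies': {'comments': [{'id': 'r1'}, {'id': 'r2'}]},
}
no_replies = {'snippet': {'totalReplyCount': 0}}

for sample in (complete, partial, no_replies):
    print(has_all_replies(sample))
```

Only threads for which this check fails would need the extra Comments.list call, which keeps the number of additional requests down.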

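I recommend reading my answer to a very much related question; it has three sections:

  • Top-level comments and associated replies,
  • The property nextPageToken and the parameter pageToken, and
  • API limitations imposed by design.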

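First of all, you have to acknowledge that the API (as currently implemented) does not allow you to obtain all top-level comments associated with a given video when the number of those comments exceeds a certain (unspecified) upper bound.

As for the Python implementation, I would suggest structuring your code as follows: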

def get_video_comments(service, video_id):
    request = service.commentThreads().list(
        videoId = video_id,
        part = 'id,snippet,replies',
        maxResults = 50
    )
    comments = []

    while request:
        response = request.execute()

        for comment in response['items']:
            reply_count = comment['snippet'] \
                ['totalReplyCount']
            replies = comment.get('replies')
            if replies is not None and \
               reply_count != len(replies['comments']):
               replies['comments'] = get_comment_replies(
                   service, comment['id'])

            # 'comment' is a 'CommentThreads Resource' that has its
            # 'replies.comments' an array of 'Comments Resource'

            # Do fill in the 'comments' data structure 
            # to be provided by this function:
            ...

        request = service.commentThreads().list_next(
            request, response)

    return comments


def get_comment_replies(service, comment_id):
    request = service.comments().list(
        parentId = comment_id,
        part = 'id,snippet',
        maxResults = 50
    )
    replies = []

    while request:
        response = request.execute()
        replies.extend(response['items'])
        request = service.comments().list_next(
            request, response)

    return replies

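Note that the ellipsis (...) above has to be replaced with actual code that fills in the array of structures that get_video_comments returns to its caller.

The simplest way (useful for quick testing) is to replace ... with comments.append(comment) and then have the caller of get_video_comments simply pretty-print (using json.dump) the object obtained from that function.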
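For the original goal of a single data frame, one possible next step is to flatten the returned threads into one row per comment, with each reply placed right after its parent. The sketch below is an illustration under assumptions: the names comment_row and flatten_threads, the column names, and the sample data are made up here; the snippet fields (textDisplay, publishedAt, authorDisplayName, likeCount) are the ones used in the question's code, and each comment resource is assumed to carry an id.

```python
def comment_row(comment, parent_id, kind):
    """Turn one comment resource (a dict) into a flat row."""
    snippet = comment['snippet']
    return {
        'comment_id': comment['id'],
        'parent_id': parent_id,  # None for top-level comments
        'kind': kind,            # 'comment' or 'reply'
        'author': snippet['authorDisplayName'],
        'published_at': snippet['publishedAt'],
        'likes': snippet['likeCount'],
        'text': snippet['textDisplay'],
    }


def flatten_threads(threads):
    """Return one row per comment, replies directly after their parent."""
    rows = []
    for thread in threads:
        top = thread['snippet']['topLevelComment']
        rows.append(comment_row(top, None, 'comment'))
        # 'replies' is missing for threads without embedded replies.
        for reply in thread.get('replies', {}).get('comments', []):
            rows.append(comment_row(reply, top['id'], 'reply'))
    return rows


def fake_comment(comment_id, author, text, likes):
    """Build hypothetical sample data shaped like a comment resource."""
    return {
        'id': comment_id,
        'snippet': {
            'authorDisplayName': author,
            'publishedAt': '2020-10-01T12:00:00Z',
            'likeCount': likes,
            'textDisplay': text,
        },
    }


threads = [
    {
        'id': 'c1',
        'snippet': {
            'totalReplyCount': 1,
            'topLevelComment': fake_comment('c1', 'Ann', 'First!', 3),
        },
        'replies': {'comments': [fake_comment('c1.r1', 'Bob', 'Hi Ann', 0)]},
    },
    {
        'id': 'c2',
        'snippet': {
            'totalReplyCount': 0,
            'topLevelComment': fake_comment('c2', 'Cy', 'Nice video', 1),
        },
    },
]

rows = flatten_threads(threads)
for row in rows:
    print(row)
```

Assuming pandas is installed, pd.DataFrame(flatten_threads(comments)) would turn these rows into a data frame, and the parent_id column keeps each reply tied to its top-level comment even after sorting or filtering.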