Scraping comments on a news site with Python. Want to get comments hidden under "show more"

Time: 2020-08-01 22:00:32

Tags: python web-scraping

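I am trying to scrape the comments under news articles in order to build a language model from them.

I have managed to scrape the comment section, but I run into trouble when comments are hidden behind a "show more comments" button. The reference site is in Icelandic and has two kinds of "show more" buttons.

The first loads X additional comments; in Icelandic: Hlaða X ummæli að auki

The second loads more replies within a given comment thread; in Icelandic: Sýna 10 svör að auki í þessum þræði

This is my code at the moment. Any tips on this problem would be highly appreciated!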

import json
import re

import lxml.html
import requests
from jsonfinder import jsonfinder

url = 'https://www.visir.is/g/20201996612d?fbclid=IwAR2wg5dBj0ZyjmQbJBDwyOx1PNS1spS2bYAXEQmomcOa93Hsfe_8SE_Hrxo'

pattern = re.compile("ReactRenderer")

FB_COMMENT_PLUGIN_URL = "https://www.facebook.com/plugins/feedback.php"

r = requests.get(url)
root = lxml.html.fromstring(r.text)

# pick up the api_key:
api_key = root.xpath('/html/head/meta[@property="fb:app_id"][1]/@content')[0]
og_url = root.xpath('/html/head/meta[@property="og:url"][1]/@content')[0]
print("Api-key:", api_key)
print("Og-url:", og_url)
print()
payload = {"api_key": api_key, "href": og_url}

r = requests.get(FB_COMMENT_PLUGIN_URL, params=payload)
print(r.url)
print()
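# Scan every JSON object embedded in the plugin page for the ReactRenderer
# entry that carries the comment data.
comments_json = {}  # stays empty if nothing matches, so the code below cannot hit a NameError
for _start, _end, obj in jsonfinder(r.text):
    if obj is None or "require" not in obj:
        continue
    for x in obj["require"]:
        if pattern.search(str(x)):
            comments_json = x[3][0]['props']['comments']['idMap']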


results = {'url': url, 'title': '', 'comments': {}}

# idMap mixes several entry types; each type is recognised by its exact set of keys.
keys_for_title: list = ['id', 'name', 'uri', 'type']
keys_for_comments: list = ['id', 'authorID', 'body', 'ranges', 'timestamp', 'targetID', 'ogURL', 'likeCount', 'hasLiked', 'canLike', 'canEdit', 'hidden', 'highlightedWords', 'reportURI', 'spamCount', 'canEmbed', 'type']

increment = 1
for key, value in comments_json.items():

    # The article entry has exactly the title keys.
    if set(value.keys()) == set(keys_for_title):
        results["title"] = value['name']

    # A comment entry has exactly the comment keys.
    if set(value.keys()) == set(keys_for_comments):
        comments: dict = {}
        comments['text'] = value['body']['text']
        comments['likes'] = value['likeCount']

        # The author is a separate idMap entry whose id equals the comment's authorID.
        authorID = value['authorID']
        for k, v in comments_json.items():
            if v.get('id') == authorID:
                comments['name'] = v['name']

        results['comments'][increment] = comments
        increment += 1

if comments_json:
    print(json.dumps(results, indent=4, ensure_ascii=False))

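1 answer:

Answer 0: (score: 0)

The problem with your approach is that it cannot execute client-side JavaScript, which is what normally runs when you click a "show more..." button in a regular browser. To get past this, you need to add something that can handle that JavaScript. A common way to do this is to load the page with a browser automation framework such as Selenium and then simulate the button clicks.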
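
A minimal sketch of that idea follows, under several assumptions that are not confirmed by the question: that the Facebook comment plugin URL (the `r.url` printed by the question's code) can be opened directly in a Selenium-driven browser, that the "show more" controls are `button`, `a`, or `role="button"` elements, and that their visible text matches the two Icelandic labels quoted in the question. The names `is_show_more`, `expand_all_comments`, and `fetch_expanded_html` are hypothetical helpers introduced only for illustration.

```python
import re
import time

# Assumption: both button labels from the question start with "Hlaða" or "Sýna"
# and contain "að auki"; the number in between varies from button to button.
SHOW_MORE = re.compile(r"(Hlaða|Sýna)\b.*\bað auki")

# Assumption: the clickable elements are buttons, links, or role="button" nodes.
CANDIDATES_XPATH = (
    "//*[self::button or self::a or @role='button']"
    "[contains(normalize-space(.), 'að auki')]"
)


def is_show_more(label: str) -> bool:
    """Return True if the visible text looks like a 'show more' button."""
    return bool(SHOW_MORE.search(label.strip()))


def expand_all_comments(driver, max_attempts: int = 200, pause: float = 1.0) -> int:
    """Click 'show more' buttons until none remain; return the number of clicks."""
    # Imported lazily so the pure helper above works without Selenium installed.
    from selenium.common.exceptions import WebDriverException
    from selenium.webdriver.common.by import By

    clicks = 0
    for _ in range(max_attempts):  # hard cap so the loop always terminates
        try:
            buttons = [
                b for b in driver.find_elements(By.XPATH, CANDIDATES_XPATH)
                if b.is_displayed() and is_show_more(b.text)
            ]
            if not buttons:
                break
            buttons[0].click()
            clicks += 1
        except WebDriverException:
            # Stale or covered element: the DOM changed, so look it up again.
            pass
        time.sleep(pause)  # give the page time to load the new comments
    return clicks


def fetch_expanded_html(plugin_url: str) -> str:
    """Open the plugin page, expand every thread, and return the rendered HTML."""
    from selenium import webdriver

    driver = webdriver.Firefox()  # any installed browser driver would do
    try:
        driver.get(plugin_url)
        expand_all_comments(driver)
        return driver.page_source
    finally:
        driver.quit()
```

Calling `fetch_expanded_html(r.url)` with the plugin URL from the question would return the HTML after all clicks. Note that comments loaded this way live in the rendered DOM rather than in the initial embedded JSON, so they would probably have to be parsed out of that HTML (for example with lxml) instead of with `jsonfinder`.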