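I am trying to scrape the comments under news articles in order to build a language model from them.
I have successfully scraped the comments section, but I run into trouble when comments are hidden behind a "show more comments" button. Here is the site for reference; it is in Icelandic, and it has two types of "show more" buttons.
First, a button that loads X additional comments, in Icelandic: Hlaða X ummæli að auki.
Second, a button that loads X more replies within a given comment thread, in Icelandic: Sýna 10 svör að auki í þessum þræði.
This is my code atm. Any tips on this problem would be highly appreciated!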
import requests
from jsonfinder import jsonfinder
import json
import lxml.html
import re
from bs4 import BeautifulSoup
url = 'https://www.visir.is/g/20201996612d?fbclid=IwAR2wg5dBj0ZyjmQbJBDwyOx1PNS1spS2bYAXEQmomcOa93Hsfe_8SE_Hrxo'
pattern = re.compile("ReactRenderer")
FB_COMMENT_PLUGIN_URL = "https://www.facebook.com/plugins/feedback.php"
r = requests.get(url)
root = lxml.html.fromstring(r.text)
# pick up the api_key:
api_key = root.xpath('/html/head/meta[@property="fb:app_id"][1]/@content')[0]
og_url = root.xpath('/html/head/meta[@property="og:url"][1]/@content')[0]
print("Api-key:", api_key)
print("Og-url:", og_url)
print()
payload = {"api_key": api_key, "href": og_url}
r = requests.get(FB_COMMENT_PLUGIN_URL, params=payload)
print(r.url)
print()
comments_json = {}  # stays empty if no matching payload is found
for _start, _end, obj in jsonfinder(r.text):
    if obj is None:
        continue
    if "require" in obj:
        for x in obj["require"]:
            # the comment data sits in the entry that mentions ReactRenderer
            if pattern.search(str(x)):
                comments_json = x[3][0]['props']['comments']['idMap']
results = {'url': url, 'title': '', 'comments': {}}
keys_for_title: list = ['id', 'name', 'uri', 'type']
keys_for_comments: list = ['id', 'authorID', 'body', 'ranges', 'timestamp', 'targetID', 'ogURL', 'likeCount', 'hasLiked', 'canLike', 'canEdit', 'hidden', 'highlightedWords', 'reportURI', 'spamCount', 'canEmbed', 'type']
increment = 1
for key, value in comments_json.items():
    # We match each entry's keys against a known pattern of keys to find each section
    if all(item in value.keys() for item in keys_for_title) and len(value.keys()) == len(keys_for_title):
        results["title"] = value['name']
    if all(item in value.keys() for item in keys_for_comments) and len(value.keys()) == len(keys_for_comments):
        comments: dict = {}
        comments['text'] = value['body']['text']
        comments['likes'] = value['likeCount']
        authorID = value['authorID']
        # look up the author's name among the other entries of the id map
        for k, v in comments_json.items():
            if v['id'] == authorID:
                comments['name'] = v['name']
        results['comments'][increment] = comments
        increment += 1
if comments_json:
    print(json.dumps(results, indent=4, ensure_ascii=False))
Answer 0 (score: 0)
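The problem with your approach is that it cannot execute client-side JavaScript, which is what normally runs when you click a "show more..." button in a regular browser. To get past this, you need to add something that can handle that JavaScript. A common way to do this is to load the page with a browser automation framework such as Selenium and then simulate the button clicks.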
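A minimal sketch of that idea follows. It rests on several assumptions that are not confirmed by the question: that the comments render inside a Facebook plugin iframe, that both kinds of button carry a label containing the Icelandic phrase "að auki", and that Firefox is the browser at hand. The names `click_show_more`, `load_expanded_comments`, `SHOW_MORE_XPATH` and `IFRAME_SELECTOR` are made up for illustration, and the selectors will likely need adjusting against the live markup.

```python
import time

# Assumption: both "show more" buttons contain the text "að auki".
SHOW_MORE_XPATH = "//*[self::button or self::a or @role='button'][contains(., 'að auki')]"
# Assumption: the comments are rendered inside a Facebook plugin iframe.
IFRAME_SELECTOR = "iframe[src*='facebook.com/plugins']"


def click_show_more(driver, xpath, max_clicks=50, pause=1.0):
    """Click elements matching `xpath` until none are left or `max_clicks` is hit.

    The page is re-queried after every click so that no stale element
    reference is reused. Returns the number of clicks performed.
    """
    clicks = 0
    while clicks < max_clicks:
        # "xpath" is the string value of selenium's By.XPATH
        buttons = driver.find_elements("xpath", xpath)
        if not buttons:
            break
        buttons[0].click()
        clicks += 1
        time.sleep(pause)  # crude wait for the newly loaded comments
    return clicks


def load_expanded_comments(url, timeout=15):
    """Open `url` in a real browser, expand all comments, return the iframe HTML."""
    # Imported here so click_show_more can be exercised without a browser installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Firefox()  # any other Selenium-supported browser works too
    try:
        driver.get(url)
        # Wait for the comment iframe to appear and switch into it.
        WebDriverWait(driver, timeout).until(
            EC.frame_to_be_available_and_switch_to_it((By.CSS_SELECTOR, IFRAME_SELECTOR))
        )
        click_show_more(driver, SHOW_MORE_XPATH)
        return driver.page_source
    finally:
        driver.quit()
```

The HTML returned by `load_expanded_comments(url)` can then be parsed with lxml or BeautifulSoup, which the question's script already imports. The fixed `pause` is the simplest possible wait; an explicit wait on the comment count would be more robust, at the cost of knowing more about the page's markup.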