My web scraper doesn't scrape all the comments and usernames

Time: 2020-09-19 04:35:58

Tags: python beautifulsoup python-requests

I wrote some code to scrape all the comments and usernames from a reddit post, but it does not scrape everything.

What could be the problem?

Here is my code:

import requests
from bs4 import BeautifulSoup

listt = []
count = 0
username_list = []
comment_list = []

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36"}
url = input("Please input reddit url:")
page = requests.get(url, headers=headers)

old_page_url = "https://old"+url[11:]
old_page = requests.get(old_page_url,headers=headers)


soup = BeautifulSoup(page.text,"html.parser")
old_soup = BeautifulSoup(old_page.text,"html.parser")

comments = soup.findAll('div',{'data-test-id':'comment'})

for one_comment in comments:
    comment_list.append(one_comment.text)

for name in old_soup.find_all("a"):
    listt.append(name.text)


for item in listt:
    if item == '[–]':
        username_list.append(listt[count+1])
    count+=1


for i in range(len(comment_list)):
    print(f"Comment made by u/{username_list[i]} = {comment_list[i]}")


1 answer:

Answer 0: (score: 0)

My guess is that the other comments only load as you scroll down?

Either way, if you need to get all the comments and usernames, use the Reddit API itself:

https://www.reddit.com/comments/{thread-id}.json

For example: https://www.reddit.com/comments/iv5jaa.json will show all the comments from https://www.reddit.com/r/DeepRockGalactic/comments/iv5jaa/.

Work with it using a JSON parser :).
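To illustrate the JSON approach, here is a minimal sketch rather than a definitive implementation. It assumes the endpoint returns a two-element list (the post listing, then the comment listing), that comments are entries with kind "t1", and that replies are nested listings. The function names `walk_comments`, `fetch_thread` and `print_thread` and the User-Agent string are made up for this example. Note that entries with kind "more" are placeholders for comments the response did not include, so large threads may still need extra requests.

```python
def walk_comments(listing):
    """Yield (author, body) for every comment in a Listing, depth-first."""
    # A comment with no replies has "replies" set to an empty string,
    # so anything that is not a dict is treated as "nothing to walk".
    if not isinstance(listing, dict):
        return
    for child in listing.get("data", {}).get("children", []):
        # "t1" marks a comment; "more" entries are placeholders for
        # comments that were not included in this response.
        if child.get("kind") != "t1":
            continue
        data = child["data"]
        yield data.get("author"), data.get("body")
        yield from walk_comments(data.get("replies"))


def fetch_thread(thread_id):
    """Download the JSON for one thread (hypothetical helper)."""
    # Imported here so walk_comments can be used without requests installed.
    import requests

    url = f"https://www.reddit.com/comments/{thread_id}.json"
    # Placeholder User-Agent; pick a descriptive one for your own script.
    headers = {"User-Agent": "comment-scraper-example/0.1"}
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()
    return response.json()


def print_thread(thread_id):
    """Print every comment in the thread in the question's output format."""
    # Assumed layout: element 0 is the post, element 1 holds the comments.
    comment_listing = fetch_thread(thread_id)[1]
    for author, body in walk_comments(comment_listing):
        print(f"Comment made by u/{author} = {body}")
```

Calling `print_thread("iv5jaa")` would print the comments of the thread linked above. Because each username comes from the same JSON object as its comment text, the two can no longer get out of step the way two separately scraped lists can.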