I am trying to parse the comments from this site link: I need to get 1000 comments, but by default it shows only 10. After clicking "View more" I cannot find a way to fetch the content that appears on the page. I currently have the following code:
import urllib.request
from bs4 import BeautifulSoup
import sys
non_bmp_map = dict.fromkeys(range(0x10000, sys.maxunicode + 1), 0xfffd)
response = urllib.request.urlopen("https://www.mygov.in/group-issue/share-your-ideas-pm-narendra-modis-mann-ki-baat-26th-march-2017/")
srcode = response.read()
soup = BeautifulSoup(srcode, "html.parser")
all_comments_div = soup.find_all('div', class_="comment_body")
all_comments = []
for div in all_comments_div:
    all_comments.append(div.find('p').text.translate(non_bmp_map))
print(all_comments)
print(len(all_comments))
Answer 0 (score: 1)
You can use a while loop to fetch the next page (i.e., keep going while there is a next page and the total number of comments is below 1000):
import urllib.request
from bs4 import BeautifulSoup
import sys
non_bmp_map = dict.fromkeys(range(0x10000, sys.maxunicode + 1), 0xfffd)
all_comments = []
max_comments = 1000
base_url = 'https://www.mygov.in'  # no trailing slash: the pager hrefs already start with '/'
next_page = base_url + '/group-issue/share-your-ideas-pm-narendra-modis-mann-ki-baat-26th-march-2017/'
while next_page and len(all_comments) < max_comments:
    response = urllib.request.urlopen(next_page)
    srcode = response.read()
    soup = BeautifulSoup(srcode, "html.parser")
    all_comments_div = soup.find_all('div', class_="comment_body")
    for div in all_comments_div:
        all_comments.append(div.find('p').text.translate(non_bmp_map))
    next_page = soup.find('li', class_='pager-next first last')
    if next_page:
        next_page = base_url + next_page.find('a').get('href')
    print('comments: {}'.format(len(all_comments)))

print(all_comments)
print(len(all_comments))
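When joining the base URL with the href returned by the pager, `urllib.parse.urljoin` is safer than plain string concatenation, because it normalizes leading/trailing slashes. A small sketch (the href value is taken from the question's page):

```python
from urllib.parse import urljoin

base_url = 'https://www.mygov.in/'
href = '/group-issue/share-your-ideas-pm-narendra-modis-mann-ki-baat-26th-march-2017/'

# urljoin collapses the duplicate slash that 'base_url + href' would leave behind
print(urljoin(base_url, href))
# -> https://www.mygov.in/group-issue/share-your-ideas-pm-narendra-modis-mann-ki-baat-26th-march-2017/
```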
Answer 1 (score: 1)
The new comments are loaded via ajax; we need to parse the JSON response and then feed its html payload to bs, i.e.:
import json
import requests
import sys
from bs4 import BeautifulSoup

how_many_pages = 5  # how many comment pages do you want to parse?
non_bmp_map = dict.fromkeys(range(0x10000, sys.maxunicode + 1), 0xfffd)
all_comments = []

for x in range(how_many_pages):
    # note: mygov.in seems very slow...
    json_data = requests.get(
        "https://www.mygov.in/views/ajax/?view_name=view_comments&view_display_id=block_2"
        "&view_args=267721&view_path=node%2F267721&view_base_path=comment_pdf_export"
        "&view_dom_id=f3a7ae636cabc2c47a14cebc954a2ff0&pager_element=1"
        "&sort_by=created&sort_order=DESC&page=0,{}".format(x)).content
    d = json.loads(json_data.decode())  # remove .decode() for Python < 3
    print(len(d))
    if len(d) == 3:  # sometimes the json length is 3
        comments = d[2]['data']  # 'data' is the key that contains the comments html
    elif len(d) == 2:  # other times just 2...
        comments = d[1]['data']
    # From here, we can use your BeautifulSoup code.
    soup = BeautifulSoup(comments, "html.parser")
    all_comments_div = soup.find_all('div', class_="comment_body")
    for div in all_comments_div:
        all_comments.append(div.find('p').text.translate(non_bmp_map))

print(all_comments)
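Since the ajax response is a list of commands whose length varies (2 or 3 elements above), a more defensive approach is to scan the list for whichever entry carries the 'data' key instead of hard-coding indices. A small sketch (`extract_data` is a hypothetical helper, not part of the original answer):

```python
def extract_data(payload):
    # Return the html 'data' field from a Drupal-style ajax command list,
    # regardless of where it sits in the list; None if no entry has it.
    for command in payload:
        if isinstance(command, dict) and 'data' in command:
            return command['data']
    return None

# A payload shaped like the responses handled above:
print(extract_data([{'command': 'settings'},
                    {'command': 'insert', 'data': '<div>hi</div>'}]))
# -> <div>hi</div>
```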
Output:
["Sir my humble submission is that please ask public not to man handle doctors because they work in a very delicate situation, to save a patient is not always in his hand. The incidents of manhandling doctors is increasing day by day and it's becoming very difficult to work in these situatons. Majority are not Opting for medical profession,..."]
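If the goal is exactly 1000 comments, the number of ajax pages can be derived from the page size (10 per page by default, per the question) rather than guessing `how_many_pages`. A minimal sketch:

```python
max_comments = 1000
comments_per_page = 10  # default page size mentioned in the question

# ceiling division: enough pages to cover max_comments
pages_needed = -(-max_comments // comments_per_page)
print(pages_needed)  # 100

# after scraping, trim any overshoot:
# all_comments = all_comments[:max_comments]
```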