How to get data from a website with a "View More" option using the BeautifulSoup library in Python

Time: 2017-04-22 14:31:06

Tags: python web-scraping beautifulsoup

I am trying to parse the comments from this website link: https://www.mygov.in/group-issue/share-your-ideas-pm-narendra-modis-mann-ki-baat-26th-march-2017/

I need to get 1000 comments; by default only 10 are shown. After clicking "View More", I cannot find a way to get the content that then appears on the page.

I currently have the following code:

import urllib.request
from bs4 import BeautifulSoup
import sys

# Map every character outside the Basic Multilingual Plane (e.g. emoji) to U+FFFD
non_bmp_map = dict.fromkeys(range(0x10000, sys.maxunicode + 1), 0xfffd)

response = urllib.request.urlopen("https://www.mygov.in/group-issue/share-your-ideas-pm-narendra-modis-mann-ki-baat-26th-march-2017/")
srcode = response.read()
soup = BeautifulSoup(srcode, "html.parser")

# Each comment sits in a <div class="comment_body"> with its text in a <p>
all_comments_div = soup.find_all('div', class_="comment_body")

all_comments = []
for div in all_comments_div:
    all_comments.append(div.find('p').text.translate(non_bmp_map))

print(all_comments)
print(len(all_comments))
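
For context, non_bmp_map is a translate table that replaces every character above U+FFFF (astral-plane characters such as emoji) with the replacement character U+FFFD, presumably to sidestep encoding errors when printing on consoles with narrow encodings. A minimal sketch of the effect (the sample string is just an illustration):

import sys

non_bmp_map = dict.fromkeys(range(0x10000, sys.maxunicode + 1), 0xfffd)

# The thumbs-up emoji (U+1F44D) lies outside the BMP, so it becomes U+FFFD
print("Great idea \U0001F44D".translate(non_bmp_map))  # -> Great idea \ufffd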

2 Answers:

Answer 0 (score: 1):

You can use a while loop to fetch the next page (i.e., loop while there is a next page and fewer than 1000 comments have been collected):

import urllib.request
from bs4 import BeautifulSoup
import sys

non_bmp_map = dict.fromkeys(range(0x10000, sys.maxunicode + 1), 0xfffd)
all_comments = []
max_comments = 1000
base_url = 'https://www.mygov.in'
next_page = base_url + '/group-issue/share-your-ideas-pm-narendra-modis-mann-ki-baat-26th-march-2017/'

while next_page and len(all_comments) < max_comments:
    response = urllib.request.urlopen(next_page)
    srcode = response.read()
    soup = BeautifulSoup(srcode, "html.parser")

    all_comments_div = soup.find_all('div', class_="comment_body")
    for div in all_comments_div:
        all_comments.append(div.find('p').text.translate(non_bmp_map))

    # The pager's "next" link, when present, points to the next page of comments
    next_page = soup.find('li', class_='pager-next first last')
    if next_page:
        next_page = base_url + next_page.find('a').get('href')
    print('comments: {}'.format(len(all_comments)))

print(all_comments)
print(len(all_comments))
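
One small robustness note (my suggestion, not part of the original answer): concatenating base_url with the href only works while the pager emits root-relative links; urllib.parse.urljoin handles relative, root-relative and absolute hrefs uniformly. A sketch of the same pager step:

from urllib.parse import urljoin

pager = soup.find('li', class_='pager-next first last')
if pager:
    # urljoin copes with 'page2', '/path/page2' and 'https://...' alike
    next_page = urljoin(base_url, pager.find('a').get('href'))
else:
    next_page = None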

Answer 1 (score: 1):

The new comments are loaded via ajax; we need to parse the JSON response first and then feed its HTML to BeautifulSoup, i.e.:

import json
import requests
import sys
from bs4 import BeautifulSoup

how_many_pages = 5  # how many comment pages do you want to parse?
non_bmp_map = dict.fromkeys(range(0x10000, sys.maxunicode + 1), 0xfffd)
all_comments = []

# The ajax endpoint the "View More" button calls; page=0,{} is the pager index
ajax_url = ("https://www.mygov.in/views/ajax/?view_name=view_comments"
            "&view_display_id=block_2&view_args=267721"
            "&view_path=node%2F267721&view_base_path=comment_pdf_export"
            "&view_dom_id=f3a7ae636cabc2c47a14cebc954a2ff0"
            "&pager_element=1&sort_by=created&sort_order=DESC&page=0,{}")

for x in range(how_many_pages):
    # note: mygov.in seems very slow...
    json_data = requests.get(ajax_url.format(x)).content
    d = json.loads(json_data.decode())  # remove .decode() for Python < 3
    print(len(d))
    if len(d) == 3:  # sometimes the JSON list has 3 elements
        comments = d[2]['data']  # 'data' is the key that contains the comments HTML
    elif len(d) == 2:  # others just 2...
        comments = d[1]['data']
    else:
        continue  # skip responses with an unexpected shape

    # From here, we can use your BeautifulSoup code.
    soup = BeautifulSoup(comments, "html.parser")
    all_comments_div = soup.find_all('div', class_="comment_body")

    for div in all_comments_div:
        all_comments.append(div.find('p').text.translate(non_bmp_map))

print(all_comments)
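
Instead of branching on the list length, a slightly sturdier variant (my suggestion; it assumes, as the answer observes, that exactly one element of the JSON list carries a 'data' field with the comments HTML) picks that element directly:

# Take the 'data' payload from whichever element carries it, or None
comments = next(
    (item['data'] for item in d
     if isinstance(item, dict) and item.get('data')),
    None,
)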

Output:

["Sir my humble submission is that please ask public not to man handle doctors because they work in a very delicate situation, to save a patient is not always in his hand. The incidents of manhandling doctors is increasing day by day and it's becoming very difficult to work in these situatons. Majority are not Opting for medical profession,...']