How to scrape data from a page that is loaded via JavaScript

Date: 2020-07-22 17:09:48

Tags: python-3.x beautifulsoup

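I want to use BeautifulSoup to scrape the comments on this page: https://www.x....s.com/video_id/the-suburl

The comments are loaded by JavaScript when you click. They are paginated, and each page of comments is also loaded on click. I want to fetch all the comments, and for each one I want the poster's profile URL, the comment text, the number of likes, the number of dislikes, and the time posted (as shown on the page).

The comments can be a list of dictionaries.

How do I go about this?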

2 answers:

Answer 0: (score: 2)

This script prints all the comments found on the page:

import json
import requests
from bs4 import BeautifulSoup


url = 'https://www.x......com/video_id/gggjggjj/'
video_id = url.rsplit('/', maxsplit=2)[-2].replace('video', '')

u = 'https://www.x......com/threads/video/ggggjggl/{video_id}/0/0'.format(video_id=video_id)
comments = requests.post(u, data={'load_all':1}).json()

for id_ in comments['posts']['ids']:
    print(comments['posts']['posts'][id_]['date'])
    print(comments['posts']['posts'][id_]['name'])
    print(comments['posts']['posts'][id_]['url'])
    print(BeautifulSoup(comments['posts']['posts'][id_]['message'], 'html.parser').get_text())
    # ...etc.
    print('-'*80)
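
Since the question asks for a list of dictionaries rather than printed output, here is a minimal sketch of that step. It assumes the JSON has the same shape the script above reads (`posts.ids` plus `posts.posts[id]` with `date`, `name`, `url` and `message`), and it assumes `url` is the poster's profile URL. The key names for likes and dislikes are not shown in this answer, so they are left out; inspect the real response to find them. The helper names `strip_tags` and `comments_to_dicts` and the sample payload are made up for illustration, and the standard-library `HTMLParser` stands in for BeautifulSoup only so that the sketch runs without extra packages.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML fragment and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(fragment):
    """Return the plain text of an HTML fragment."""
    parser = _TextExtractor()
    parser.feed(fragment)
    parser.close()
    return ''.join(parser.parts)


def comments_to_dicts(payload):
    """Turn the decoded JSON response into a list of dictionaries."""
    posts = payload['posts']
    result = []
    for id_ in posts['ids']:
        post = posts['posts'][id_]
        result.append({
            'profile_url': post['url'],   # assumed to be the poster's profile URL
            'name': post['name'],
            'posted': post['date'],
            'comment': strip_tags(post['message']),
        })
    return result


# Hypothetical payload that mimics the structure used above; not real data.
sample = {
    'posts': {
        'ids': ['1', '2'],
        'posts': {
            '1': {'date': '2 days ago', 'name': 'alice',
                  'url': '/profiles/alice',
                  'message': '<p>Nice <b>video</b></p>'},
            '2': {'date': '1 hour ago', 'name': 'bob',
                  'url': '/profiles/bob',
                  'message': 'Thanks'},
        },
    },
}

all_comments = comments_to_dicts(sample)
```

With the real response you would call `comments_to_dicts(comments)` on the `comments` object from the script above.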

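Answer 1: (score: 0)

This can be done with Selenium, which drives a real browser. You can use the Chrome driver or the Firefox driver (geckodriver), whichever you prefer.

Here is a link on how to install the Chrome webdriver: http://jonathansoma.com/lede/foundations-2018/classes/selenium/selenium-windows-install/

Then set it up in your code: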

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# this part may change depending on where you installed the webdriver. 
# You may have to define the path to the driver. 
# For me my driver is in C:/bin so I do not need to define the path
chrome_options = Options()

# or '--start-maximized' instead if you want the browser window to open
chrome_options.add_argument('--headless') 

driver = webdriver.Chrome(options=chrome_options)

driver.get(your_url)
html = driver.page_source # downloads the html from the driver

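Selenium has a number of features you can use to perform actions such as clicking elements on the page. Once you have located an element with Selenium, you can call its .click() method to interact with it. Let me know if this helps.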
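
To make the click step concrete, here is a rough sketch of a loop that keeps clicking a "load more" control until it is no longer found. The helper name `click_until_gone`, the `demo` function, and the CSS selector `a.load-more-comments` are all hypothetical; the real selector has to come from inspecting the page. The fixed `time.sleep` pause is a crude stand-in for a proper wait, and the `max_clicks` cap is there so the loop still ends if the control never disappears.

```python
import time


def click_until_gone(driver, by, value, max_clicks=50, pause=1.0):
    """Click the first element matching (by, value) until none is left.

    Returns the number of clicks performed. `driver` only needs a
    Selenium-style find_elements(by, value) method.
    """
    clicks = 0
    while clicks < max_clicks:
        matches = driver.find_elements(by, value)
        if not matches:
            break  # nothing left to expand
        matches[0].click()
        clicks += 1
        time.sleep(pause)  # crude wait for the next batch to render
    return clicks


def demo(url):
    """Open the page headlessly, expand the comments, return the HTML."""
    # Imported here so the helper above can be tried without a browser.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By

    options = Options()
    options.add_argument('--headless')
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # 'a.load-more-comments' is a made-up selector; replace it with
        # whatever the real page uses for its comment controls.
        click_until_gone(driver, By.CSS_SELECTOR, 'a.load-more-comments')
        return driver.page_source
    finally:
        driver.quit()
```

The HTML returned by `demo(your_url)` can then be passed to BeautifulSoup in the usual way.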
