Question

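I am using bs4 to scrape a user profile on Khan Academy: https://www.khanacademy.org/profile/DFletcher1990/

I am trying to get the user statistics (date joined, energy points earned, videos completed).

I have checked the documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

It says: "The most common type of unexpected behavior is that you can't find a tag that you know is in the document. You saw it going in, but find_all() returns [] or find() returns None. This is another common problem with Python's built-in HTML parser, which sometimes skips tags it doesn't understand. Again, the solution is to install lxml or html5lib."

I tried different parsers, but I get the same problem.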

from bs4 import BeautifulSoup
import requests

url = 'https://www.khanacademy.org/profile/DFletcher1990/'

res = requests.get(url)

soup = BeautifulSoup(res.content, "lxml")

print(soup.find_all('div', class_='profile-widget-section'))

My code returns [].

Answer 1

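The page content is loaded with JavaScript. The easiest way to check whether content is dynamic is to right-click the page, choose View Page Source, and check whether the content is there. You can also disable JavaScript in your browser and visit the URL.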
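The same check can be done in code by looking for the target class in the HTML exactly as the server sends it. The following is only a minimal sketch using the standard library: the two HTML strings are made-up stand-ins for the raw page source and the browser-rendered DOM, and `ClassFinder` and `has_class` are hypothetical helper names, not part of bs4.

```python
from html.parser import HTMLParser


class ClassFinder(HTMLParser):
    """Records whether any start tag carries a given CSS class."""

    def __init__(self, css_class: str) -> None:
        super().__init__()
        self.css_class = css_class
        self.found = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        classes = (dict(attrs).get("class") or "").split()
        if self.css_class in classes:
            self.found = True


def has_class(html: str, css_class: str) -> bool:
    """Return True if some tag in `html` has `css_class` among its classes."""
    finder = ClassFinder(css_class)
    finder.feed(html)
    return finder.found


# Hypothetical samples: what the server sends vs. what exists after scripts run.
raw_source = (
    '<html><body><div id="app"></div>'
    "<script>render()</script></body></html>"
)
rendered_dom = (
    '<html><body><div id="app">'
    '<table class="user-statistics-table"></table></div></body></html>'
)

print(has_class(raw_source, "user-statistics-table"))    # False
print(has_class(rendered_dom, "user-statistics-table"))  # True
```

If the class is missing from the raw source but visible in the browser's element inspector, switching parsers will not help, because the tag was never in the downloaded HTML in the first place.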

You can use Selenium to get the content:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Selenium 4 takes the driver path through a Service object;
# the older executable_path keyword is deprecated and gone in recent releases.
driver = webdriver.Chrome(service=Service('/path/to/chromedriver'))
try:
    driver.get("https://www.khanacademy.org/profile/DFletcher1990/")
    # Wait up to 20 seconds for the JavaScript-rendered statistics table.
    WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.XPATH, '//*[@id="widget-list"]/div[1]/div[1]/div[2]/div/div[2]/table')))
    source = driver.page_source
finally:
    # Close the browser even if the wait times out.
    driver.quit()

soup = BeautifulSoup(source, 'html.parser')
user_info_table = soup.find('table', class_='user-statistics-table')
for tr in user_info_table.find_all('tr'):
    tds = tr.find_all('td')
    print(tds[0].text, ":", tds[1].text)

Output:

Date joined : 4 years ago
Energy points earned : 932,915
Videos completed : 372

Another option (since you are already familiar with requests) is requests-html:

from bs4 import BeautifulSoup
from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://www.khanacademy.org/profile/DFletcher1990/')
# Run the page's JavaScript in a headless browser and give it 10 seconds to finish.
r.html.render(sleep=10)

# r.html.html now holds the rendered markup rather than the raw response.
soup = BeautifulSoup(r.html.html, 'html.parser')
user_info_table = soup.find('table', class_='user-statistics-table')
for tr in user_info_table.find_all('tr'):
    tds = tr.find_all('td')
    print(tds[0].text, ":", tds[1].text)

Output:

Date joined : 4 years ago
Energy points earned : 932,915
Videos completed : 372

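A further option is to work out which AJAX request the page makes, emulate it, and parse the response. That response is not always JSON. In this case, however, the content is not sent to the browser in an AJAX response; it is already present in the page source.

The page only uses JavaScript to build this information into the DOM. You could try to get the data out of the script tags, which may involve some regular expressions, and then build JSON from the resulting string.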
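A rough sketch of that last idea, assuming the data sits in a script tag as a JSON object assigned to some variable. The variable name `window.__PROFILE__`, the key names, and the sample markup are all invented for illustration; the real page will use different names, which you would need to find by reading its source. `extract_json_after` is a hypothetical helper.

```python
import json
import re


def extract_json_after(html: str, marker: str) -> dict:
    """Find `marker` in the page source and decode the JSON value right after it."""
    # Match the literal marker plus any whitespace before the JSON starts.
    match = re.search(re.escape(marker) + r"\s*", html)
    if match is None:
        raise ValueError(f"marker {marker!r} not found")
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever follows it (for example a trailing semicolon).
    obj, _end = json.JSONDecoder().raw_decode(html, match.end())
    return obj


# Hypothetical page source; names and structure are assumptions.
sample = (
    "<script>window.__PROFILE__ = "
    '{"dateJoined": "4 years ago", "points": 932915, "videos": 372};'
    "</script>"
)

data = extract_json_after(sample, "window.__PROFILE__ =")
print(data["points"])  # 932915
```

Using the regular expression only to locate the start and letting the JSON decoder find the end avoids having to match nested braces with a pattern. It only works when the embedded value is strict JSON; a JavaScript object literal with unquoted keys would need different handling.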

Can't find a tag I know is in the document - find_all() returns []
