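Scraping a dynamically loaded table with BeautifulSoup

Date: 2021-04-01 10:35:32

Tags: python beautifulsoup python-requests

My code returns the values of the first two td tags in each row, but the later ones are missing.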


My code:

import bs4 as bs
import requests

resp = requests.get('https://q.stock.sohu.com/cn/bk_4401.shtml')
resp.encoding = 'gb2312'
soup = bs.BeautifulSoup(resp.text, 'lxml')
tab_sgtsc_list = soup.find('table').find('tbody').find_all('tr')

for tab_sgtsc in tab_sgtsc_list:
    print('**************************************')
    print(tab_sgtsc.find_all('td')[0].text)
    print(tab_sgtsc.find_all('td')[1].text)
    print(tab_sgtsc.find_all('td')[2].text)
    print(tab_sgtsc.find_all('td')[3].text)
    print('**************************************')


1 Answer:

Answer 0 (score: 2)

The table is rendered dynamically by JavaScript, so you won't get much from the plain HTML.
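One way to confirm this before reaching for a browser is to count the cells in each row of the raw response. The sketch below uses only the standard library's `html.parser`; `CellCounter` and the `STATIC_HTML` sample are hypothetical stand-ins for illustration, not the actual markup served by the site.

```python
from html.parser import HTMLParser


class CellCounter(HTMLParser):
    """Count the <td> cells found in each <tr> of an HTML string."""

    def __init__(self):
        super().__init__()
        self.cells_per_row = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.cells_per_row.append(0)  # a new row starts
        elif tag == "td" and self.cells_per_row:
            self.cells_per_row[-1] += 1  # one more cell in the current row


# Hypothetical sample of what a server might send before any script runs:
# the row exists, but only two of its cells are present.
STATIC_HTML = """
<table>
  <tbody>
    <tr><td>1</td><td>600000</td></tr>
  </tbody>
</table>
"""

counter = CellCounter()
counter.feed(STATIC_HTML)
print(counter.cells_per_row)  # [2]
```

If the counts from the real `resp.text` are lower than the number of columns visible in the browser, the remaining cells are being filled in client-side and `requests` alone will not see them.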

However, selenium and pandas come to the rescue!

Required: selenium, pandas, and Chrome with a matching ChromeDriver.

Here's how:

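import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

options = Options()
# Run Chrome without a window. The options.headless attribute is deprecated
# in newer Selenium 4 releases, so pass the command-line flag instead.
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)

try:
    driver.get("https://q.stock.sohu.com/cn/bk_4401.shtml")

    # Wait up to 10 seconds for JavaScript to render the table, then take its
    # visible text, drop the sort-hint label, and split on whitespace.
    wait = WebDriverWait(driver, 10)
    element = wait.until(
        EC.visibility_of_element_located((By.CSS_SELECTOR, 'table.tableMSB'))
    ).text.replace("点击按代码排序查询", "").split()
finally:
    driver.quit()  # always close the browser, even if the wait times out

# The table has 12 columns, so regroup the flat token list into rows of 12.
table = [element[i:i + 12] for i in range(0, len(element), 12)]
# The first row is the header; the remaining rows are written to a CSV file.
pd.DataFrame(table[1:], columns=table[0]).to_csv("your_table_data.csv", index=False)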
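The list comprehension that builds `table` regroups a flat list of whitespace-separated tokens into fixed-width rows. A minimal sketch of that step in isolation follows; the helper name `chunk_rows` and the three-column sample data are hypothetical, chosen only to keep the example short.

```python
def chunk_rows(tokens, width):
    """Split a flat list of tokens into consecutive rows of `width` items."""
    if width <= 0:
        raise ValueError("width must be positive")
    if len(tokens) % width != 0:
        # A leftover partial row usually means a cell contained whitespace
        # or the column count is wrong.
        raise ValueError(
            f"{len(tokens)} tokens do not divide evenly into rows of {width}"
        )
    return [tokens[i:i + width] for i in range(0, len(tokens), width)]


# Hypothetical sample: a 3-column header followed by two data rows.
tokens = "Code Name Price 600000 BankA 10.5 600001 BankB 7.2".split()
header, *rows = chunk_rows(tokens, 3)
print(header)
print(rows)
```

Note that this approach assumes no cell contains internal whitespace; a cell such as a two-word name would shift every later column, which is why the sketch raises an error when the token count does not divide evenly.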
