Trying to scrape a table with Pandas from a Selenium result

Date: 2017-07-29 21:47:51

Tags: javascript python selenium

I am trying to scrape a table from a Javascript website using Pandas. To do that, I use Selenium to first navigate to the page I want. I am able to print the table in text format (as the commented-out lines in the script show), but I want to get the table into Pandas. My script, which raises an error when run, is attached below, and I hope someone can help me fix it.

import time
from selenium import webdriver
import pandas as pd

chrome_path = r"Path to chrome driver"
driver = webdriver.Chrome(chrome_path)
url = 'http://www.bursamalaysia.com/market/securities/equities/prices/#/?filter=BS02'

page = driver.get(url)
time.sleep(2)


driver.find_element_by_xpath('//*[@id="bursa_boards"]/option[2]').click()


driver.find_element_by_xpath('//*[@id="bursa_sectors"]/option[11]').click()
time.sleep(2)

driver.find_element_by_xpath('//*[@id="bm_equity_price_search"]').click()
time.sleep(5)

target = driver.find_elements_by_id('bm_equities_prices_table')
##for data in target:
##    print (data.text)

for data in target:
    dfs = pd.read_html(target,match = '+')
for df in dfs:
    print (df)  

I also tried using pd.read_html directly on the URL, but it returned a "No Table Found" error. The URL is: http://www.bursamalaysia.com/market/securities/equities/prices/#/?filter=BS08&board=MAIN-MKT&sector=PROPERTIES&page=1
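A likely reason for the "No Table Found" error: pd.read_html only sees the page's static HTML, and this site builds its table with JavaScript after the page loads. A minimal sketch of that failure mode, using an invented static-HTML stand-in for the page (not the site's real markup):

```python
from io import StringIO

import pandas as pd

# Static HTML as a JavaScript-heavy page might serve it: the table is
# built client-side, so the raw markup contains no <table> element.
static_html = """<html><body>
<div id="bm_equities_prices_table"></div>
<script>/* table rows are injected here by JavaScript */</script>
</body></html>"""

err = None
try:
    pd.read_html(StringIO(static_html))
except ValueError as exc:
    err = exc
print(err)  # the "No tables found" error the question describes
```

Selenium avoids this because it runs the JavaScript first, so driver.page_source contains the rendered table.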

2 Answers:

Answer 0 (score: 2)

You can get the table with the following code:

import time
from selenium import webdriver
import pandas as pd

chrome_path = r"Path to chrome driver"
driver = webdriver.Chrome(chrome_path)
url = 'http://www.bursamalaysia.com/market/securities/equities/prices/#/?filter=BS02'

page = driver.get(url)
time.sleep(2)

df = pd.read_html(driver.page_source)[0]
print(df.head())

Here is the output:

No  Code    Name    Rem Last Done   LACP    Chg % Chg   Vol ('00)   Buy Vol ('00)   Buy Sell    Sell Vol ('00)  High    Low
0   1   5284CB  LCTITAN-CB  s   0.025   0.020   0.005   +25.00  406550  19878   0.020   0.025   106630  0.025   0.015
1   2   1201    SUMATEC [S] s   0.050   0.050   -   -   389354  43815   0.050   0.055   187301  0.055   0.050
2   3   5284    LCTITAN [S] s   4.470   4.700   -0.230  -4.89   367335  430 4.470   4.480   34  4.780   4.140
3   4   0176    KRONO [S]   -   0.875   0.805   0.070   +8.70   300473  3770    0.870   0.875   797 0.900   0.775
4   5   5284CE  LCTITAN-CE  s   0.130   0.135   -0.005  -3.70   292379  7214    0.125   0.130   50  0.155   0.100

To get the data from all the pages, you can scrape the remaining pages and combine them with df.append.
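A sketch of combining per-page results; the two small DataFrames below are stand-ins for what pd.read_html(driver.page_source)[0] would return for each page, and pd.concat is shown since df.append has since been deprecated in pandas:

```python
import pandas as pd

# Stand-ins for the DataFrame scraped from each results page, e.g.
# pd.read_html(driver.page_source)[0] after clicking to the next page.
page1 = pd.DataFrame({'Code': ['5284CB', '1201'], 'Last Done': [0.025, 0.050]})
page2 = pd.DataFrame({'Code': ['0176'], 'Last Done': [0.875]})

# Combine the pages into one DataFrame; df.append(page2) also works in
# older pandas, but pd.concat is the non-deprecated equivalent.
all_pages = pd.concat([page1, page2], ignore_index=True)
print(all_pages)
```

In a real run you would collect each page's DataFrame into a list inside the paging loop and concatenate once at the end.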

Answer 1 (score: 1)

Answer: df = pd.read_html(target[0].get_attribute('outerHTML'))

Result: (screenshot of the parsed DataFrame in the original answer)

The reason for target[0]: driver.find_elements_by_id('bm_equities_prices_table') returns a list of Selenium WebElements; in your case there is only one element, hence the [0].

The reason for get_attribute('outerHTML'): we want the HTML of the element. There are two such attributes: 'innerHTML' and 'outerHTML'. We chose 'outerHTML' because we need to include the element itself, where the table headers presumably are, rather than only the element's inner contents.

The reason for df[0]: pd.read_html() returns a list of DataFrames; the first one is the result we want, hence the [0].
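Both points can be checked without Selenium. Assuming a tiny invented table, markup in outerHTML form (including the <table> tag) parses into a one-element list of DataFrames, while innerHTML form (the rows only) gives pandas no table to find:

```python
from io import StringIO

import pandas as pd

# outerHTML of the table element includes the <table> tag itself...
outer_html = "<table><tr><th>Code</th></tr><tr><td>5284</td></tr></table>"
# ...while innerHTML would hold only the rows inside it.
inner_html = "<tr><th>Code</th></tr><tr><td>5284</td></tr>"

dfs = pd.read_html(StringIO(outer_html))  # a *list* of DataFrames
df = dfs[0]                               # hence the trailing [0]
print(len(dfs), list(df.columns))

err = None
try:
    pd.read_html(StringIO(inner_html))    # no <table> tag to match
except ValueError as exc:
    err = exc
print(err)                                # the "No tables found" error
```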