I am trying to scrape a table from a JavaScript-rendered website using Pandas. To do this, I use Selenium to navigate to the page I want first. I am able to print the table as text (as shown by the commented-out lines), but I would like to load it into a Pandas DataFrame instead. My script is attached below; I hope someone can help me fix this.
Running the script below, I get an error whose message refers to `cells`:
import time
from selenium import webdriver
import pandas as pd
chrome_path = r"Path to chrome driver"
driver = webdriver.Chrome(chrome_path)
url = 'http://www.bursamalaysia.com/market/securities/equities/prices/#/?filter=BS02'
page = driver.get(url)
time.sleep(2)
driver.find_element_by_xpath('//*[@id="bursa_boards"]/option[2]').click()
driver.find_element_by_xpath('//*[@id="bursa_sectors"]/option[11]').click()
time.sleep(2)
driver.find_element_by_xpath('//*[@id="bm_equity_price_search"]').click()
time.sleep(5)
target = driver.find_elements_by_id('bm_equities_prices_table')
##for data in target:
## print (data.text)
for data in target:
    dfs = pd.read_html(target, match='+')
    for df in dfs:
        print(df)
I also tried using pd.read_html on the URL directly, but it returned a "No tables found" error. The URL is: http://www.bursamalaysia.com/market/securities/equities/prices/#/?filter=BS08&board=MAIN-MKT&sector=PROPERTIES&page=1.
Answer 0 (score: 2)
You can get the table with the following code:
import time
from selenium import webdriver
import pandas as pd
chrome_path = r"Path to chrome driver"
driver = webdriver.Chrome(chrome_path)
url = 'http://www.bursamalaysia.com/market/securities/equities/prices/#/?filter=BS02'
page = driver.get(url)
time.sleep(2)
df = pd.read_html(driver.page_source)[0]
print(df.head())
Here is the output:
No Code Name Rem Last Done LACP Chg % Chg Vol ('00) Buy Vol ('00) Buy Sell Sell Vol ('00) High Low
0 1 5284CB LCTITAN-CB s 0.025 0.020 0.005 +25.00 406550 19878 0.020 0.025 106630 0.025 0.015
1 2 1201 SUMATEC [S] s 0.050 0.050 - - 389354 43815 0.050 0.055 187301 0.055 0.050
2 3 5284 LCTITAN [S] s 4.470 4.700 -0.230 -4.89 367335 430 4.470 4.480 34 4.780 4.140
3 4 0176 KRONO [S] - 0.875 0.805 0.070 +8.70 300473 3770 0.870 0.875 797 0.900 0.775
4 5 5284CE LCTITAN-CE s 0.130 0.135 -0.005 -3.70 292379 7214 0.125 0.130 50 0.155 0.100
To get the data from all the pages, you can scrape the remaining pages in the same way and combine the results with df.append.
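A minimal sketch of that page-combining step, assuming you re-run pd.read_html(driver.page_source) after navigating Selenium to each page. The fetch_page helper below is hypothetical stand-in data, not the real site; note also that DataFrame.append was removed in recent pandas, so pd.concat is the safer way to combine the per-page frames:

```python
import pandas as pd

# Hypothetical stand-in: in the real script this would navigate Selenium
# to page `page_num` and return pd.read_html(driver.page_source)[0].
def fetch_page(page_num):
    return pd.DataFrame({"No": [page_num * 2 - 1, page_num * 2],
                         "Code": ["AAA", "BBB"]})

# Collect one DataFrame per results page, then combine them into one table.
frames = [fetch_page(n) for n in range(1, 4)]          # pages 1..3
all_rows = pd.concat(frames, ignore_index=True)        # renumber the index
print(len(all_rows))  # 6 rows across 3 pages
```

ignore_index=True matters here: each page's DataFrame starts its index at 0, and without it the combined table would have duplicate index labels.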
Answer 1 (score: 1)
Answer:
df = pd.read_html(target[0].get_attribute('outerHTML'))
Result:
The reason for target[0]:
driver.find_elements_by_id('bm_equities_prices_table') returns a list of Selenium WebElements; in your case there is only one element, hence the [0].
The reason for get_attribute('outerHTML'):
We want to get the HTML of the element. There are two such attributes: 'innerHTML' and 'outerHTML'. We chose 'outerHTML' because we need to include the current element itself (which is where the table headers are, I suppose), instead of only the inner contents of the element.
The reason for df[0]:
pd.read_html() returns a list of DataFrames; the first one is the result we want, hence the [0].
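That list-return behavior can be seen without Selenium at all. Below is a small self-contained sketch using a toy HTML string (the table content is made up for illustration) to show that pd.read_html always yields a list, even when the page contains a single table:

```python
import pandas as pd
from io import StringIO

# Toy HTML standing in for the element's outerHTML.
html = """<table>
  <tr><th>Code</th><th>Last Done</th></tr>
  <tr><td>5284</td><td>4.470</td></tr>
</table>"""

dfs = pd.read_html(StringIO(html))   # always returns a *list* of DataFrames
df = dfs[0]                          # hence the [0]
print(df.columns.tolist())           # ['Code', 'Last Done']
```

If the page had several tables, each would appear as its own element of the list, so indexing (or the match= argument) selects the one you want.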