I want to use Python or R to extract the fund prices from the following link:
http://www.mpf.invesco.com.hk/html/en/mpf/prices.html
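But every time I load the page in a browser, it redirects me to the page below, where I have to confirm that I have read the important information before I can see the fund prices.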
http://www.mpf.invesco.com.hk/html/en/mpf/information.html
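I think that "Important Information" page is generated by JavaScript. Can I use R or Python to confirm that I have read the important information and then retrieve the fund prices from the page that follows?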
Answer 0 (score: 1)
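The situation is a bit simpler than that. The table you need sits inside an iframe that is loaded from a separate URL (the URL value in the requests example below).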
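If you need to discover such an iframe URL yourself, one option is to scan the page's static HTML for iframe tags. The following is only a minimal sketch using the standard library's html.parser; the sample HTML and the example.com address are made up for illustration, and the id myFrame is borrowed from the RSelenium answer in this thread (whether the real page uses it as an id or a name is an assumption).

```python
from html.parser import HTMLParser


class IframeFinder(HTMLParser):
    """Collects the src attribute of every <iframe> tag it sees."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names; attrs is a list of (name, value) pairs
        if tag == "iframe":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def find_iframe_sources(html):
    """Return the src of each iframe found in an HTML string."""
    finder = IframeFinder()
    finder.feed(html)
    return finder.sources


# Hypothetical markup, NOT copied from the real page
sample_html = """
<html><body>
  <iframe id="myFrame" src="https://example.com/prices_frame"></iframe>
</body></html>
"""

print(find_iframe_sources(sample_html))
```

If the iframe's src is set by JavaScript after the page loads, it will not show up in the static HTML; in that case the browser's developer tools (network tab) are the easier way to find it.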
Here is how to fetch it with requests and parse it with BeautifulSoup:
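from bs4 import BeautifulSoup
import requests

# URL of the iframe that holds the price table
URL = 'https://apps.ap.invesco.com/invee/fund_info/fund_price_ns_mpf.do?version=en&haaccount=N&url=http://www.mpf.invesco.com.hk/html/pdf/factsheets/mpf'
response = requests.get(URL)
# name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(response.content, 'html.parser')
# the second table on the page is the one we want
table = soup.find_all('table')[1]
# print the first row as an example (print() works on Python 2.7 and 3)
print(table.tr.text.strip())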
This prints:
Valuation Date: 10/07/2014
FYI, neither selenium nor a real browser is needed here.
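To go from the single printed row to usable data, the table rows can be turned into records. This is a rough sketch, not something checked against the live page: extract_rows and rows_to_records are hypothetical helper names, the column headers are taken from the RSelenium output elsewhere in this thread, and the sample rows are typed in by hand from values shown in the two answers (pairing them is purely illustrative).

```python
def extract_rows(html):
    """Return each row of the second table as a list of cell strings.

    Mirrors soup.find_all('table')[1] from the answer above.
    """
    from bs4 import BeautifulSoup  # imported lazily; only this helper needs it

    soup = BeautifulSoup(html, "html.parser")
    table = soup.find_all("table")[1]
    return [
        [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        for tr in table.find_all("tr")
    ]


def rows_to_records(rows, price_column="Fund Price"):
    """Convert raw rows to dicts, ignoring everything before the header row."""
    header = None
    records = []
    for row in rows:
        cells = [cell.strip() for cell in row]
        if header is None:
            # The header is assumed to be the first row naming the price column
            if price_column in cells:
                header = cells
            continue
        if len(cells) != len(header):
            continue  # skip spacer or footnote rows
        record = dict(zip(header, cells))
        try:
            record[price_column] = float(record[price_column])
        except ValueError:
            continue  # skip rows without a numeric price
        records.append(record)
    return records


# Hand-typed sample rows (illustrative only)
sample_rows = [
    ["Valuation Date: 10/07/2014"],
    ["Name of Constituent Fund", "Unit Class", "Currency", "Fund Price"],
    ["Hong Kong and China Equity Fund", "A", "HKD", "34.5537"],
    ["Asian Equity Fund", "A", "HKD", "10.2323"],
]

records = rows_to_records(sample_rows)
for record in records:
    print(record["Name of Constituent Fund"], record["Fund Price"])
```

With the real page you would call rows_to_records(extract_rows(response.content)); if the actual header text differs, pass a different price_column.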
Answer 1 (score: 1)
Using RSelenium and phantomjs:
# use dev version so we can run phantomjs without a selenium server
# devtools::install_github("ropensci/RSelenium")
# it is necessary that phantomjs is in your PATH if not
# refer to package vignettes
library(RSelenium)
appURL <- "http://www.mpf.invesco.com.hk/html/en/mpf/prices.html"
pJS <- phantom()
remDr <- remoteDriver(browserName = "phantomjs")
remDr$open()
remDr$navigate(appURL)
# <span onclick=\"accept();return false;\">I have read the Important Information</span>
# execute above code
remDr$executeScript("accept();return false;")
# switch to iframe element
remDr$switchToFrame("myFrame")
> library(XML)  # provides readHTMLTable
> head(readHTMLTable(remDr$getPageSource()[[1]], which = 2, header = TRUE, skip.rows = 1))
  Name of Constituent Fund                                                        Unit Class Currency Fund Price
1 Hong Kong and China Equity Fund                                                 A          HKD         34.5537
2 Asian Equity Fund                                                               A          HKD         10.2323
3 Growth Fund                                                                     A          HKD         19.2199
4 Balanced Fund                                                                   A          HKD         18.8244
5 RMB Bond Fund (this Constituent Fund is denominated in HKD only and not in RMB) A          HKD          9.8299
6 Capital Stable Fund                                                             A          HKD         18.3871
Finally, stop the phantomjs instance when you are done:
pJS$stop()