How do I use pandas read_html and the requests library to read a table?

Time: 2013-11-14 16:45:04

Tags: python-2.7 pandas python-requests

How can I scrape the fund prices from the following page:

http://www.prudential.com.hk/PruServlet?module=fund&purpose=searchHistFund&fundCd=JAS_U

My attempt below is wrong. How should I change it?

import pandas as pd
import requests
import re
url = 'http://www.prudential.com.hk/PruServlet?module=fund&purpose=searchHistFund&fundCd=JAS_U'
tables = pd.read_html(requests.get(url).text, attrs={"class":re.compile("fundPriceCell\d+")})

2 answers:

Answer 0 (score: 2)

I like lxml for parsing and querying HTML. Here is what I came up with:

import requests
from lxml import etree

url = 'http://www.prudential.com.hk/PruServlet?module=fund&purpose=searchHistFund&fundCd=JAS_U'
doc = requests.get(url)
tree = etree.HTML(doc.content)

row_xpath = '//tr[contains(td[1]/@class, "fundPriceCell")]'

rows = tree.xpath(row_xpath)

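for row in rows:
    # Each matching row holds three cells: the date followed by two price values.
    date_string, v1, v2 = (td.text for td in row)
    # The parenthesized form of print works on both Python 2.7 and Python 3.
    print("%s - %s - %s" % (date_string, v1, v2))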
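To try the row XPath above without hitting the live site, here is a minimal offline sketch. The HTML fragment, the header class name, and the helper `extract_price_rows` are made up for illustration; the real page markup may well differ.

```python
from lxml import etree

# Hypothetical fragment standing in for the fund page (assumed structure).
SAMPLE_HTML = """
<table>
  <tr><td class="header">Date</td><td class="header">Bid</td><td class="header">Ask</td></tr>
  <tr><td class="fundPriceCell1">14/11/2013</td><td class="fundPriceCell1">1.2340</td><td class="fundPriceCell1">1.2990</td></tr>
  <tr><td class="fundPriceCell2">13/11/2013</td><td class="fundPriceCell2">1.2300</td><td class="fundPriceCell2">1.2950</td></tr>
</table>
"""


def extract_price_rows(html_text):
    """Return one tuple of cell texts per row whose first cell has a fundPriceCell class."""
    tree = etree.HTML(html_text)
    # Same predicate as in the answer: test the class attribute of the first <td>.
    rows = tree.xpath('//tr[contains(td[1]/@class, "fundPriceCell")]')
    return [tuple(td.text for td in row.findall("td")) for row in rows]


price_rows = extract_price_rows(SAMPLE_HTML)
for date_string, v1, v2 in price_rows:
    print("%s - %s - %s" % (date_string, v1, v2))
```

The header row is skipped because its first cell's class does not contain "fundPriceCell", so only the two data rows are printed.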

Answer 1 (score: 1)

My solution is similar to yours:

import pandas as pd
import requests
from lxml import etree

url = "http://www.prudential.com.hk/PruServlet?module=fund&purpose=searchHistFund&fundCd=JAS_U"
r = requests.get(url)
html = etree.HTML(r.content)
data = html.xpath('//table//table//table//table//td[@class="fundPriceCell1" or @class="fundPriceCell2"]//text()')

if len(data) % 3 == 0:
    df = pd.DataFrame([data[i:i+3] for i in range(0, len(data), 3)], columns = ['date', 'bid', 'ask'])
    df = df.set_index('date')
    df.index = pd.to_datetime(df.index, format = '%d/%m/%Y')
    df.sort_index(inplace = True)
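
The reshaping step above can be checked in isolation. The sketch below assumes a flat list of cell texts shaped like the XPath result (the sample values are invented), and the helper name `to_price_frame` is hypothetical. Unlike the code above, it raises an error when the cell count is not a multiple of three and converts the price columns to floats; both are optional variations, not part of the original answer.

```python
import pandas as pd

# Invented sample values, shaped like the flat text() result: date, bid, ask, date, bid, ask, ...
flat_cells = ["14/11/2013", "1.2340", "1.2990", "13/11/2013", "1.2300", "1.2950"]


def to_price_frame(cells):
    """Group a flat list of cell texts into date/bid/ask rows indexed by date."""
    if len(cells) % 3 != 0:
        raise ValueError("expected a multiple of 3 cells, got %d" % len(cells))
    # Slice the flat list into consecutive chunks of three cells.
    rows = [cells[i:i + 3] for i in range(0, len(cells), 3)]
    df = pd.DataFrame(rows, columns=["date", "bid", "ask"])
    # Dates are assumed to be day/month/year, as in the answer above.
    df["date"] = pd.to_datetime(df["date"], format="%d/%m/%Y")
    # Scraped cells are strings; convert prices to floats for numeric work.
    df[["bid", "ask"]] = df[["bid", "ask"]].astype(float)
    return df.set_index("date").sort_index()


prices = to_price_frame(flat_cells)
print(prices)
```

Raising on a bad cell count makes a layout change on the page visible immediately, whereas the `if` guard above would silently leave `df` undefined.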