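I am trying to fetch a table from a stock exchange web page so that I can work with it. Ideally I am looking for some kind of matrix variable (a DataFrame?) to make it easy to use. So far, however, I am stuck on parsing the HTML table itself. Here is the code: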
from lxml import etree
from urllib.request import Request, urlopen
import requests
SYMBOL = "NIFTY"
URL = "https://www.nseindia.com/live_market/dynaContent/live_watch/option_chain/optionKeys.jsp?symbol=" + SYMBOL + "&date=-"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
req = Request(url=URL, headers=headers)
Opt_Page = urlopen(req).read()
#print(Opt_Page)
html = etree.HTML(Opt_Page)
tr_nodes = html.xpath('//table[@id="octable"]/tr')
tmp = tr_nodes[0].xpath("th")  # this is where the problem begins
# this gives completely blank output; tried with node[0] to [20]
print(tmp)
# the 'th' elements are inside the first 'tr'
header = [i[1].text for i in tr_nodes[1].xpath("th")]
td_content = [[td.text for td in tr.xpath('td')] for tr in tr_nodes[1:]]
print(header) # all headers are empty
print(td_content) # all content is empty
I expect the row headers and the individual row contents as output.
Answer 0 (score: 0)
You can install the pandas library (pip install pandas) along with the related dependencies (probably pip install lxml), and then use a DataFrame:
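from io import StringIO
from pandas import read_html
html = """
<table>
<tr>
<th>First</th>
<th>Last</th>
</tr>
<tr>
<td>John</td>
<td>Smith</td>
</tr>
<tr>
<td>Jane</td>
<td>Doe</td>
</tr>
</table>
"""
# Wrap the literal HTML in StringIO; newer pandas versions deprecate passing raw HTML strings
tables = read_html(StringIO(html))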
df = tables[0]
print(df)
print('----------')
print(df['Last'][0])
# Prints the following:
#
#   First   Last
# 0  John  Smith
# 1  Jane    Doe
# ----------
# Smith
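To apply the same idea to the page from the question, read_html also accepts an attrs argument that restricts the search to tables with matching attributes. The following is only a minimal sketch, assuming the table still carries id="octable" as in the question's XPath. SAMPLE_PAGE, fetch_page and table_by_id are hypothetical names introduced for illustration, and the sample rows are made up rather than real market data.

```python
from io import StringIO
from urllib.request import Request, urlopen

import pandas as pd

# Hypothetical stand-in for the exchange page; the real markup may differ.
SAMPLE_PAGE = """
<html><body>
<table id="other"><tr><th>x</th></tr><tr><td>1</td></tr></table>
<table id="octable">
  <thead><tr><th>Strike Price</th><th>LTP</th></tr></thead>
  <tbody>
    <tr><td>11000</td><td>152.5</td></tr>
    <tr><td>11100</td><td>98.0</td></tr>
  </tbody>
</table>
</body></html>
"""


def fetch_page(url, user_agent="Mozilla/5.0"):
    """Download a page as text, sending a browser-like User-Agent header."""
    req = Request(url=url, headers={"User-Agent": user_agent})
    with urlopen(req) as resp:
        return resp.read().decode("utf-8", errors="replace")


def table_by_id(page_html, table_id):
    """Return the first table whose id attribute equals table_id as a DataFrame."""
    # attrs limits the match to tables with this id, so other tables are ignored.
    tables = pd.read_html(StringIO(page_html), attrs={"id": table_id})
    return tables[0]


# Demonstrated on the sample page so the sketch runs without network access.
df = table_by_id(SAMPLE_PAGE, "octable")
print(df.columns.tolist())
print(df["LTP"][0])
```

With the live page, the equivalent call would be something like table_by_id(fetch_page(URL), "octable"). If the site fills the table in with JavaScript, the downloaded HTML may not contain it at all, in which case read_html raises a ValueError because no matching table is found.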