Running the following code with Python 3.6.1:
import requests
import pandas as pd
from bs4 import BeautifulSoup
# url_addr = "https://www.uspto.gov/web/offices/ac/ido/oeip/taf/mclsstc/mcls1.htm"
url_addr = "https://www.cefconnect.com/closed-end-funds-daily-pricing"
html_text = requests.get(url_addr).content
bs_obj = BeautifulSoup(html_text, "html.parser")  # specify a parser explicitly
tables = bs_obj.findAll('table')
dfs = list()
for table in tables:
    df = pd.read_html(str(table))[0]
    dfs.append(df)
    print(df)
only gets the column headers, not the actual data, with this output:
Empty DataFrame
Columns: [Ticker, Fund Name, Strategy, ClosingPrice, PriceChange, NAV, Premium/Discount, DistributionRate, DistributionRate on NAV, 1 Yr Rtnon NAV]
Index: []
It works fine for the commented-out url_addr.
Answer 0 (score: 1)
The second URL populates its table with JavaScript. If you fetch it with wget, or look at the Network tab in Google Chrome, you will see that this is the table as originally sent (i.e., this is what Beautiful Soup sees):
<div id="data-container" class="row-fluid">
<div class="span12">
<table class="cefconnect-table-1 daily-pricing table table-striped table-condensed" id="daily-pricing" width="100%" cellpadding="5" cellspacing="0" border="0" summary="">
<thead>
<tr>
<th class="ticker">Ticker</th>
<th class="fund-name">Fund Name</th>
<th class="strategy">Strategy</th>
<th class="closing-price">Closing<br />Price</th>
<th class="price-change">Price<br />Change</th>
<th class="nav">NAV</th>
<th class="premium-discount">Premium/<br />Discount</th>
<th class="distribution-rate">Distribution<br />Rate<sup>†</sup></th>
<th class="distribution-rate-on-nav">Distribution<br />Rate on NAV</th>
<th class="return-on-nav">1 Yr Rtn<br />on NAV</th>
</tr>
</thead>
<tbody></tbody>
</table>
</div>
</div>
Some JavaScript then fills in the table. From here you have two options: either use a headless browser (e.g., PhantomJS or Selenium; there are many relatively easy-to-use options) and let it run the JavaScript before parsing, or try to figure out how to access the API that the page uses to load the data. A sketch of the first option follows.
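A minimal sketch of the headless-browser route, assuming Selenium (4+ syntax) driving headless Chrome with a matching chromedriver on the PATH; the #daily-pricing id comes from the markup above, and the wait condition and timeout are illustrative, not tested against the live site:

import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

url_addr = "https://www.cefconnect.com/closed-end-funds-daily-pricing"

options = Options()
options.add_argument("--headless")          # run Chrome without a window
driver = webdriver.Chrome(options=options)  # assumes chromedriver is on the PATH
try:
    driver.get(url_addr)
    # Wait until the JavaScript has inserted at least one row into the table body.
    WebDriverWait(driver, 30).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "#daily-pricing tbody tr"))
    )
    # The rendered page source now contains the data, so pandas can parse it directly.
    dfs = pd.read_html(driver.page_source, attrs={"id": "daily-pricing"})
    print(dfs[0].head())
finally:
    driver.quit()

The key difference from the original code is that parsing happens only after the browser has executed the page's JavaScript, so the tbody is no longer empty.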
Another option I always like to mention is contacting the website owner and making arrangements to get the data in a more direct way.