Question

I am using Pandas to parse the data from the following page: http://kenpom.com/index.php?y=2014

To get the data, I write:

dfs = pd.read_html(url)

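The data looks great and is parsed perfectly, except that it only picks up the first 40 rows. This seems to be a problem with how the table is split into sections, which keeps pandas from getting all the information.

How can I get pandas to pull all the data from all the tables on that web page?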

Answer 1

The HTML of the page you posted contains multiple <thead> and <tbody> tags, which confuse pandas.read_html.
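One way to check whether a page has this structure is to count the section tags in its markup. The sketch below uses only the standard library; the SectionCounter class and the sample markup are illustrative assumptions, not taken from the page in the question.

```python
from collections import Counter
from html.parser import HTMLParser


class SectionCounter(HTMLParser):
    """Count opening <thead>, <tbody>, and <tr> tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this hook.
        if tag in ("thead", "tbody", "tr"):
            self.counts[tag] += 1


# Made-up sample: one table whose header repeats between two bodies.
SAMPLE_HTML = """
<table id="ratings-table">
  <thead><tr><th>Rank</th><th>Team</th></tr></thead>
  <tbody><tr><td>1</td><td>A</td></tr><tr><td>2</td><td>B</td></tr></tbody>
  <thead><tr><th>Rank</th><th>Team</th></tr></thead>
  <tbody><tr><td>3</td><td>C</td></tr></tbody>
</table>
"""

counter = SectionCounter()
counter.feed(SAMPLE_HTML)
section_counts = dict(counter.counts)
print(section_counts)
```

More than one thead or tbody inside a single table suggests the page has the same layout as the one discussed here.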

Following a related SO thread, you can unwrap those tags manually:

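import urllib.request
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup

url = "http://kenpom.com/index.php?y=2014"
html_table = urllib.request.urlopen(url).read()

# Fix the HTML: strip the <thead>/<tbody> wrappers but keep the rows inside them.
soup = BeautifulSoup(html_table, "html.parser")
# Warning: the id "ratings-table" is specific to this page.
for table in soup.find_all(attrs={"id": "ratings-table"}):
    # find_all returns a list, so unwrapping does not disturb the iteration.
    for section in table.find_all(["thead", "tbody"], recursive=False):
        section.unwrap()

# StringIO avoids the literal-HTML deprecation warning in recent pandas.
dfs = pd.read_html(StringIO(str(soup)), flavor="bs4")
len(dfs[0])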

This returns 369.
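As a cross-check on the row count that does not depend on pandas or BeautifulSoup, the rows can also be collected with the standard library alone. This is a different technique from the unwrapping approach above, shown only as a rough sketch: the RowCollector class and the sample markup are hypothetical, and dropping rows that repeat the header is an assumption about what a caller might want.

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collect the cell text of every <tr>, ignoring <thead>/<tbody> grouping."""

    def __init__(self):
        super().__init__()
        self.rows = []     # completed rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None outside <tr>
        self._cell = None  # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


# Made-up sample: the header row repeats between two bodies.
SAMPLE_HTML = """
<table id="ratings-table">
  <thead><tr><th>Rank</th><th>Team</th></tr></thead>
  <tbody><tr><td>1</td><td>A</td></tr><tr><td>2</td><td>B</td></tr></tbody>
  <thead><tr><th>Rank</th><th>Team</th></tr></thead>
  <tbody><tr><td>3</td><td>C</td></tr></tbody>
</table>
"""

collector = RowCollector()
collector.feed(SAMPLE_HTML)
header = collector.rows[0]
# Keep only rows that are not a repeat of the header.
data_rows = [row for row in collector.rows[1:] if row != header]
print(header, len(data_rows))
```

Comparing len(data_rows) with the length of the DataFrame gives a quick sanity check that no section of the table was skipped.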
