Question

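I am trying to get all the data from a Basketball Reference table (http://www.basketball-reference.com/leagues/NBA_2015_per_poss.html). When I fetch the data with XPath, it comes back as one long list. I have a "chunks" method that splits the list into multiple lists, but because the table has empty cells, the method gets thrown off and divides the list incorrectly. Is there any way to solve this?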

Answer 1

My suggestion: use pandas.DataFrame. It can load data from many sources, including HTML.

You can easily handle the empty cells with the fillna method.

Consider this example:

import pandas as pd

# read_html returns a list of DataFrames, one per matching table.
# In this case we know there is only one matching table on the page
df = pd.read_html('http://www.basketball-reference.com/leagues/NBA_2015_per_poss.html',
                  attrs={'id': 'per_poss'})[0] 

# the headers repeat every 20 lines, filtering them out
df = df[df['Rk'] != 'Rk'] 

# fill the empty cells (NaN) with 0
# fillna also accepts a dictionary mapping each column to its own
# fill value, instead of a single value for the whole frame
df = df.fillna(0)
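
To try the same cleanup steps without hitting the network, here is a minimal sketch on a small made-up DataFrame. The rows, player names, and fill values are illustrative assumptions, not data from the page; only the 'Rk' header filter and the use of fillna come from the answer above. It also shows the dictionary form of fillna mentioned in the comments.

```python
import pandas as pd

# Made-up rows standing in for the scraped table (NOT real data):
# one repeated header row and two empty cells (None is treated as missing).
raw = pd.DataFrame(
    {
        "Rk": ["1", "2", "Rk", "3"],
        "Player": ["Player A", "Player B", "Player", "Player C"],
        "Pos": ["C", None, "Pos", "PG"],
        "3P%": ["0.350", "0.412", "3P%", None],
    }
)

# Drop the repeated header row, as in the answer above.
clean = raw[raw["Rk"] != "Rk"].reset_index(drop=True)

# Convert the percentage column to numbers; missing cells stay NaN.
clean["3P%"] = pd.to_numeric(clean["3P%"])

# Use a different fill value per column (both values are arbitrary choices).
clean = clean.fillna({"Pos": "unknown", "3P%": 0.0})

print(clean)
```

Whether 0 is a sensible fill value depends on the column: for a shooting percentage with no attempts, leaving the cell as NaN may be the more honest choice.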
