Question

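Apologies if this has been answered elsewhere, but I haven't been able to find a satisfying answer here or anywhere else.

I'm fairly new to Python and pandas, and I'm having some trouble getting HTML data into a pandas DataFrame. The pandas documentation says .read_html() returns a list of DataFrame objects, so when I try to do some data manipulation to get rid of some samples, I get an error.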

Here is my code to read the HTML:

df = pd.read_html('http://espn.go.com/nhl/statistics/player/_/stat/points/sort/points/year/2015/seasontype/2', header = 1)

Then I try to clean it up:

df = df.dropna(axis=0, thresh=4)

I get the following error:

Traceback (most recent call last):
  File "module4.py", line 25, in <module>
    df = df.dropna(axis=0, thresh=4)
AttributeError: 'list' object has no attribute 'dropna'

How can I get this data into an actual DataFrame, similar to .read_csv()?

Answer 1

From http://pandas.pydata.org/pandas-docs/version/0.17.1/io.html#io-read-html: "read_html returns a list of DataFrame objects, even if there is only a single table contained in the HTML content".

So df = df[0].dropna(axis=0, thresh=4) should do what you want.
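
To make that concrete without hitting the network, here is a minimal sketch that parses a small inline table instead of the ESPN page. The HTML string, the player names and the thresh=2 value are made up for illustration, and it assumes an HTML parser backend that pandas can use (for example lxml) is installed.

```python
import io

import pandas as pd

# Hypothetical inline HTML standing in for the ESPN page (not real data).
HTML = """
<table>
  <tr><th>PLAYER</th><th>GP</th><th>PTS</th></tr>
  <tr><td>Player A</td><td>82</td><td>87</td></tr>
  <tr><td>Player B</td><td></td><td></td></tr>
  <tr><td>Player C</td><td>77</td><td>84</td></tr>
</table>
"""

# read_html returns a list, with one DataFrame per <table> it finds.
tables = pd.read_html(io.StringIO(HTML), header=0)

# Index into the list to get an actual DataFrame.
df = tables[0]

# Keep only rows with at least 2 non-missing values;
# the "Player B" row has a single non-missing value and is dropped.
df = df.dropna(axis=0, thresh=2)

print(type(tables).__name__, type(df).__name__)
print(df)
```

Keeping the list in its own variable (tables here) rather than calling it df makes it harder to forget that the result still has to be indexed.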

Answer 2

pd.read_html returns a list with one pandas DataFrame per table it finds (here the table you want is the first element), i.e.

df = pd.read_html(url) ###<-- List

df[0] ###<-- Pandas DataFrame
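
If a page turns out to contain more than one table, the list will have more than one element and df[0] may not be the one you want. As a rough sketch, the match argument of read_html can narrow the list to tables containing some text. The two-table HTML below is invented for illustration, and a parser backend such as lxml is assumed to be installed.

```python
import io

import pandas as pd

# Hypothetical page with two tables; only the second contains the text "PTS".
HTML = """
<table>
  <tr><th>TEAM</th><th>WINS</th></tr>
  <tr><td>Team X</td><td>50</td></tr>
</table>
<table>
  <tr><th>PLAYER</th><th>PTS</th></tr>
  <tr><td>Player A</td><td>87</td></tr>
</table>
"""

# Without match, every table on the page comes back in the list.
all_tables = pd.read_html(io.StringIO(HTML))
print(len(all_tables))

# match keeps only tables whose text matches the given string or regex.
scoring = pd.read_html(io.StringIO(HTML), match="PTS")

# The result is still a list, so it still needs indexing.
df = scoring[0]
print(df)
```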

Put an HTML table into a pandas DataFrame, not a list of DataFrame objects
