Fetching an HTML table with pandas read_html does not work

Time: 2019-04-10 19:42:37

Tags: python pandas dataframe beautifulsoup

What works

I managed to get data from an HTML table with pd.read_html like this:

In[1]:

import numpy as np
import pandas as pd
from tabulate import tabulate

URL = "https://coinmarketcap.com/all/views/all/"
df_in_list = pd.read_html(URL, attrs = {'id': 'currencies-all'})

# df_in_list has the df in element 0
df_raw = df_in_list[0]  
df = df_in_list[0]  

df = df[['#', 'Name', 'Symbol', 'Market Cap', 'Price' ]]

print(tabulate(df.head(), headers='keys', tablefmt='psql'))
Out[1]:

+----+-----+------------------+----------+-----------------+-----------+
|    |   # | Name             | Symbol   | Market Cap      | Price     |
|----+-----+------------------+----------+-----------------+-----------|
|  0 |   1 | BTC Bitcoin      | BTC      | $95,224,161,781 | $5398.69  |
|  1 |   2 | ETH Ethereum     | ETH      | $19,256,205,102 | $182.34   |
|  2 |   3 | XRP XRP          | XRP      | $15,031,762,618 | $0.359679 |
|  3 |   4 | LTC Litecoin     | LTC      | $5,530,275,811  | $90.24    |
|  4 |   5 | BCH Bitcoin Cash | BCH      | $5,514,209,793  | $311.17   |
+----+-----+------------------+----------+-----------------+-----------+

I found the id with Chrome DevTools (note that it sits on the table element itself):

<table class="table floating-header summary-table 
js-summary-table dataTable no-footer" 
id="currencies-all"   <!-- this is what I need -->
style="font-size: 14px; width: 100%;" role="grid">

What doesn't work

Now I am trying to get data from another URL, without success. The URL is:

https://coinmarketcap.com/currencies/bitcoin/historical-data/?start=20130428&end=20190410

The table is inside this div:

<div id="historical-data" class="tab-pane active">

My code looks like this:


In[2]:

import numpy as np
import pandas as pd
from tabulate import tabulate

URL = "https://coinmarketcap.com/currencies/bitcoin/historical-data/?start=20130428&end=20190410"
df_in_list = pd.read_html(URL, attrs = {'id': 'historical-data'})

# df_in_list has the df in element 0
df_raw = df_in_list[0]  
df = df_in_list[0]  

df = df[['#', 'Name', 'Symbol', 'Market Cap', 'Price' ]]

print(tabulate(df.head(), headers='keys', tablefmt='psql'))
Out[2]:

ValueError: No tables found

What am I missing?

Edit

Apparently, the div I am interested in has no table tag:

<div id="historical-data" class="tab-pane active">

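Is that the cause of the error?

If so, how else can I get the data inside that div?

Edit 2

I know coinmarketcap.com has an API, but I would prefer to get the data from the website.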

1 answer:

Answer 0: (score: 1)

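Yes, that is the cause: the selector is wrong. The attrs argument of pd.read_html is matched against table elements, and id="historical-data" belongs to a div, so no table matches.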

If you change the call to df_in_list = pd.read_html(URL, attrs = {'class': 'table'}), it should work.
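
To see the difference without hitting the live site, here is a minimal sketch on an inline HTML snippet. The snippet, its column names and its values are made up for illustration, and read_tables is a hypothetical helper, not part of pandas. It assumes an HTML parser that pandas can use (lxml, or bs4 with html5lib) is installed. The table's class is exactly "table" here; how a multi-valued class attribute is matched may depend on the parser backend.

```python
from io import StringIO

import pandas as pd

# Made-up stand-in for the page: a div carrying the id, wrapping a table
# that carries the class. Column names and values are illustrative only.
HTML = """
<div id="historical-data" class="tab-pane active">
  <table class="table">
    <thead><tr><th>Date</th><th>Close</th></tr></thead>
    <tbody>
      <tr><td>Apr 09, 2019</td><td>5200.0</td></tr>
      <tr><td>Apr 08, 2019</td><td>5250.5</td></tr>
    </tbody>
  </table>
</div>
"""


def read_tables(html, attrs):
    """Hypothetical helper: return matching tables, or [] if none match."""
    try:
        # StringIO avoids passing a literal HTML string to read_html.
        return pd.read_html(StringIO(html), attrs=attrs)
    except ValueError:
        # read_html raises ValueError when no table matches attrs.
        return []


# The id belongs to the div, so no table element matches it.
by_div_id = read_tables(HTML, {"id": "historical-data"})

# The class belongs to the table element, so this one matches.
by_table_class = read_tables(HTML, {"class": "table"})

print(len(by_div_id))
print(by_table_class[0])
```

Whether the live page still serves a plain table with that class is something to check in the browser's developer tools first.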

You also have to change the df = df[['#', 'Name', 'Symbol', 'Market Cap', 'Price' ]] part, because those columns are not in the new table you are scraping.
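
One way to avoid a KeyError on the column selection is to look at the columns the new table actually has and keep only the ones that exist. The sketch below uses a hand-built frame in place of df_in_list[0]; its column names and values are assumptions for illustration, not the real headers of the historical-data table.

```python
import pandas as pd

# Hand-built stand-in for df_in_list[0]; names and values are assumed.
df = pd.DataFrame(
    {
        "Date": ["Apr 09, 2019", "Apr 08, 2019"],
        "Open": [5199.8, 5289.9],
        "Close": [5204.9, 5285.1],
    }
)

# Inspect what the scraped table really contains before selecting.
print(list(df.columns))

# Keep only the wanted columns that are present in this table.
wanted = ["Date", "Close", "Market Cap"]
present = [col for col in wanted if col in df.columns]
df_small = df[present]

print(df_small.head())
```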