I managed to get data from an HTML table with pd.read_html like this:
In[1]:
import numpy as np
import pandas as pd
from tabulate import tabulate
URL = "https://coinmarketcap.com/all/views/all/"
df_in_list = pd.read_html(URL, attrs = {'id': 'currencies-all'})
# df_in_list has the df in element 0
df_raw = df_in_list[0]
df = df_in_list[0]
df = df[['#', 'Name', 'Symbol', 'Market Cap', 'Price' ]]
print(tabulate(df.head(), headers='keys', tablefmt='psql'))
Out[1]:
+----+-----+------------------+----------+-----------------+-----------+
| | # | Name | Symbol | Market Cap | Price |
|----+-----+------------------+----------+-----------------+-----------|
| 0 | 1 | BTC Bitcoin | BTC | $95,224,161,781 | $5398.69 |
| 1 | 2 | ETH Ethereum | ETH | $19,256,205,102 | $182.34 |
| 2 | 3 | XRP XRP | XRP | $15,031,762,618 | $0.359679 |
| 3 | 4 | LTC Litecoin | LTC | $5,530,275,811 | $90.24 |
| 4 | 5 | BCH Bitcoin Cash | BCH | $5,514,209,793 | $311.17 |
+----+-----+------------------+----------+-----------------+-----------+
I found the table ID with Chrome DevTools:
<table class="table floating-header summary-table
js-summary-table dataTable no-footer"
id="currencies-all" <!-- this is what I need -->
style="font-size: 14px; width: 100%;" role="grid">
Now I am trying to get data from another URL, without success. The URL is:
https://coinmarketcap.com/currencies/bitcoin/historical-data/?start=20130428&end=20190410
The table is inside this div:
<div id="historical-data" class="tab-pane active">
My code is:
In[2]:
import numpy as np
import pandas as pd
from tabulate import tabulate
URL = "https://coinmarketcap.com/currencies/bitcoin/historical-data/?start=20130428&end=20190410"
df_in_list = pd.read_html(URL, attrs = {'id': 'historical-data'})
# df_in_list has the df in element 0
df_raw = df_in_list[0]
df = df_in_list[0]
df = df[['#', 'Name', 'Symbol', 'Market Cap', 'Price' ]]
print(tabulate(df.head(), headers='keys', tablefmt='psql'))
Out[2]:
ValueError: No tables found
What am I missing?
Clearly, the div I am interested in does not carry a table tag:
<div id="historical-data" class="tab-pane active">
Is that the cause of the error?
If so, how else can I get the data inside that div?
I know coinmarketcap.com has an API, but I would rather get the data from its website.
Answer 0 (score: 1)
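Yes. The attrs filter is matched against table elements, and historical-data is the id of the enclosing div, so no table matches it.
If you change the df_in_list line to df_in_list = pd.read_html(URL, attrs = {'class': 'table'}), it should work.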
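To illustrate the difference without hitting the site, here is a minimal sketch that runs pd.read_html on an inline HTML string. The markup is a hypothetical, simplified stand-in for the real page, the cell values are placeholders, and it assumes the lxml parser is installed (flavor="lxml" is set explicitly so only that parser is tried).

```python
from io import StringIO

import pandas as pd

# Hypothetical, simplified stand-in for the page: the id sits on the div,
# while the table itself only has a class. Cell values are placeholders.
HTML = """
<div id="historical-data" class="tab-pane active">
  <table class="table">
    <thead><tr><th>Date</th><th>Open</th><th>Close</th></tr></thead>
    <tbody>
      <tr><td>day-1</td><td>100.0</td><td>110.0</td></tr>
      <tr><td>day-2</td><td>110.0</td><td>105.0</td></tr>
    </tbody>
  </table>
</div>
"""


def read_tables(html, attrs):
    """Return the tables matching attrs, or an empty list if none match."""
    try:
        return pd.read_html(StringIO(html), attrs=attrs, flavor="lxml")
    except ValueError:
        # read_html raises ValueError when no <table> matches the filter
        return []


# The id belongs to the <div>, so no <table> element matches it.
by_div_id = read_tables(HTML, {"id": "historical-data"})

# The class belongs to the <table>, so this filter finds it.
by_class = read_tables(HTML, {"class": "table"})

print(len(by_div_id), len(by_class))
```

The helper read_tables is only there for the demonstration; in the original code the equivalent change is just the attrs dictionary passed to pd.read_html.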
You will also have to change the df = df[['#', 'Name', 'Symbol', 'Market Cap', 'Price' ]] part, because those columns are not in the new table you are scraping.
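One way to adapt the column selection is to inspect df.columns first and select only names that actually exist. The sketch below uses a small hand-built DataFrame with assumed column names (Date, Open, Close, Volume) in place of the scraped table; the real page's headers may differ, so check df.columns on the actual result.

```python
import pandas as pd

# Stand-in for the scraped table; column names and values are assumptions.
df = pd.DataFrame(
    {
        "Date": ["day-1", "day-2"],
        "Open": [100.0, 110.0],
        "Close": [110.0, 105.0],
        "Volume": [1000, 1200],
    }
)

# The selection from the first snippet refers to columns of the other table.
old_columns = ["#", "Name", "Symbol", "Market Cap", "Price"]
missing = [name for name in old_columns if name not in df.columns]
print("Not in this table:", missing)

# Select only columns that are present in the new table.
new_columns = ["Date", "Open", "Close"]
df = df[new_columns]
print(df.head())
```

Selecting with the old list would raise a KeyError here, since none of those names exist in this table.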