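I am trying to learn how to scrape historical BTC data from Coinmarketcap.com using Python, requests, and BeautifulSoup.
I would like to parse the following:
1) Date
2) Close
3) Volume
4) Market Cap
Here is my code so far: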
import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent
ua = UserAgent()
header = {'user-agent': ua.chrome}
response = requests.get('https://coinmarketcap.com/currencies/bitcoin/historical-data/', headers=header)
# html.parser
soup = BeautifulSoup(response.content,'lxml')
tags = soup.find_all('td')
print(tags)
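I am able to scrape the data I need, but I am not sure how to parse it properly. I would like the dates to go back as far as possible ("All Time"). Any advice would be greatly appreciated. Thanks in advance!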
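Answer 0 (score: 2)
You can use requests and lxml for this.
Here is a function, coinmarketcap_get_btc, that takes a start date and an end date as parameters and collects the relevant data: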
import lxml.html
import pandas
import requests
def float_helper(string):
    try:
        return float(string)
    except ValueError:
        return None

def coinmarketcap_get_btc(start_date: str, end_date: str) -> pandas.DataFrame:
    # Build the url
    url = f'https://coinmarketcap.com/currencies/bitcoin/historical-data/?start={start_date}&end={end_date}'
    # Make the request and parse the tree
    response = requests.get(url, timeout=5)
    tree = lxml.html.fromstring(response.text)
    # Extract table and raw data
    table = tree.find_class('table-responsive')[0]
    raw_data = [_.text_content() for _ in table.find_class('text-right')]
    # Process the data
    col_names = ['Date'] + raw_data[:6]
    row_list = []
    for x in raw_data[6:]:
        _, date, _open, _high, _low, _close, _vol, _m_cap, _ = x.replace(',', '').split('\n')
        row_list.append([date, float_helper(_open), float_helper(_high), float_helper(_low),
                         float_helper(_close), float_helper(_vol), float_helper(_m_cap)])
    return pandas.DataFrame(data=row_list, columns=col_names)
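To make the row-processing step easier to follow, here is a minimal sketch of what the split and float_helper do to a single row. The raw_row string is made up for illustration (it is not real Coinmarketcap data), and the '-' placeholder for a missing volume is an assumption; the starred unpacking is just a compact equivalent of the nine-name unpacking used in the function.

```python
def float_helper(string):
    # Same helper as in the answer: non-numeric cells become None.
    try:
        return float(string)
    except ValueError:
        return None

# Hypothetical row text (made-up values), shaped the way the split expects:
# nine newline-separated pieces, with the first and last piece empty.
raw_row = "\nJan 01 2020\n1,000.50\n1,100.00\n990.25\n1,050.75\n-\n2,000,000\n"

# Drop thousands separators, split on newlines, discard the empty ends.
_, date, *numbers, _ = raw_row.replace(",", "").split("\n")
parsed = [date] + [float_helper(n) for n in numbers]
print(parsed)
```

The None produced for the non-numeric cell is what shows up as NaN once the rows are loaded into a DataFrame.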
You can always ignore the columns you do not need and add further functionality (for example, accepting datetime.datetime objects as dates).
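As one possible way to accept datetime.datetime objects, a small conversion helper could be called on both arguments before the URL is built. The name to_date_string is hypothetical and not part of the answer; this is only a sketch that assumes the site expects dates as YYYYMMDD strings, as in the URL above.

```python
from datetime import date, datetime
from typing import Union

def to_date_string(value: Union[str, date]) -> str:
    """Return `value` as 'YYYYMMDD'; strings are passed through unchanged."""
    if isinstance(value, str):
        return value
    # datetime is a subclass of date, so both are handled by strftime.
    return value.strftime("%Y%m%d")

print(to_date_string(datetime(2013, 4, 28)))
print(to_date_string("20191020"))
```

Inside coinmarketcap_get_btc, one would then write something like start_date = to_date_string(start_date) before formatting the URL.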
Note that the f-string used to build the URL requires Python 3.6 or later, so if you are on an older version, just use either the 'string{var}'.format(var=var) or the 'string%s' % var notation instead.
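For reference, the three notations mentioned above produce the same URL. The variable names below are illustrative only.

```python
start_date, end_date = "20130428", "20191020"
base = "https://coinmarketcap.com/currencies/bitcoin/historical-data/"

# f-string (Python 3.6+)
url_fstring = f"{base}?start={start_date}&end={end_date}"
# str.format
url_format = "{}?start={start}&end={end}".format(base, start=start_date, end=end_date)
# %-formatting
url_percent = "%s?start=%s&end=%s" % (base, start_date, end_date)

print(url_fstring == url_format == url_percent)
```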
Example
df = coinmarketcap_get_btc(start_date='20130428', end_date='20191020')
df
# Date Open* High Low Close** Volume Market Cap
# 0 Oct 19 2019 7973.80 8082.63 7944.78 7988.56 1.379783e+10 1.438082e+11
# 1 Oct 18 2019 8100.93 8138.41 7902.16 7973.21 1.565159e+10 1.435176e+11
# 2 Oct 17 2019 8047.81 8134.83 8000.94 8103.91 1.431305e+10 1.458540e+11
# 3 Oct 16 2019 8204.67 8216.81 7985.09 8047.53 1.607165e+10 1.448240e+11
# 4 Oct 15 2019 8373.46 8410.71 8182.71 8205.37 1.522041e+10 1.476501e+11
# ... ... ... ... ... ... ... ...
# 2361 May 02 2013 116.38 125.60 92.28 105.21 NaN 1.168517e+09
# 2362 May 01 2013 139.00 139.89 107.72 116.99 NaN 1.298955e+09
# 2363 Apr 30 2013 144.00 146.93 134.05 139.00 NaN 1.542813e+09
# 2364 Apr 29 2013 134.44 147.49 134.00 144.54 NaN 1.603769e+09
# 2365 Apr 28 2013 135.30 135.98 132.10 134.21 NaN 1.488567e+09
#
# [2366 rows x 7 columns]
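The Date column comes back as plain text such as 'Oct 19 2019'. If real date objects are needed (for sorting or filtering, say), a sketch along these lines may help. It uses three date strings taken from the sample output above and assumes the default English month abbreviations for %b.

```python
from datetime import datetime

# Date strings copied from the sample output above.
date_strings = ["Oct 19 2019", "May 02 2013", "Apr 28 2013"]

# %b = abbreviated month name, %d = day of month, %Y = four-digit year.
parsed = [datetime.strptime(s, "%b %d %Y").date() for s in date_strings]

print(min(parsed).isoformat())  # oldest date in the list
```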
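Answer 1 (score: 2)
Here is one way to get the fields mentioned above from the table using the BeautifulSoup library. I used .select() instead of .find_all() to locate the required items.
Working solution: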
import pandas
import requests
from bs4 import BeautifulSoup
link = 'https://coinmarketcap.com/currencies/bitcoin/historical-data/?start={}&end={}'
def get_coinmarketcap_info(url,s_date,e_date):
    response = requests.get(url.format(s_date,e_date))
    soup = BeautifulSoup(response.text,"lxml")
    for items in soup.select("table.table tr.text-right"):
        date = items.select_one("td.text-left").get_text(strip=True)
        close = items.select_one("td[data-format-market-cap]").find_previous_sibling().get_text(strip=True)
        volume = items.select_one("td[data-format-market-cap]").get_text(strip=True)
        marketcap = items.select_one("td[data-format-market-cap]").find_next_sibling().get_text(strip=True)
        yield date,close,volume,marketcap

if __name__ == '__main__':
    dataframe = (elem for elem in get_coinmarketcap_info(link,s_date='20130428',e_date='20191020'))
    df = pandas.DataFrame(dataframe)
    print(df)
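Answer 2 (score: 1)
You can use a function that takes the number of months to go back (you can change the unit, but months make a good example), then use pandas read_html to grab the table and select a subset of its columns. It is currently set up so that the range ends today.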
import requests
import pandas as pd
from datetime import datetime
from dateutil.relativedelta import relativedelta
def get_date_range(number_of_months:int):
    now = datetime.now()
    dt_end = now.strftime("%Y%m%d")
    dt_start = (now - relativedelta(months=number_of_months)).strftime("%Y%m%d")
    return f'start={dt_start}&end={dt_end}'
number_of_months = 3
table = pd.read_html(f'https://coinmarketcap.com/currencies/bitcoin/historical-data/?{get_date_range(number_of_months)}')[0]
table = table[['Date', 'Close**', 'Volume','Market Cap']]
print(table)
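As noted, the unit does not have to be months. Below is a sketch of a days-based variant that needs only the standard library (timedelta instead of dateutil's relativedelta). The function name get_date_range_days and the optional now parameter are hypothetical additions; the latter is there so the result can be reproduced with a fixed reference date.

```python
from datetime import datetime, timedelta
from typing import Optional

def get_date_range_days(number_of_days: int, now: Optional[datetime] = None) -> str:
    """Return a 'start=YYYYMMDD&end=YYYYMMDD' query string ending at `now`."""
    if now is None:
        now = datetime.now()
    dt_end = now.strftime("%Y%m%d")
    dt_start = (now - timedelta(days=number_of_days)).strftime("%Y%m%d")
    return f"start={dt_start}&end={dt_end}"

# Fixed reference date so the output does not depend on when this is run.
print(get_date_range_days(90, now=datetime(2019, 10, 20)))
# start=20190722&end=20191020
```

The returned string can be appended to the URL after the '?' in the same way as the output of get_date_range.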