Web Scraping BeautifulSoup - CoinMarketCap.com historical data (finding the earliest date)

Date: 2018-07-31 09:09:47

Tags: python web-scraping beautifulsoup

I am trying to scrape historical token data from coinmarketcap.com and have run into a problem. I would appreciate some help.

For example, I want to access all available price data for a token:

https://coinmarketcap.com/currencies/tether/historical-data/?start=20130428&end=20180731

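The problem is: how do I get the earliest start date for each token? I found that the earliest start date appears only inside a JavaScript function, not in an HTML tag.

Here is a snapshot of the earliest start date under 'All Time', which sits inside the JavaScript: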

 ranges: {
        'Last 7 Days': [moment.utc().subtract(6, 'days'), moment.utc()],
        'Last 30 Days': [moment.utc().subtract(30, 'days'), moment.utc()],
        'Last 3 Months': [moment.utc().subtract(3, 'months'), moment.utc()],
        'Last 12 Months': [moment.utc().subtract(12, 'months'), moment.utc()],
        'Year To Date': [moment.utc().startOf('year'), moment.utc()],
        'All Time': ["04-28-2013", moment.utc()]

Does anyone know how to do this? Many thanks.

2 answers:

Answer 0: (score: 0)

You need to use Python's date and time functions to build the date ranges you want. Something like this:

from bs4 import BeautifulSoup
import requests
import datetime

UTC = datetime.timezone.utc
td = datetime.timedelta


def now():
    # Current time as a timezone-aware UTC datetime
    return datetime.datetime.now(UTC)

def start_of_year():
    return datetime.datetime(now().year, 1, 1, tzinfo=UTC)

def get_date(d):
    # Parse the 'MM-DD-YYYY' format used in the site's JavaScript
    return datetime.datetime.strptime(d, '%m-%d-%Y').replace(tzinfo=UTC)

def get_url(start, end):
    start = start.strftime("%Y%m%d")
    end = end.strftime("%Y%m%d")
    return 'https://coinmarketcap.com/currencies/tether/historical-data/?start={}&end={}'.format(start, end)


ranges = [
    (now() - td(days=6), now()),    # last 7 days
    (now() - td(days=30), now()),   # last 30 days
    (now() - td(days=3 * 365 // 12), now()),   # last 3 months (about 91 days)
    (now() - td(days=365), now()),  # last 12 months
    (start_of_year(), now()),   # year to date
    (get_date('04-28-2013'), now()) # from 04-28-2013 to now
]

for start, end in ranges:
    response = requests.get(get_url(start, end))
    soup = BeautifulSoup(response.text, 'lxml')
    data_table = soup.select_one('table.table tbody')
    if data_table is None:
        # No table found (layout changed or request blocked); skip this range
        continue
    for row in data_table.select('tr'):
        print([cell['data-format-value'] for cell in row.select('td[data-format-value]')])
    print('-' * 80)

This iterates over all the date ranges and prints the values for each one:

['0.999317', '1.00715', '0.993987', '0.99891', '4311410000.0', '2505427944.64']
['0.996402978897', '1.00161004066', '0.995105028152', '0.998919010162', '2404489984.0', '2498121984.0']
['0.995606005192', '1.00268995762', '0.995606005192', '0.997887015343', '2298899968.0', '2496123904.0']
['0.999553024769', '1.0043900013', '0.99117898941', '0.997117996216', '3092310016.0', '2506019840.0']
['0.996739983559', '1.00374996662', '0.99286699295', '0.997950971127', '2894500096.0', '2498967040.0']
['0.992371976376', '1.0020699501', '0.985008001328', '0.998790979385', '3514560000.0', '2488015872.0']
--------------------------------------------------------------------------------

... and so on
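
Each printed row is a list of strings, so to do anything numeric with it you would convert the values first. A minimal sketch follows. The field names (open, high, low, close, volume, market cap) are my assumption about the column order of the table and are not confirmed by the output above, and `DailyQuote` and `parse_row` are hypothetical helper names.

```python
from collections import namedtuple

# Assumed column order of the historical-data table; adjust if the site differs.
DailyQuote = namedtuple("DailyQuote", "open high low close volume market_cap")

def to_float(value):
    """Convert one scraped string to float, or None if it is not numeric."""
    try:
        return float(value)
    except ValueError:
        return None

def parse_row(values):
    """Turn one scraped row (six strings) into a DailyQuote of floats."""
    return DailyQuote(*(to_float(v) for v in values))

row = ['0.999317', '1.00715', '0.993987', '0.99891', '4311410000.0', '2505427944.64']
quote = parse_row(row)
print(quote.close)
```

Returning None for a value that does not parse keeps one bad cell from aborting the whole loop; whether that is the right policy depends on what you do with the data afterwards.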

Answer 1: (score: 0)

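The website itself defines 04-28-2013 as the earliest date, which means it does not show records from before that date.

Fortunately, you need not give the exact earliest date to get the historical data.

For example, although the actual earliest date for DigixDAO is April 18, 2016 (20160418), the default earliest date (20130428) still returns the correct results.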
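
If you still want to read that date from the page rather than hard-code it, a regular expression over the page source is probably enough. This is only a sketch: it assumes the 'All Time' entry appears in the HTML exactly as in the snippet from the question, and `ALL_TIME_RE` and `earliest_date` are names I made up for illustration.

```python
import re
from datetime import datetime

# Matches the start of:  'All Time': ["04-28-2013", moment.utc()]
ALL_TIME_RE = re.compile(r"""['"]All Time['"]\s*:\s*\[\s*['"](\d{2}-\d{2}-\d{4})['"]""")

def earliest_date(page_source):
    """Return the 'All Time' start date found in the page source, or None."""
    match = ALL_TIME_RE.search(page_source)
    if match is None:
        return None
    # The site writes the date as MM-DD-YYYY
    return datetime.strptime(match.group(1), "%m-%d-%Y").date()

# Sample text copied from the question; in practice pass requests.get(url).text
sample = """ranges: {
    'Year To Date': [moment.utc().startOf('year'), moment.utc()],
    'All Time': ["04-28-2013", moment.utc()]
"""
print(earliest_date(sample))  # 2013-04-28
```

This only works while the date is present in the raw HTML; if the site moves it into an external script, the function returns None.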

https://coinmarketcap.com/currencies/digixdao/historical-data/?start=20130428&end=20180731

So the answer is: don't worry about the earliest start date, just use the default.

Cheers!
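
In code, that could look something like the sketch below, which builds the URL for any token using the default start date. `historical_url` and `DEFAULT_START` are hypothetical names, and the URL pattern is simply the one from the links in this thread.

```python
from datetime import date, datetime, timezone

# Site-wide default start date, taken from the 'All Time' range in the question
DEFAULT_START = date(2013, 4, 28)

def historical_url(slug, end=None, start=DEFAULT_START):
    """Build the historical-data URL for a token slug such as 'tether'."""
    if end is None:
        end = datetime.now(timezone.utc).date()  # today in UTC
    return ("https://coinmarketcap.com/currencies/{}/historical-data/"
            "?start={:%Y%m%d}&end={:%Y%m%d}").format(slug, start, end)

print(historical_url("digixdao", end=date(2018, 7, 31)))
# https://coinmarketcap.com/currencies/digixdao/historical-data/?start=20130428&end=20180731
```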