Web scraping with BeautifulSoup and json

Time: 2021-06-27 16:34:25

Tags: python json web-scraping beautifulsoup

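I am trying to build a web scraper that pulls historical cryptocurrency price data, but when I try to print the data, nothing is read into the output. Here is the code: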

    # Libraries
    import requests
    from bs4 import BeautifulSoup
    import json
    import time
    import pandas as pd

    coins = {}

    cm = requests.get('https://coinmarketcap.com/')
    soup = BeautifulSoup(cm.content, 'html.parser')

    data = soup.find('script', id="__NEXT_DATA__", type="application/json")

    coin_data = json.loads(data.contents[0])
    listings = coin_data['props']['initialState']['cryptocurrency']['listingLatest']['data']

    for i in listings:
        coins[str(i['id'])] = i['slug']

    for i in coins:
        page = requests.get(f'https://coinmarketcap.com/currencies/{coins[i]}/historical-data/?20210101&20210627')

    soup = BeautifulSoup(page.content, 'html.parser')
    data = soup.find('script', id="__NEXT_DATA__", type="application/json")
    historical_data = json.loads(data.contents[0])

    print(data.cardano)

2 answers:

Answer 0: (score: 1)

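If you open the page in a browser and log the browser's network traffic while viewing the historical data, you will see an HTTP GET request to a REST API that serves JSON containing all the information you are likely to want. All you need to do is imitate that request; neither BeautifulSoup nor Pandas is required: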

def get_historical_data(currency_id, start, end):
    import requests

    url = "https://api.coinmarketcap.com/data-api/v3/cryptocurrency/historical"

    params = {
        "id": currency_id,
        "convertId": "2781", # seems to be USD
        "timeStart": start,
        "timeEnd": end

    }

    headers = {
        "accept": "application/json",
        "accept-encoding": "gzip, deflate",
        "user-agent": "Mozilla/5.0"
    }

    response = requests.get(url, params=params, headers=headers)
    response.raise_for_status()

    for quote in response.json()["data"]["quotes"]:
        yield quote["timeClose"], quote["quote"]["close"]

def main():

    from datetime import datetime

    start = str(round(datetime(2021, 1, 1).timestamp()))
    end = str(round(datetime.now().timestamp()))

    currency_ids = {
        "BTC": "1"
    }

    for time_close, close in get_historical_data(currency_ids["BTC"], start, end):
        print("[{}]: {}".format(time_close, close))

    return 0


if __name__ == "__main__":
    import sys
    sys.exit(main())

Output:

[2021-01-02T23:59:59.999Z]: 32127.27
[2021-01-03T23:59:59.999Z]: 32782.02
[2021-01-04T23:59:59.999Z]: 31971.91
[2021-01-05T23:59:59.999Z]: 33992.43
[2021-01-06T23:59:59.999Z]: 36824.36
[2021-01-07T23:59:59.999Z]: 39371.04
[2021-01-08T23:59:59.999Z]: 40797.61
[2021-01-09T23:59:59.999Z]: 40254.55
[2021-01-10T23:59:59.999Z]: 38356.44
[2021-01-11T23:59:59.999Z]: 35566.66
[2021-01-12T23:59:59.999Z]: 33922.96
...
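
As a small follow-up sketch, the parsing step can be pulled out into a plain function so it can be checked without touching the network. The helper names `to_epoch_seconds` and `extract_closes` are hypothetical, and the `sample` payload is hand-written to mirror the first two rows printed above; it assumes the response keeps the shape the code above relies on (`data` → `quotes` → `timeClose` and `quote.close`). It also builds the start timestamp from an explicit UTC datetime, since a naive `datetime(2021, 1, 1).timestamp()` is interpreted in the machine's local time zone.

```python
from datetime import datetime, timezone


def to_epoch_seconds(year, month, day):
    """Return midnight UTC of the given date as a string of whole seconds."""
    moment = datetime(year, month, day, tzinfo=timezone.utc)
    return str(int(moment.timestamp()))


def extract_closes(payload):
    """Pull (timeClose, close) pairs out of a decoded JSON response body.

    Assumes the shape used in the answer's code:
    payload["data"]["quotes"][n]["timeClose"] and [n]["quote"]["close"].
    A payload without those keys yields an empty list.
    """
    quotes = payload.get("data", {}).get("quotes", [])
    return [(q["timeClose"], q["quote"]["close"]) for q in quotes]


# Hypothetical sample payload, hand-written for illustration only.
sample = {
    "data": {
        "quotes": [
            {"timeClose": "2021-01-02T23:59:59.999Z", "quote": {"close": 32127.27}},
            {"timeClose": "2021-01-03T23:59:59.999Z", "quote": {"close": 32782.02}},
        ]
    }
}

for time_close, close in extract_closes(sample):
    print("[{}]: {}".format(time_close, close))
```

In the real script, `extract_closes(response.json())` would take the place of the loop inside `get_historical_data`, and `to_epoch_seconds(2021, 1, 1)` would supply `timeStart`.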

Answer 1: (score: 0)

Helm Release