Question

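I'm having trouble scraping some data from the web with Beautiful Soup, and I was wondering whether any scraping experts here could give me some guidance.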

This is the exact page I want to scrape: https://coinmarketcap.com/currencies/bitcoin/historical-data/?start=20130428&end=20171013

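Specifically, I want to grab the historical price table and somehow get the information into a DataFrame. But first I need to find the table in the raw HTML.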

import requests
from bs4 import BeautifulSoup

data = requests.get('https://coinmarketcap.com/currencies/bitcoin/historical-data/?start=20130428&end=20171013')

# use the public .content attribute (raw bytes) rather than the private ._content
soup = BeautifulSoup(data.content, 'html.parser')

Unfortunately, I get an encoding error:

UnicodeEncodeError: 'ascii' codec can't encode character '\xa0' in position 22075: ordinal not in range(128)

Is there a way to simply strip out all the characters that can't be encoded before passing the raw HTML to Beautiful Soup?

Answer 1

BeautifulSoup(data.content.decode('utf-8'), 'html.parser')

Try decoding as UTF-8 first.

If there are still problems, you can tell the decoder to ignore errors:

BeautifulSoup(data.content.decode('utf-8', 'ignore'), 'html.parser')
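
Note that a UnicodeEncodeError is raised when text is encoded to bytes (for example when printing to a console that uses ASCII), not when bytes are decoded. The following is a minimal sketch, assuming that is what happens here: it decodes some sample bytes and then makes the text ASCII-safe before output. The sample bytes and the helper name `to_ascii_safe` are made up for illustration and are not part of the original question.

```python
# Sample bytes standing in for the downloaded page (an assumption for this sketch).
# b"\xc2\xa0" is the UTF-8 encoding of U+00A0, the non-breaking space from the traceback.
raw = b"Market\xc2\xa0Cap: 80,256,700,000"

# Decoding turns the bytes into text; the non-breaking space survives as '\xa0'.
text = raw.decode("utf-8")


def to_ascii_safe(s):
    """Hypothetical helper: replace non-breaking spaces with plain spaces,
    then drop any remaining characters that ASCII cannot represent."""
    return s.replace("\xa0", " ").encode("ascii", "ignore").decode("ascii")


safe = to_ascii_safe(text)
print(safe)
```

Dropping characters loses information, so this is only reasonable when the output really must be ASCII; otherwise writing the text out as UTF-8 keeps everything.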

Answer 2

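This doesn't directly answer your question, and I don't know how to set lxml as the parser in BS, but I did manage to find and extract the data. Note that there is a way to use lxml within BS; I used lxml directly instead of going through BS.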

from lxml import html
import requests


data = requests.get('https://coinmarketcap.com/currencies/bitcoin/historical-data/?start=20130428&end=20171013').content
tree = html.fromstring(data)
# note I did not want to sort out the logic to find the table so I cheated
# and selected the table with a specific data value
mytable = tree.xpath('//td[contains(.,"4829.58")]/ancestor::table')[0]
for e in mytable.iter(tag='tr'):
    # print the repr so the whitespace in each row is visible
    print(repr(e.text_content()))

    '\n                        Date\n                        Open\n                        High\n                        Low\n                        Close\n                        Volume\n                        Market Cap\n                    '
    '\n                        Oct 12, 2017\n                        4829.58\n                        5446.91\n                        4822.00\n                        5446.91\n                        2,791,610,000\n                        80,256,700,000\n                        '

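I think the Unicode problem is higher up in the tree (in some element other than the table you are looking for), so I had no trouble getting the data out of the table or writing the results to a file.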
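
The row strings above still need to be split into cells before they can go into a DataFrame. Here is a rough sketch of one way to do that with only the standard library; the helper names `parse_row` and `to_number` are hypothetical, and the two sample strings are shortened copies of the output shown above.

```python
def parse_row(row_text):
    """Hypothetical helper: split one row's text_content() into stripped cells."""
    return [cell.strip() for cell in row_text.split("\n") if cell.strip()]


def to_number(cell):
    """Hypothetical helper: turn '2,791,610,000' or '4829.58' into a float."""
    return float(cell.replace(",", ""))


# Shortened copies of the header row and first data row from the output above.
header_text = "\n  Date\n  Open\n  High\n  Low\n  Close\n  Volume\n  Market Cap\n  "
row_text = ("\n  Oct 12, 2017\n  4829.58\n  5446.91\n  4822.00\n  5446.91"
            "\n  2,791,610,000\n  80,256,700,000\n  ")

header = parse_row(header_text)
cells = parse_row(row_text)

# Keep the date as a string and convert every other column to a number.
record = {"Date": cells[0]}
for name, cell in zip(header[1:], cells[1:]):
    record[name] = to_number(cell)

print(record)
```

A list of such dicts, one per row, can then be passed to `pandas.DataFrame(...)` to build the DataFrame the question asks for.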

Answer 3

Here is a solution that worked for me (this example is for Python 3):

import json
import urllib.request

# use urllib to get HTML data
url = "https://coinmarketcap.com/historical/20201206/"
contents = urllib.request.urlopen(url)
bytes_str = contents.read()

# decode bytes string
data_str = bytes_str.decode("utf-8")

# crop the raw JSON string out of the website HTML
start_str = '"listingHistorical":{"data":'
start = data_str.find(start_str)+len(start_str)
end = data_str.find(',"page":1,"sort":""')
cropped_str = data_str[start:end]

# create a Python list from JSON string
data_list = json.loads(cropped_str)
print("total cryptos:", len(data_list))

# iterate over the list of crypto dicts
for i, item in enumerate(data_list):

    # pretty print all cryptos with a high rank
    if item["cmc_rank"] < 30:
        print(json.dumps(item, indent=4))

To get data for other dates, just replace the 20201206 part of the URL with the date you want (for example, use 20210110 for January 10, 2021).
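
As a small sketch of that date substitution, the URL can be built from a `datetime.date` instead of being edited by hand. The helper name `historical_url` is hypothetical; the base URL is the one used in the code above.

```python
from datetime import date

# Base of the URL used in the answer's code; the date part is appended as YYYYMMDD.
BASE_URL = "https://coinmarketcap.com/historical/"


def historical_url(day):
    """Hypothetical helper: build the historical snapshot URL for a given date."""
    return f"{BASE_URL}{day.strftime('%Y%m%d')}/"


print(historical_url(date(2021, 1, 10)))
```

Using `strftime('%Y%m%d')` keeps the zero padding for single-digit months and days, which is easy to get wrong when typing the date manually.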

Scraping historical Bitcoin data from coinmarketcap.com
