Question

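I am trying to collect the daily total market capitalization of cryptocurrencies for more than 2,000 days from coinmarketcap.com (from January 1, 2012 to today). Each day's total market cap is on a separate web page, so I used BeautifulSoup in Python to scrape all of these pages. However, my requests seem to get blocked because I am querying too fast and too frequently. Is there a way to scrape these pages without being blocked? My scraping code is below: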

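import csv
from datetime import datetime

import pandas as pd
import requests
from bs4 import BeautifulSoup

print("Collecting Total Market Capitalizations...")
# One "YYYY-MM-DD" string per day from the start date up to today.
all_dates = [x.strftime("%Y-%m-%d") for x in pd.date_range(start="2020-01-01", end=datetime.now().strftime("%Y-%m-%d"))]
output_name = "data/marketcap.csv"
content = [["Date", "TotalMarketCap"]]
for d in all_dates:
    print("We are at " + d, end="\r")
    # Each day's snapshot lives on its own page, e.g. /historical/20200101
    url = "https://coinmarketcap.com/historical/" + d.replace("-", "")
    r = requests.get(url)
    # Name the parser explicitly to avoid the "no parser specified" warning.
    soup = BeautifulSoup(r.content, "html.parser")
    # The total market cap is read from the first <strong> element on the page.
    row = soup.find_all('strong')
    if len(row) > 0:
        row = row[0]
        curr_cap = row.getText().split("$")[1].replace(",", "")
        content.append([str(d), str(curr_cap)])
print("")
with open(output_name, 'w', newline='') as csvfile:
    csvwriter = csv.writer(csvfile)
    csvwriter.writerows(content)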

Thanks!

Answer 1

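You can do one of the following two things:

Use a different IP address for each request, so the website does not realize that the same client is trying to access the same resource.
Make your code sleep/pause between two consecutive page fetches (if you have enough time).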
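
The second option could look roughly like the sketch below. This is only a minimal illustration, not a tested fix for coinmarketcap.com: `fetch_politely` is a hypothetical helper name, the delay values are guesses, and I am assuming the site signals throttling with HTTP 429 (Too Many Requests), which you should verify against the responses you actually get.

```python
import random
import time


def fetch_politely(fetch, url, min_delay=2.0, max_delay=5.0,
                   max_retries=3, sleep=time.sleep):
    """Call fetch(url), pausing before every attempt.

    fetch is any callable that returns an object with a .status_code
    attribute (requests.get fits). The delays and the 429 check are
    assumptions, not values documented by the site.
    """
    response = None
    for attempt in range(max_retries):
        # Random pause so requests are not evenly spaced; the pause
        # doubles on each retry (simple exponential backoff).
        sleep(random.uniform(min_delay, max_delay) * 2 ** attempt)
        response = fetch(url)
        # Assumed throttling signal: anything other than 429 is returned as-is.
        if response.status_code != 429:
            break
    return response
```

In the loop from the question, `r = requests.get(url)` would then become `r = fetch_politely(requests.get, url)`. For the first option, `requests.get` accepts a `proxies` argument, so if you have proxies available you could pass something like `functools.partial(requests.get, proxies=...)` as `fetch` instead.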

Scraping multiple web pages with BeautifulSoup without getting blocked
