How to scrape data from a chart on https://bitinfocharts.com

Asked: 2019-12-18 15:30:49

Tags: python web-scraping graph charts

I want to scrape the data from a tweet volume chart on https://bitinfocharts.com into some kind of data file, using Python or R. I am new to Python and not sure how to go about it. I have looked at other questions on this forum, but I couldn't get anything to work.

The chart I am interested in is this one: https://bitinfocharts.com/comparison/decred-tweets.html#1y

I am looking for a table with each date and the corresponding number of tweets for that day as columns.

Any help would be much appreciated.

1 Answer:

Answer 0 (score: 0)

There may be a more elegant solution, but the data is embedded in a script tag. You just need to pull it out and parse it into a table:

import requests 
from bs4 import BeautifulSoup
import pandas as pd
import re


def parse_strlist(sl):
    # Strip brackets, commas and whitespace, then split on the quote
    # characters and drop the empty strings that are left over.
    clean = re.sub(r"[\[\],\s]", "", sl)
    splitted = re.split(r"['\"]", clean)
    values_only = [s for s in splitted if s != '']
    return values_only
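# For illustration, on a fragment like the following (the values are
# hypothetical):
#   parse_strlist('[["2018/01/08",69],["2018/01/09",200]]')
# the function returns ['2018/01/08', '69', '2018/01/09', '200'], i.e.
# dates and counts alternate, which is what the positional split below
# relies on.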


url = 'https://bitinfocharts.com/comparison/decred-tweets.html#1y'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# The chart data is embedded in the <script> tag that builds the Dygraph,
# as an array literal of [new Date("YYYY/MM/DD"), value] pairs.
scripts = soup.find_all('script')
for script in scripts:
    if 'd = new Dygraph(document.getElementById("container")' in script.text:
        StrList = script.text
        StrList = '[[' + StrList.split('[[')[-1]   # keep everything from the data array onwards
        StrList = StrList.split(']]')[0] + ']]'    # ...and cut everything after it
        StrList = StrList.replace("new Date(", '').replace(')', '')
        dataList = parse_strlist(StrList)

# Split the alternating date/count sequence by position. enumerate is used
# rather than dataList.index(each): index() returns the first occurrence,
# so repeated tweet counts would land in the wrong column.
date = []
tweet = []
for i, each in enumerate(dataList):
    if i % 2 == 0:
        date.append(each)
    else:
        tweet.append(each)

df = pd.DataFrame(list(zip(date, tweet)), columns=["Date", "Decred - Tweets"])

Output:

print (df)
           Date Decred - Tweets
0    2018/01/08              69
1    2018/01/09             200
2    2018/01/10             163
3    2018/01/11             210
4    2018/01/12             256
5    2018/01/13             185
6    2018/01/14             147
7    2018/01/15             119
8    2018/01/16             169
9    2018/01/17             176
10   2018/01/18             209
11   2018/01/19             179
12   2018/01/20             274
13   2018/01/21             124
14   2018/01/22             185
15   2018/01/23             110
16   2018/01/24             109
17   2018/01/25              86
18   2018/01/26              49
19   2018/01/27            null
20   2018/01/28            null
21   2018/01/29            null
22   2018/01/30            null
23   2018/01/31             194
24   2018/02/01             197
25   2018/02/02             163
26   2018/02/03              73
27   2018/02/04              98
28   2018/02/05             210
29   2018/02/06             215
..          ...             ...
680  2019/11/19              58
681  2019/11/20              67
682  2019/11/21              72
683  2019/11/22              79
684  2019/11/23              46
685  2019/11/24              38
686  2019/11/25              81
687  2019/11/26              57
688  2019/11/27              54
689  2019/11/28              60
690  2019/11/29              55
691  2019/11/30              40
692  2019/12/01              39
693  2019/12/02              71
694  2019/12/03              93
695  2019/12/04              44
696  2019/12/05              41
697  2019/12/06              34
698  2019/12/07              40
699  2019/12/08              44
700  2019/12/09              47
701  2019/12/10              47
702  2019/12/11              64
703  2019/12/12              61
704  2019/12/13              67
705  2019/12/14              93
706  2019/12/15              59
707  2019/12/16              86
708  2019/12/17              82
709  2019/12/18              51

[710 rows x 2 columns]
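
Since the question asks for a data file, one possible follow-up is to clean up the types and write the result out. The counts come back as strings, with missing days as the literal string "null". A minimal sketch, assuming the df built above (the output filename is just an example):

# Convert types: dates to datetime, counts to numbers ("null" becomes NaN).
df["Date"] = pd.to_datetime(df["Date"])
df["Decred - Tweets"] = pd.to_numeric(df["Decred - Tweets"], errors="coerce")

# Save the table to a CSV file.
df.to_csv("decred_tweets.csv", index=False)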