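I need to get the data points from a website that publishes aggregated polling numbers. The data is shown in an interactive chart. How can I retrieve all the data points (date:number pairs) for each candidate? I tried inspecting the page source, but I could not find the data file it points to. A solution in either Python or R would be fine. Thank you very much for your help.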
Answer 0 (score: 1)
As mentioned above, find the API call in your browser's developer tools. Then fetch the response and reshape it as needed:
import json
import time

import pandas as pd
import requests

# Millisecond timestamp sent as a bare query key, as in the request seen in dev tools
timestamp = str(int(time.time() * 1000.0))
url = 'https://www.realclearpolitics.com/epolls/json/6730_historical.js'
headers = {
    'user-agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Mobile Safari/537.36'}
payload = {
    timestamp: '',
    'callback': 'return_json'}

# The response is JSONP, i.e. return_json({...}); strip the wrapper before parsing
jsonStr = requests.get(url, headers=headers, params=payload).text
jsonData = json.loads(jsonStr.split('(', 1)[-1].rsplit(')', 1)[0])

# Each rcp_avg entry holds one date and a list of candidates; flatten to one row per pair
df = pd.DataFrame(jsonData['poll']['rcp_avg'])
frames = []
for idx, row in df.iterrows():
    temp_df = pd.DataFrame(row['candidate'])
    temp_df['date'] = row['date']
    frames.append(temp_df)

# DataFrame.append was removed in pandas 2.0, so concatenate once at the end
results = pd.concat(frames, sort=True, ignore_index=True)
Output:
print(results)
      affiliation    color                 date  ...       name status  value
0                  #009900  2019-11-28 06:00:00  ...      Biden      1   27.0
1                  #457fff  2019-11-28 06:00:00  ...    Sanders      1   18.3
2                  #996600  2019-11-28 06:00:00  ...     Warren      1   15.8
3                  #990099  2019-11-28 06:00:00  ...  Buttigieg      1   11.0
4                  #ff9900  2019-11-28 06:00:00  ...     Harris      1    3.8
5                  #3da882  2019-11-28 06:00:00  ...       Yang      1    3.3
6                  #f2dc0f  2019-11-28 06:00:00  ...  Bloomberg      1    2.5
7                  #000000  2019-11-28 06:00:00  ...  Klobuchar      1    2.2
8                  #66ccff  2019-11-28 06:00:00  ...     Booker      1    1.8
9                  #666666  2019-11-28 06:00:00  ...     Steyer      1    1.7
10                 #ff0074  2019-11-28 06:00:00  ...    Gabbard      1    1.3
11                 #cc9900  2019-11-28 06:00:00  ...     Castro      1    1.2
12                 #9966ff  2019-11-28 06:00:00  ...     Bennet      1    0.6
13                 #10671b  2019-11-28 06:00:00  ...    Bullock      3    0.4
14                 #990000  2019-11-28 06:00:00  ...    Patrick      3    0.4
15                 #6672ff  2019-11-28 06:00:00  ...     Sestak      3    0.3
16                 #009900  2019-11-27 06:00:00  ...      Biden      1   28.2
17                 #457fff  2019-11-27 06:00:00  ...    Sanders      1   17.8
18                 #996600  2019-11-27 06:00:00  ...     Warren      1   16.7
19                 #990099  2019-11-27 06:00:00  ...  Buttigieg      1   10.5
20                 #ff9900  2019-11-27 06:00:00  ...     Harris      1    3.8
21                 #3da882  2019-11-27 06:00:00  ...       Yang      1    3.2
22                 #f2dc0f  2019-11-27 06:00:00  ...  Bloomberg      1    2.4
23                 #000000  2019-11-27 06:00:00  ...  Klobuchar      1    2.0
24                 #66ccff  2019-11-27 06:00:00  ...     Booker      1    1.7
25                 #666666  2019-11-27 06:00:00  ...     Steyer      1    1.7
26                 #ff0074  2019-11-27 06:00:00  ...    Gabbard      1    1.5
27                 #cc9900  2019-11-27 06:00:00  ...     Castro      1    1.0
28                 #9966ff  2019-11-27 06:00:00  ...     Bennet      1    0.8
29                 #10671b  2019-11-27 06:00:00  ...    Bullock      3    0.4
...                    ...                  ...  ...        ...    ...    ...
5650               #996600  2018-12-10 06:00:00  ...     Warren      1    6.0
5651               #990099  2018-12-10 06:00:00  ...  Buttigieg      1    NaN
5652               #ff9900  2018-12-10 06:00:00  ...     Harris      1    5.3
5653               #3da882  2018-12-10 06:00:00  ...       Yang      1    NaN
5654               #f2dc0f  2018-12-10 06:00:00  ...  Bloomberg      1    NaN
5655               #000000  2018-12-10 06:00:00  ...  Klobuchar      1    NaN
5656               #66ccff  2018-12-10 06:00:00  ...     Booker      1    4.0
5657               #666666  2018-12-10 06:00:00  ...     Steyer    NaN    NaN
5658               #ff0074  2018-12-10 06:00:00  ...    Gabbard      1    NaN
5659               #cc9900  2018-12-10 06:00:00  ...     Castro      1    NaN
5660               #9966ff  2018-12-10 06:00:00  ...     Bennet      1    NaN
5661               #10671b  2018-12-10 06:00:00  ...    Bullock      3    NaN
5662               #990000  2018-12-10 06:00:00  ...    Patrick    NaN    NaN
5663               #6672ff  2018-12-10 06:00:00  ...     Sestak    NaN    NaN
5664               #009900  2018-12-09 06:00:00  ...      Biden      1   29.0
5665               #457fff  2018-12-09 06:00:00  ...    Sanders      1   17.7
5666               #996600  2018-12-09 06:00:00  ...     Warren      1    6.0
5667               #990099  2018-12-09 06:00:00  ...  Buttigieg      1    NaN
5668               #ff9900  2018-12-09 06:00:00  ...     Harris      1    5.3
5669               #3da882  2018-12-09 06:00:00  ...       Yang      1    NaN
5670               #f2dc0f  2018-12-09 06:00:00  ...  Bloomberg      1    NaN
5671               #000000  2018-12-09 06:00:00  ...  Klobuchar      1    NaN
5672               #66ccff  2018-12-09 06:00:00  ...     Booker      1    4.0
5673               #666666  2018-12-09 06:00:00  ...     Steyer    NaN    NaN
5674               #ff0074  2018-12-09 06:00:00  ...    Gabbard      1    NaN
5675               #cc9900  2018-12-09 06:00:00  ...     Castro      1    NaN
5676               #9966ff  2018-12-09 06:00:00  ...     Bennet      1    NaN
5677               #10671b  2018-12-09 06:00:00  ...    Bullock      3    NaN
5678               #990000  2018-12-09 06:00:00  ...    Patrick    NaN    NaN
5679               #6672ff  2018-12-09 06:00:00  ...     Sestak    NaN    NaN
[5680 rows x 7 columns]
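Since the question asks for date:number pairs per candidate, the long-format table above can be regrouped by name. The following is only a minimal sketch: `series_by_candidate` is a hypothetical helper, and `sample` is a small hand-made stand-in that assumes `results` has the `name`, `date`, and `value` columns shown in the output.

```python
import pandas as pd

# Hand-made stand-in for a few rows of `results` (values copied from the output above)
sample = pd.DataFrame({
    'name': ['Biden', 'Sanders', 'Buttigieg', 'Biden', 'Sanders', 'Buttigieg'],
    'date': ['2019-11-28 06:00:00'] * 3 + ['2018-12-09 06:00:00'] * 3,
    'value': [27.0, 18.3, 11.0, 29.0, 17.7, float('nan')],
})

def series_by_candidate(results):
    """Return {candidate name: {date: value}}, skipping dates with no value."""
    clean = results.dropna(subset=['value'])
    return {
        name: dict(zip(group['date'], group['value'].astype(float)))
        for name, group in clean.groupby('name')
    }

pairs = series_by_candidate(sample)
print(pairs['Biden'])
```

Rows whose value is NaN (candidates not yet polled on that date) are dropped, so each candidate's dictionary only covers the dates on which an average exists.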
As you can see, when plotted it looks like the chart on the website:
import matplotlib.pyplot as plt
import seaborn as sns

# Convert columns to appropriate types for charting
results['value'] = results['value'].astype(float)
results['date'] = pd.to_datetime(results['date'])

sns.set_style('darkgrid')
# Map each candidate name to the color supplied in the data
palette = pd.Series(results['color'].values, index=results['name']).to_dict()
sns.lineplot(data=results, x="date", y="value", hue="name", palette=palette)
# Place the legend outside the plot area, to the right
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.show()
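One fragile spot in the fetching code is the `split('(')` / `rsplit(')')` step, which silently produces garbage if the server returns something other than a `return_json(...)` wrapper (an error page, for instance). A slightly stricter variant might look like the sketch below; `unwrap_jsonp` is a hypothetical helper, and the sample string is made up to mimic the shape of the response rather than copied from the site.

```python
import json
import re

def unwrap_jsonp(text, callback='return_json'):
    """Parse a JSONP body such as `return_json({...});` into Python objects."""
    pattern = r'^\s*' + re.escape(callback) + r'\s*\((.*)\)\s*;?\s*$'
    match = re.match(pattern, text, re.DOTALL)
    if match is None:
        raise ValueError('response is not wrapped in %s(...)' % callback)
    return json.loads(match.group(1))

# Made-up payload with the same nesting the answer's code relies on
sample_body = ('return_json({"poll": {"rcp_avg": [{"date": "2019-11-28", '
               '"candidate": [{"name": "Biden", "value": "27.0"}]}]}});')
data = unwrap_jsonp(sample_body)
print(data['poll']['rcp_avg'][0]['candidate'][0]['name'])
```

With this in place, `jsonData = unwrap_jsonp(jsonStr)` could stand in for the one-line split, and an unexpected response fails loudly instead of raising a confusing JSON decoding error.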