我正在尝试从以下网页抓取图形数据:“ https://cawp.rutgers.edu/women-percentage-2020-candidates”
我尝试使用以下代码从Graph中提取数据:
import requests
from bs4 import BeautifulSoup
Res = requests.get('https://cawp.rutgers.edu/women-percentage-2020-candidates').text
soup = BeautifulSoup(Res, "html.parser")
Values= [i.text for i in soup.findAll('g', {'class': 'igc-graph'}) if i]
Dates = [i.text for i in soup.findAll('g', {'class': 'igc-legend-entry'}) if i]
print(Values, Dates) ## both list are empty
Data= pd.DataFrame({'Value':Values,'Date':Dates}) ## Returning an Empty Dataframe
我想从所有4个条形图中提取日期和值。请任何人建议我在这里要提取图形数据,或者有没有其他方法可以尝试提取数据。谢谢;
答案 0 :(得分:1)
您可以尝试使用此脚本从页面中提取一些数据:
import re
import json
import requests
from bs4 import BeautifulSoup
url = 'https://cawp.rutgers.edu/women-percentage-2020-candidates'
infogram_url = 'https://e.infogram.com/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
def find_data(d):
if isinstance(d, dict):
for k, v in d.items():
if k == 'data' and isinstance(v, list):
yield v
else:
yield from find_data(v)
elif isinstance(d, list):
for v in d:
yield from find_data(v)
for i in soup.select('.infogram-embed'):
print(i['data-title'])
html_data = requests.get(infogram_url + i['data-id']).text
data = re.search(r'window\.infographicData=({.*})', html_data).group(1)
data = json.loads(data)
# uncomment this to print all data:
# print(json.dumps(data, indent=4))
for d in find_data(data):
print(d)
print('-' * 80)
打印:
Candidate Tracker 2020_US House_Proportions
[[['', 'Districts Already Filed'], ['2020', '435']]]
[[['', '2016', '2018', '2020'], ['Filed', '17.8%', '24.2%', '29.1%']], [['', '2016', '2018', '2020'], ['Filed', '25.1%', '32.5%', '37.9%']], [['', '2016', '2018', '2020'], ['Filed', '11.5%', '13.7%', '21.3%']]]
--------------------------------------------------------------------------------
Candidate Tracker Nominees 2020_US House_Proportions
[[['', 'Possible Major-Party Nominations Decided', 'Possible Major-Party Nominations Left to be Decided'], ['2020', '829', '18']]]
[[['', '', '2018', '2020'], ['Percent of Nominees', '', '28.4%', '35.6%']], [['', '', '2018', '2020'], ['Percent of Nominees', '', '42.4%', '48.3%']], [['', '', '2018', '2020'], ['Percent of Nominees', '', '13.2%', '22.5%']]]
--------------------------------------------------------------------------------
Candidate Tracker 2020_US Senate_Proportions
[[['', 'States with Senate Contests Already Filed'], ['2020', '34']]]
[[['', '', '2018', '2020'], ['Filed', '', '20.9%', '23.9%']], [['', '', '2018', '2020'], ['Filed', '', '32.6%', '31.1%']], [['', '', '2018', '2020'], ['Filed', '', '14%', '17.4%']]]
--------------------------------------------------------------------------------
Candidate Tracker Nominees 2020_US Senate_Proportions
[[['', 'Decided', 'Left to be Decided'], ['2020', '29', '6']]]
[[['', '', '2018', '2020'], ['Percent of Nominees', '', '32.6%', '31.6%']], [['', '', '2018', '2020'], ['Percent of Nominees', '', '42.9%', '39.3%']], [['', '', '2018', '2020'], ['Percent of Nominees', '', '23.5%', '24.1%']]]
--------------------------------------------------------------------------------