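I'm new to Stack Overflow and also new to Python. I'm trying to scrape some data from a website. I've managed to
extract the text from a paragraph, and I've downloaded a file from a link. I now want to extract the data from a chart container.
The HTML is as follows: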
<figure class="chart-container"
data-chart-type="stacked-level"
data-anchor=""
data-is-split-series="False"
data-all-data='[["Date","Residential","Non-residential","Other construction"],["15",13419.0,5858.0,6000.0],["Jun-15",13536.0,5918.0,5962.0],["Sep-15",13750.0,5870.0,5942.0],["Dec-15",14003.0,5962.0,5957.0],["16",14368.0,6104.0,5873.0],["Jun-16",14868.0,6296.0,5657.0],["Sep-16",15234.0,6524.0,5534.0],["Dec-16",15514.0,6747.0,5456.0],["17",15587.0,6756.0,5408.0],["Jun-17",15496.0,6677.0,5508.0],["Sep-17",15561.0,6597.0,5815.0],["Dec-17",15653.0,6559.0,6130.0],["18",15750.0,6590.0,6356.0],["Jun-18",15893.0,6660.0,6523.0],["Sep-18",15953.0,6710.0,6413.0],["Dec-18",16063.0,6804.0,6294.0],["19",16321.0,7064.0,6111.0],["Jun-19",16526.0,7226.0,5927.0],["Sep-19",16848.0,7408.0,5819.0],["Dec-19",16972.0,7499.0,5743.0],["20",17008.0,7342.0,5753.0],["Jun-20",17148.0,7287.0,5775.0],["Sep-20",17150.0,7201.0,5887.0],["Dec-20",17118.0,7106.0,6005.0],["21",17134.0,7050.0,6102.0],["Jun-21",17108.0,6926.0,6159.0],["Sep-21",17128.0,6788.0,6285.0],["Dec-21",17131.0,6655.0,6389.0],["22",16954.0,6595.0,6490.0],["Jun-22",16742.0,6575.0,6541.0],["Sep-22",16444.0,6606.0,6636.0],["Dec-22",15987.0,6643.0,6726.0],["23",15470.0,6684.0,6815.0],["Jun-23",14956.0,6740.0,6831.0],["Sep-23",14417.0,6786.0,6931.0],["Dec-23",13982.0,6799.0,7029.0],["24",13504.0,6783.0,7127.0],["Jun-24",13035.0,6740.0,7150.0]]'
data-show-text-every="4"
data-color="#233657,#64971c,#869eac,#00cc7a,#c5cdd3"
data-forecast-start="17"
>
Chart goes here.
</figure>
I want to extract the data held in the 'data-all-data' attribute. Ideally, I'd like to save it to a .csv file so that I can recreate the chart.
import requests
from bs4 import BeautifulSoup

#create dictionary for login data
login_data = {
    'UserName': 'myUsername',
    'Password': 'myPassword',
    'RememberMe': 'true'
}
#create a session
with requests.session() as s:
    url = 'https://portal.infometrics.co.nz/Login'
    r = s.get(url)
    soup = BeautifulSoup(r.content, 'html5lib')
    #Add the unique login values to dictionary
    login_data['ReturnUrl'] = soup.find('input', attrs={'id': 'ReturnUrl'})['value']
    login_data['__RequestVerificationToken'] = soup.find('input', attrs={'name': '__RequestVerificationToken'})['value']
    r = s.post(url, data=login_data)
    #soup = BeautifulSoup(r.content, 'html5lib')
    #print(soup.prettify())

    #1. Find the latest 'Downloads' file extension
    url = 'https://portal.infometrics.co.nz/Forecasts/Building%20forecasts'
    r = s.get(url)
    soup = BeautifulSoup(r.content, 'html.parser')
    el_d = soup.find(string='Data download')
    url_2 = el_d.find_parent('a')['href']
    #Add the extension to the known part of the url
    url_1 = 'https://portal.infometrics.co.nz'
    url_d = url_1 + url_2
    #print(url_d)
    r = s.get(url_d)
    #save content into .xlsx workbook
    with open('C:/Users/ZAGOOBR/Downloads/QBR_Data.xlsx', 'wb') as f:
        f.write(r.content)

    #2. Find the latest 'Chart write' up
    r = s.get(url)
    soup = BeautifulSoup(r.content, 'html.parser')
    el_cw = soup.find('p').getText()
    #print(el_cw)
    #save content into .txt file
    with open('C:/Users/ZAGOOBR/Downloads/QBR_ChartText.txt', 'a') as f:
        f.write(el_cw)

    #3. Download the chart data
    list = []
    el_cd = soup.find('figure', attrs = {'class':'data-all-data'})
Any help for a beginner would be greatly appreciated.
Answer 0 (score: 0)
Assuming the rest of your code works as you intend, try something like the following:
import json
import csv

# your code here

el_cd = soup.find('figure', attrs = {'class':'chart-container'})
data = el_cd.get('data-all-data')
rows = json.loads(data)
with open('your-file-here.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(rows)
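
For anyone who wants to try this without logging in to the portal, here is a minimal, self-contained sketch of the same idea. It assumes a shortened copy of the <figure> from the question (only the header and the first two data rows), and the names by_class, figure and buffer are just illustrative. It also shows why the lookup in the question found nothing: 'data-all-data' is the name of an attribute, not a class, so the element has to be selected by its class ('chart-container') or by the presence of that attribute.

```python
import csv
import io
import json

from bs4 import BeautifulSoup

# Shortened copy of the <figure> from the question (header + first two rows only).
html = """
<figure class="chart-container"
        data-chart-type="stacked-level"
        data-all-data='[["Date","Residential","Non-residential","Other construction"],["15",13419.0,5858.0,6000.0],["Jun-15",13536.0,5918.0,5962.0]]'
        data-show-text-every="4">
Chart goes here.
</figure>
"""

soup = BeautifulSoup(html, "html.parser")

# The lookup from the question: no element has the class 'data-all-data',
# so find() returns None.
by_class = soup.find("figure", attrs={"class": "data-all-data"})

# Passing True as the value matches any <figure> that carries the attribute.
figure = soup.find("figure", attrs={"data-all-data": True})
if figure is None:
    raise SystemExit("no <figure> with a data-all-data attribute found")

# The attribute value is a JSON array of arrays; the first inner array is the header.
rows = json.loads(figure["data-all-data"])

# Written to an in-memory buffer here; for a real file use
# open('your-file-here.csv', 'w', newline='') as in the answer above.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Checking for None before reading the attribute gives a clearer message than the TypeError or AttributeError you would otherwise get if the page layout changes or the login fails.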