I'm trying to scrape data with Python from a website whose URL contains a # (http://www.epa.ie/hydronet/#Water%20Levels), but when I parse the response as HTML I get the following error message:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN""http://www.w3.org/TR/html4/strict.dtd">
<html><head><title>Bad Request</title>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/></head>
<body><h2>Bad Request - Invalid URL</h2>
<hr/><p>HTTP Error 400. The request URL is invalid.</p>
</body></html>
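Any help is appreciated.
Answer 0 (score: 1)
The data on that page is loaded dynamically via Ajax. In the Firefox network inspector you can see many JSON data files being loaded, for example this one (warning, huge!):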
import requests
from pprint import pprint

url = "http://www.epa.ie/Hydronet/output/internet/layers/10/index.json"
# Download the layer file and decode the JSON body.
data = requests.get(url).json()
pprint(data)
This prints data for about 3000 stations in the following format:
...
{'L1_CATCHMENT_SIZE': '0.00 km²',
'L1_DATA_AVAILABLE': 'Water Level Only',
'L1_LTA_RAINFALL_1961_1990': '',
'L1_ObjectDescription': '',
'L1_RESPONSIBLE_BODY': 'Waterways Ireland',
'L1_STATION_OWNER': 'Waterways Ireland',
'L1_TYPE_OF_GAUGING': 'Recorder',
'L1_WEB_GW_height_system_suffix': '',
'L1_WTO_OBJECT': 'GRAND CANAL',
'L1_Web_Desc': '',
'L1_Web_ELT_95PERCENTILE': '',
'L1_Web_E_50PERCENTILE': '',
'L1_Web_Legend': 'Active Waterways Ireland',
'L1_Web_Link': '<p><strong><b><a '
'href="http://netview.ott.com/waterwaysireland-le/">CLICK '
'HERE for Waterways Ireland Station Data</a></strong><b></p>',
'L1_admin_name': '---',
'L1_area_name': '',
'L1_label': 'Stage',
'L1_req_timestamp': None,
'L1_river_name': 'GRAND CANAL',
'L1_station_GWREF_DATUM': '',
'L1_station_gauge_datum': '0.0',
'L1_station_gauge_datum_unit': '---',
'L1_station_status': 'Active',
'L1_stationparameter_name': 'Stage',
'L1_stationparameter_no': 'S',
'L1_timestamp': None,
'L1_ts_id': 40185010,
'L1_ts_name': 'StaffGaugeCheck',
'L1_ts_unitsymbol': 'm',
'L1_ts_value': None,
'L1_web_type_gw': '',
'L1_web_waterbody': '',
'metadata_CATCHMENT_SIZE': '0.00 km²',
'metadata_RESPONSIBLE_BODY': 'Waterways Ireland',
'metadata_STATION_OWNER': 'Waterways Ireland',
'metadata_TYPE_OF_GAUGING': 'Recorder',
'metadata_WTO_OBJECT': 'GRAND CANAL',
'metadata_Web_ELT_95PERCENTILE': '',
'metadata_Web_E_50PERCENTILE': '',
'metadata_Web_Legend': 'Active Waterways Ireland',
'metadata_admin_name': '---',
'metadata_area_name': '',
'metadata_catchment_name': 'Shannon',
'metadata_river_name': 'GRAND CANAL',
'metadata_station_carteasting': '224943.99999999997',
'metadata_station_cartnorthing': '225708.0000000003',
'metadata_station_gauge_datum_unit': '---',
'metadata_station_id': '3049391',
'metadata_station_latitude': '53.281124634829155',
'metadata_station_local_x': '224943.99999999997',
'metadata_station_local_y': '225708.0000000003',
'metadata_station_longitude': '-7.625991577700672',
'metadata_station_name': 'KIRKWINS BR',
'metadata_station_no': '25069',
'metadata_station_status': 'Active'},
... and so on
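Since each record is a flat dictionary, picking out a few fields is straightforward. The sketch below is only an illustration: it assumes the file decodes to a list of such dictionaries, and it runs on a small hand-made sample rather than the real download. The second sample entry ("EXAMPLE WEIR") and the function name `summarize` are hypothetical.

```python
# Hypothetical sample shaped like the records printed above (only a few keys kept).
sample_stations = [
    {"metadata_station_name": "KIRKWINS BR", "metadata_station_no": "25069",
     "metadata_station_status": "Active", "L1_ts_value": None},
    # Made-up entry, included only to show the filter rejecting something.
    {"metadata_station_name": "EXAMPLE WEIR", "metadata_station_no": "00000",
     "metadata_station_status": "Inactive", "L1_ts_value": "0.351"},
]


def summarize(stations, status="Active"):
    """Return (station_no, station_name) pairs for stations with the given status."""
    return [
        (s.get("metadata_station_no"), s.get("metadata_station_name"))
        for s in stations
        if s.get("metadata_station_status") == status
    ]


print(summarize(sample_stations))  # [('25069', 'KIRKWINS BR')]
```

Using `.get()` instead of indexing means a record that lacks one of the keys is skipped or yields None rather than raising KeyError.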
There are a few other data files; look at the URLs in the network inspector to find them.
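The two index.json URLs in this answer differ only in the number after "layers/", so a small helper can build the URL from that number. This is a sketch under assumptions: the names `layer_url` and `fetch_layer` are made up here, only 10 and 20 are known from this answer to be valid numbers, and it uses the standard library's urllib instead of requests to avoid a dependency.

```python
import json
from urllib.request import urlopen

# URL pattern inferred from the two URLs in this answer (an assumption).
LAYER_URL = "http://www.epa.ie/Hydronet/output/internet/layers/{layer_id}/index.json"


def layer_url(layer_id):
    """Build the index.json URL for one layer number."""
    return LAYER_URL.format(layer_id=layer_id)


def fetch_layer(layer_id, timeout=30):
    """Download one layer file and return the decoded JSON."""
    with urlopen(layer_url(layer_id), timeout=timeout) as response:
        return json.load(response)
```

Calling `fetch_layer(20)` would then return the same data as the requests-based code, provided the URL is still live. For any other number, confirm the URL in the network inspector first.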
Edit:
To print "metadata_station_name" and "L1_ts_value", you can use the following code:
import requests

url = "http://epa.ie/Hydronet/output/internet/layers/20/index.json"
# Download the layer file and decode the JSON body.
data = requests.get(url).json()

# Each entry describes one station; print its name and its L1_ts_value.
for station in data:
    print(station['metadata_station_name'], station['L1_ts_value'])
    print('-' * 80)
Prints:
BALLYMAN None
--------------------------------------------------------------------------------
CARRIGAHORIG 0.351
--------------------------------------------------------------------------------
KILCOLGAN 0.668
--------------------------------------------------------------------------------
CASTLEMARTYR None
--------------------------------------------------------------------------------
BALLEA 0.376
--------------------------------------------------------------------------------
CAHERFINESKER 0.000
--------------------------------------------------------------------------------
PORTUMNA 701.200
--------------------------------------------------------------------------------
... and so on.
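Some stations report None, and the values that are present look like strings ("0.000", "701.200"), so a conversion step is probably needed before doing any arithmetic on them. The following is a minimal sketch, assuming that reading of the output is right; `parse_level` is a hypothetical helper name, and the sample pairs are copied from the output above.

```python
def parse_level(raw):
    """Convert an L1_ts_value entry to float, or None when missing or non-numeric."""
    if raw is None:
        return None
    try:
        return float(raw)
    except (TypeError, ValueError):
        return None


# (name, value) pairs taken from the printed output above.
sample = [("BALLYMAN", None), ("CARRIGAHORIG", "0.351"), ("PORTUMNA", "701.200")]

readings = {name: parse_level(value) for name, value in sample}
# Keep only the stations that actually reported a value.
with_values = {name: v for name, v in readings.items() if v is not None}

print(with_values)  # {'CARRIGAHORIG': 0.351, 'PORTUMNA': 701.2}
```

`float()` also accepts values that are already numbers, so the helper works whether the JSON stores them as strings or as numbers.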