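Error when scraping data from a website with "#" in its URL

Time: 2018-07-23 13:19:38

Tags: python html web-scraping

I am trying to use Python to scrape data from a website whose URL contains a # sign (http://www.epa.ie/hydronet/#Water%20Levels), but when I parse the response as HTML I get the following error message: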

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN""http://www.w3.org/TR/html4/strict.dtd">

<html><head><title>Bad Request</title>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/></head>
<body><h2>Bad Request - Invalid URL</h2>
<hr/><p>HTTP Error 400. The request URL is invalid.</p>
</body></html>

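Any help is appreciated.

1 Answer:

Answer 0: (score: 1)

The data on that page is loaded dynamically via Ajax. In the Firefox network inspector you can see many JSON data files being loaded, for example this one (warning, huge!):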

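import requests
from pprint import pprint

url = "http://www.epa.ie/Hydronet/output/internet/layers/10/index.json"

# Fetch the JSON file that the page itself loads via Ajax
response = requests.get(url)
response.raise_for_status()  # fail early on an HTTP error status

data = response.json()  # decode the JSON body
pprint(data)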

This prints data for about 3000 stations in the following format:

...
 {'L1_CATCHMENT_SIZE': '0.00 km²',
  'L1_DATA_AVAILABLE': 'Water Level Only',
  'L1_LTA_RAINFALL_1961_1990': '',
  'L1_ObjectDescription': '',
  'L1_RESPONSIBLE_BODY': 'Waterways Ireland',
  'L1_STATION_OWNER': 'Waterways Ireland',
  'L1_TYPE_OF_GAUGING': 'Recorder',
  'L1_WEB_GW_height_system_suffix': '',
  'L1_WTO_OBJECT': 'GRAND CANAL',
  'L1_Web_Desc': '',
  'L1_Web_ELT_95PERCENTILE': '',
  'L1_Web_E_50PERCENTILE': '',
  'L1_Web_Legend': 'Active Waterways Ireland',
  'L1_Web_Link': '<p><strong><b><a '
                 'href="http://netview.ott.com/waterwaysireland-le/">CLICK '
                 'HERE for Waterways Ireland Station Data</a></strong><b></p>',
  'L1_admin_name': '---',
  'L1_area_name': '',
  'L1_label': 'Stage',
  'L1_req_timestamp': None,
  'L1_river_name': 'GRAND CANAL',
  'L1_station_GWREF_DATUM': '',
  'L1_station_gauge_datum': '0.0',
  'L1_station_gauge_datum_unit': '---',
  'L1_station_status': 'Active',
  'L1_stationparameter_name': 'Stage',
  'L1_stationparameter_no': 'S',
  'L1_timestamp': None,
  'L1_ts_id': 40185010,
  'L1_ts_name': 'StaffGaugeCheck',
  'L1_ts_unitsymbol': 'm',
  'L1_ts_value': None,
  'L1_web_type_gw': '',
  'L1_web_waterbody': '',
  'metadata_CATCHMENT_SIZE': '0.00 km²',
  'metadata_RESPONSIBLE_BODY': 'Waterways Ireland',
  'metadata_STATION_OWNER': 'Waterways Ireland',
  'metadata_TYPE_OF_GAUGING': 'Recorder',
  'metadata_WTO_OBJECT': 'GRAND CANAL',
  'metadata_Web_ELT_95PERCENTILE': '',
  'metadata_Web_E_50PERCENTILE': '',
  'metadata_Web_Legend': 'Active Waterways Ireland',
  'metadata_admin_name': '---',
  'metadata_area_name': '',
  'metadata_catchment_name': 'Shannon',
  'metadata_river_name': 'GRAND CANAL',
  'metadata_station_carteasting': '224943.99999999997',
  'metadata_station_cartnorthing': '225708.0000000003',
  'metadata_station_gauge_datum_unit': '---',
  'metadata_station_id': '3049391',
  'metadata_station_latitude': '53.281124634829155',
  'metadata_station_local_x': '224943.99999999997',
  'metadata_station_local_y': '225708.0000000003',
  'metadata_station_longitude': '-7.625991577700672',
  'metadata_station_name': 'KIRKWINS BR',
  'metadata_station_no': '25069',
  'metadata_station_status': 'Active'},

  ... and so on

There are a few other data files; to find them, look at the request URLs in the network inspector.
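As for the `#` in the page address: the part of a URL after `#` is a fragment, which is handled on the client side and is not meant to be sent to the server. The question does not show the request code, so this is only a guess at the cause of the 400 response, but a minimal sketch of splitting the fragment off with the standard library looks like this:

```python
from urllib.parse import urldefrag, unquote

page_url = "http://www.epa.ie/hydronet/#Water%20Levels"

# urldefrag separates the part a server sees from the client-side fragment
base_url, fragment = urldefrag(page_url)

print(base_url)           # the address to actually request
print(unquote(fragment))  # the fragment, percent-decoded for readability
```

Requesting `base_url` only returns the page shell, though; the station data still comes from the JSON files mentioned above.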

EDIT:

To print 'metadata_station_name' and 'L1_ts_value', you can use the following code:

import requests

url = "http://epa.ie/Hydronet/output/internet/layers/20/index.json"

# Each element of the decoded JSON list describes one station
data = requests.get(url).json()
for station in data:
    print(station['metadata_station_name'], station['L1_ts_value'])
    print('-' * 80)

Prints:

BALLYMAN None
--------------------------------------------------------------------------------
CARRIGAHORIG 0.351
--------------------------------------------------------------------------------
KILCOLGAN 0.668
--------------------------------------------------------------------------------
CASTLEMARTYR None
--------------------------------------------------------------------------------
BALLEA 0.376
--------------------------------------------------------------------------------
CAHERFINESKER 0.000
--------------------------------------------------------------------------------
PORTUMNA 701.200
--------------------------------------------------------------------------------
... and so on.
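
Some stations report `None`, so a possible next step is to keep only the stations that have a reading. The sketch below assumes `L1_ts_value` is either `None` or a numeric string, as the printed output suggests. `readings` is a hypothetical helper name, and the sample records are copied from the output above rather than fetched:

```python
def readings(stations):
    """Return {station name: value as float}, skipping stations with no reading."""
    result = {}
    for station in stations:
        value = station.get("L1_ts_value")
        if value is None:
            continue  # no current reading for this station
        result[station["metadata_station_name"]] = float(value)
    return result


# Sample records taken from the printed output (assumed to be strings or None)
sample = [
    {"metadata_station_name": "BALLYMAN", "L1_ts_value": None},
    {"metadata_station_name": "CARRIGAHORIG", "L1_ts_value": "0.351"},
    {"metadata_station_name": "PORTUMNA", "L1_ts_value": "701.200"},
]

print(readings(sample))
```

With the real data, `readings(data)` could be called on the list returned by `requests.get(url).json()`.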