How to scrape a web page that generates an updating data table

Time: 2021-07-30 02:50:35

Tags: python beautifulsoup

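I am trying to build a simple web scraper in Python using Beautiful Soup.

The site is an earthquake monitoring site. It looks fairly complex and is integrated with a geographic map, but I am only interested in one table, which is updated only when an earthquake occurs (or a forecast is issued). The information I need is under the "More" button in the lower-right corner, where you can select a specific city of interest.

What I want to do is scrape this information and check whether the latest entry has changed. If the "maximum seismic intensity in the target capital/city" column of the newest row (the top row) is updated to a number greater than 4, I want the code to return a True/False Boolean, so that I can control an instrument from a code module in LabVIEW.

Can anyone help me with this? Many thanks!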

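1 answer:

Answer 0 (score: 2)

The site is dynamic: it uses a script to query an endpoint and fills the table with the returned data. You can therefore either load the page with a browser automation tool such as selenium, or query the endpoint yourself and parse the JSON response: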

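import json
import requests

# Query the JSON endpoint that the page's own script uses to fill the table
data = json.loads(requests.get('https://www.jma.go.jp/bosai/quake/data/list.json?__time__=202107300300').text)
# Keep only the fields of interest from each entry
result = [{'observed':i['at'], 'region':i['en_anm'], 'magnitude':i['mag'], 'max intensity':i['maxi']} for i in data]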

Output (first ten rows of result):

[{'observed': '2021-07-30T03:48:00+09:00', 'region': 'Off the Coast of Iwate Prefecture', 'magnitude': '3.7', 'max intensity': '1'}, {'observed': '2021-07-30T03:26:00+09:00', 'region': 'Southern Kyoto Prefecture', 'magnitude': '3.6', 'max intensity': '3'}, {'observed': '2021-07-30T03:26:00+09:00', 'region': 'Southern Kyoto Prefecture', 'magnitude': '3.6', 'max intensity': ''}, {'observed': '2021-07-30T03:26:00+09:00', 'region': '', 'magnitude': '', 'max intensity': '3'}, {'observed': '2021-07-29T21:22:00+09:00', 'region': 'Adjacent Sea of\u200b Chichijima Island', 'magnitude': '4.2', 'max intensity': '1'}, {'observed': '2021-07-29T21:17:00+09:00', 'region': 'Adjacent Sea of\u200b Chichijima Island', 'magnitude': '4.1', 'max intensity': '1'}, {'observed': '2021-07-29T18:57:00+09:00', 'region': 'Adjacent Sea of Tokara Islands', 'magnitude': '2.0', 'max intensity': '1'}, {'observed': '2021-07-29T18:52:00+09:00', 'region': 'Adjacent Sea of Tokara Islands', 'magnitude': '2.8', 'max intensity': '2'}, {'observed': '2021-07-29T15:16:00+09:00', 'region': 'Aleutian Islands', 'magnitude': '8.2', 'max intensity': ''}, {'observed': '2021-07-29T16:17:00+09:00', 'region': 'Off the Coast of Ibaraki Prefecture', 'magnitude': '4.1', 'max intensity': '1'}]
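The sample output above is not strictly chronological (the 15:16 Aleutian Islands entry comes before the 16:17 Ibaraki entry), so list position alone may not identify the newest row. A minimal sketch that sorts by timestamp instead is shown below. `newest_first` is a hypothetical helper name, and the code assumes every 'observed' value is an ISO 8601 string with a UTC offset, as in the sample.

```python
from datetime import datetime

def newest_first(records):
    """Return the records sorted by their 'observed' timestamp, newest first."""
    return sorted(
        records,
        key=lambda r: datetime.fromisoformat(r['observed']),
        reverse=True,
    )

# Three rows copied from the output above, deliberately out of order
sample = [
    {'observed': '2021-07-29T15:16:00+09:00', 'region': 'Aleutian Islands', 'magnitude': '8.2', 'max intensity': ''},
    {'observed': '2021-07-29T16:17:00+09:00', 'region': 'Off the Coast of Ibaraki Prefecture', 'magnitude': '4.1', 'max intensity': '1'},
    {'observed': '2021-07-30T03:48:00+09:00', 'region': 'Off the Coast of Iwate Prefecture', 'magnitude': '3.7', 'max intensity': '1'},
]
ordered = newest_first(sample)
print(ordered[0]['region'])  # Off the Coast of Iwate Prefecture
```

Parsing the timestamps rather than comparing the raw strings keeps the ordering correct even if entries were to carry different UTC offsets.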

Edit: to extract the data for a specific region:

vals = [i for i in result if 'Ibaraki Prefecture' in i['region']]

Output:

[{'observed': '2021-07-29T16:17:00+09:00', 'region': 'Off the Coast of Ibaraki Prefecture', 'magnitude': '4.1', 'max intensity': '1'}, {'observed': '2021-07-28T01:52:00+09:00', 'region': 'Southern Ibaraki Prefecture', 'magnitude': '3.4', 'max intensity': '1'}, {'observed': '2021-07-28T00:55:00+09:00', 'region': 'Off the Coast of Ibaraki Prefecture', 'magnitude': '4.5', 'max intensity': '3'}, {'observed': '2021-07-28T00:55:00+09:00', 'region': 'Off the Coast of Ibaraki Prefecture', 'magnitude': '4.5', 'max intensity': ''}, {'observed': '2021-07-27T13:39:00+09:00', 'region': 'Off the Coast of Ibaraki Prefecture', 'magnitude': '4.0', 'max intensity': '1'}, {'observed': '2021-07-23T09:59:00+09:00', 'region': 'Off the Coast of Ibaraki Prefecture', 'magnitude': '3.8', 'max intensity': '1'}, {'observed': '2021-07-23T03:15:00+09:00', 'region': 'Off the Coast of Ibaraki Prefecture', 'magnitude': '3.7', 'max intensity': '2'}, {'observed': '2021-07-20T15:56:00+09:00', 'region': 'Off the Coast of Ibaraki Prefecture', 'magnitude': '3.6', 'max intensity': '1'}, {'observed': '2021-07-15T05:17:00+09:00', 'region': 'Off the Coast of Ibaraki Prefecture', 'magnitude': '3.1', 'max intensity': '1'}, {'observed': '2021-07-04T15:35:00+09:00', 'region': 'Off the Coast of Ibaraki Prefecture', 'magnitude': '4.2', 'max intensity': '3'}, {'observed': '2021-07-04T15:35:00+09:00', 'region': 'Off the Coast of Ibaraki Prefecture', 'magnitude': '4.2', 'max intensity': ''}]
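The question asks for a True/False result when the newest intensity for a region exceeds 4. One possible sketch of that check follows. `parse_intensity` and `latest_exceeds` are hypothetical helper names, not part of any library. The code assumes the records are ordered newest first, skips rows whose 'max intensity' is empty (several appear in the output above), and keeps only the leading digits of the value on the assumption that some levels may carry a suffix; how the feed encodes such levels is not shown in the sample.

```python
import re

def parse_intensity(text):
    """Return the leading integer of an intensity string, or None if there is none."""
    match = re.match(r'\d+', text.strip())
    return int(match.group()) if match else None

def latest_exceeds(records, region, threshold=4):
    """True if the first record for `region` with a readable intensity exceeds `threshold`.

    Records are assumed to be ordered newest first.
    """
    for rec in records:
        if region not in rec['region']:
            continue
        level = parse_intensity(rec['max intensity'])
        if level is not None:
            return level > threshold
    return False  # no matching record with a usable intensity

# First two rows of the Ibaraki output above
vals_sample = [
    {'observed': '2021-07-29T16:17:00+09:00', 'region': 'Off the Coast of Ibaraki Prefecture', 'magnitude': '4.1', 'max intensity': '1'},
    {'observed': '2021-07-28T01:52:00+09:00', 'region': 'Southern Ibaraki Prefecture', 'magnitude': '3.4', 'max intensity': '1'},
]
print(latest_exceeds(vals_sample, 'Ibaraki Prefecture'))  # False
```

Returning False when nothing matches is a design choice for driving an instrument: the absence of data is treated as "no alarm". A caller that needs to tell the two cases apart could return None instead.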

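Edit: to send the request from a restricted network environment through a provided proxy, pass the proxy information via the proxies parameter of requests.get: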

import json
import requests

proxies = {"https": "https://10.10.1.11:1080"}  # example of an HTTPS proxy
data = json.loads(requests.get('https://www.jma.go.jp/bosai/quake/data/list.json?__time__=202107300300',
       proxies=proxies).text)
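However the request is sent, the question also asks whether the table has changed since the last check. A rough sketch of that comparison is to remember the newest timestamp between polls. `UpdateWatcher` is a hypothetical class introduced only for illustration; it assumes the records are ordered newest first and that a new event always changes the first row's 'observed' value.

```python
class UpdateWatcher:
    """Remember the newest 'observed' value seen so far and report changes."""

    def __init__(self):
        self.last_seen = None  # nothing has been seen before the first poll

    def has_update(self, records):
        """Return True when the first record's timestamp differs from the previous call.

        The very first call with data reports True, because there is no
        earlier value to compare against.
        """
        if not records:
            return False
        newest = records[0]['observed']
        changed = newest != self.last_seen
        self.last_seen = newest
        return changed

watcher = UpdateWatcher()
poll = [{'observed': '2021-07-30T03:48:00+09:00', 'region': 'Off the Coast of Iwate Prefecture', 'magnitude': '3.7', 'max intensity': '1'}]
print(watcher.has_update(poll))  # True: first poll
print(watcher.has_update(poll))  # False: same newest row as before
```

The state lives only in memory, so a caller that starts a fresh Python process for each check would need to persist `last_seen` somewhere, for example in a small file.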

Alternatively, you can use a browser automation tool such as selenium to load the page and BeautifulSoup to parse the page source:

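from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup as soup

# Selenium 4 takes the driver path through a Service object (older releases accepted it as the first positional argument)
d = webdriver.Chrome(service=Service('/path/to/chromedriver'))
d.get('https://www.jma.go.jp/bosai/map.html#8/36.412/140.021/&elem=int&contents=earthquake_map&lang=en')
# Parse the rendered page and locate the earthquake table
tbl = soup(d.page_source, 'html.parser').select_one('.contents-block table')
# Column names come from the header row
hds = [i.get_text(strip=True) for i in tbl.select('tr.contents-title th')]
# Build one dict per data row that follows the header row
full_table = [dict(zip(hds, [i.get_text(strip=True) for i in b.select('td')])) for b in tbl.select('tr.contents-title ~ tr')]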

Output (first ten rows of full_table):

[{'Observed at (JST)': '09:37 JST 02 Aug. 2021', 'Region name': 'Eastern Shimane Prefecture', 'Depth': '10 km', 'Magnitude': '4.3', 'Maximumseismicintensity(JMASeismicIntensity)': '4'}, {'Observed at (JST)': '00:56 JST 02 Aug. 2021', 'Region name': 'Southern Miyagi Prefecture', 'Depth': '10 km', 'Magnitude': '2.3', 'Maximumseismicintensity(JMASeismicIntensity)': '1'}, {'Observed at (JST)': '22:50 JST 01 Aug. 2021', 'Region name': 'Hamadori, Fukushima Prefecture', 'Depth': '100 km', 'Magnitude': '3.9', 'Maximumseismicintensity(JMASeismicIntensity)': '2'}, {'Observed at (JST)': '16:53 JST 01 Aug. 2021', 'Region name': 'Adjacent Sea of Okinawa Main Island', 'Depth': '30 km', 'Magnitude': '4.1', 'Maximumseismicintensity(JMASeismicIntensity)': '2'}, {'Observed at (JST)': '12:18 JST 01 Aug. 2021', 'Region name': 'Adjacent Sea of\u200b Miyakojima Island', 'Depth': '30 km', 'Magnitude': '3.9', 'Maximumseismicintensity(JMASeismicIntensity)': '1'}, {'Observed at (JST)': '09:44 JST 01 Aug. 2021', 'Region name': 'Southern Nagano Prefecture', 'Depth': '10 km', 'Magnitude': '2.1', 'Maximumseismicintensity(JMASeismicIntensity)': '1'}, {'Observed at (JST)': '02:18 JST 01 Aug. 2021', 'Region name': 'Off the east Coast of Hokkaido', 'Depth': '30 km', 'Magnitude': '3.6', 'Maximumseismicintensity(JMASeismicIntensity)': '1'}, {'Observed at (JST)': '20:04 JST 31 Jul. 2021', 'Region name': 'Off the Coast of Iwate Prefecture', 'Depth': '50 km', 'Magnitude': '3.4', 'Maximumseismicintensity(JMASeismicIntensity)': '1'}, {'Observed at (JST)': '14:26 JST 31 Jul. 2021', 'Region name': 'Southern Sorachi Region, Hokkaido', 'Depth': '180 km', 'Magnitude': '5.0', 'Maximumseismicintensity(JMASeismicIntensity)': '2'}, {'Observed at (JST)': '13:09 JST 31 Jul. 2021', 'Region name': 'Southern Tokushima Prefecture', 'Depth': '50 km', 'Magnitude': '4.5', 'Maximumseismicintensity(JMASeismicIntensity)': '3'}]