Question

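I'm trying to get the data from each popup on a map. I've used BeautifulSoup before, but this is my first time pulling data from an interactive map.

Any push in the right direction would help. So far I'm coming up empty. Here is what I have, though it isn't much: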

from bs4 import BeautifulSoup as bs4
import requests

url = 'https://www.oaklandconduit.com/development_map'
r = requests.get(url).text
soup = bs4(r, "html.parser")
address = soup.find_all("div", {"class": "leaflet-pane leaflet-marker-pane"})

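Update: following the suggestion, I parsed the JavaScript content with re using the script below, but loading the result into json returns an error.

import json, re, requests
url = 'https://ebrrd.nationbuilder.com/themes/3/58597f55b92871671e000000/0/attachments/14822603711537993218/default/mapscript.js'
# .text gives a decoded str, which the str regex pattern below requires on Python 3
r = requests.get(url).text
content = re.findall(r'var.*?=\s*(.*?);', r, re.DOTALL | re.MULTILINE)[2]
json_content = json.loads(content)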

Answer 1

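Interactive maps are loaded and driven by JavaScript, so the requests library alone is not enough to get the data you want: it only fetches the initial response (in this case, the HTML source).

If you view the page source (in Chrome: view-source:https://www.oaklandconduit.com/development_map), you'll see an empty div like this: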

<div id='map'></div>

This is the placeholder div for the map.
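Because that div is empty in the static HTML, the marker data has to come from one of the scripts the page loads. Below is a minimal sketch, using only the standard library, for listing the external script URLs referenced by a page so you can look for the one that holds the data. The names `ScriptSrcCollector` and `script_sources`, the sample HTML, and the example.com URLs are illustrative assumptions, not taken from the real page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ScriptSrcCollector(HTMLParser):
    """Collect the src attribute of every <script> tag (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            src = dict(attrs).get("src")
            if src:  # inline scripts have no src and are skipped
                self.sources.append(urljoin(self.base_url, src))


def script_sources(html, base_url):
    """Return absolute URLs of all external scripts referenced in html."""
    collector = ScriptSrcCollector(base_url)
    collector.feed(html)
    return collector.sources


# Made-up HTML standing in for the downloaded page source.
sample_html = """
<html><body>
<div id='map'></div>
<script src="/js/mapscript.js"></script>
<script>var inline = 1;</script>
</body></html>
"""

print(script_sources(sample_html, "https://example.com/page"))
# → ['https://example.com/js/mapscript.js']
```

With the real page you would pass `requests.get(url).text` as `html` and then inspect each listed script for the marker data.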

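You'll want to use an approach that lets the map load and then lets you interact with it programmatically. Selenium can do this for you, but it will be much slower than requests, because it has to drive a real browser to provide that interactivity.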
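A rough sketch of the Selenium route follows. It assumes the map uses Leaflet's default class names (`.leaflet-marker-icon` for markers, `.leaflet-popup-content` for popup bodies), that a click on a marker opens its popup, and that Chrome with a matching driver is installed. None of this has been verified against the page in the question, and the function names `collect_popup_texts` and `scrape_live` are made up for illustration.

```python
CSS = "css selector"  # the string value of selenium's By.CSS_SELECTOR


def collect_popup_texts(driver, marker_css=".leaflet-marker-icon",
                        popup_css=".leaflet-popup-content"):
    """Click every marker and return the text of the popup each click opens."""
    texts = []
    for marker in driver.find_elements(CSS, marker_css):
        # A JavaScript click sidesteps "element not clickable" errors
        # when markers overlap each other.
        driver.execute_script("arguments[0].click();", marker)
        popups = driver.find_elements(CSS, popup_css)
        if popups:
            # textContent is readable even while the popup is still fading in.
            texts.append(popups[-1].get_attribute("textContent").strip())
    return texts


def scrape_live(url, wait_seconds=10):
    """Open the page in Chrome, wait for markers to render, collect popups."""
    # Imported here so the helper above can be exercised without Selenium.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Block until at least one marker has been added to the DOM.
        WebDriverWait(driver, wait_seconds).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, ".leaflet-marker-icon"))
        )
        return collect_popup_texts(driver)
    finally:
        driver.quit()
```

Calling `scrape_live('https://www.oaklandconduit.com/development_map')` would be the intended usage. If the markers are clustered or the popups are bound differently, the selectors will need adjusting.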

Answer 2

I went ahead and used regex to parse the map content into JSON. In case it helps anyone else, here is my approach with comments.

import re, requests, json
url = 'https://ebrrd.nationbuilder.com/themes/3/58597f55b92871671e000000/0/attachments/14822603711537993218/default' \
      '/mapscript.js'
# .text gives a decoded str, which the str regex patterns below require on Python 3
r = requests.get(url).text
# use regex to get the geoJson literal and replace single quotes with double quotes
content = re.findall(r'var geoJson.*?=\s*(.*?)// Add custom popups', r, re.DOTALL | re.MULTILINE)[0].replace("'", '"')
# add quotes to the bare key "type" and remove the trailing tab from the "description" values
content = re.sub(r"(type):", r'"type":', content).replace('\t', '')
# drop the last 5 characters (the trailing ";" and what follows it in this file)
content = content[:-5]
json_content = json.loads(content)

I'm also open to other, more Pythonic approaches.
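One possible alternative, offered as a sketch rather than a drop-in replacement: instead of slicing off a fixed number of characters, find the end of the literal by matching brackets, then convert it to JSON one string at a time so that quotes inside values are not mangled. The helper names and the sample script below are made up for illustration. The converter handles only single-quoted strings, bare keys, and trailing commas, and it assumes the literal contains no comments.

```python
import json
import re


def extract_literal(source, var_name):
    """Return the [...] or {...} literal assigned to `var <var_name> = ...`."""
    match = re.search(r"var\s+" + re.escape(var_name) + r"\s*=\s*(?=[\[{])", source)
    if match is None:
        raise ValueError(f"no array/object literal assigned to {var_name!r}")
    start, depth, quote = match.end(), 0, None
    i = start
    while i < len(source):
        ch = source[i]
        if quote:                  # inside a string: only look for its end
            if ch == "\\":
                i += 1             # skip the escaped character
            elif ch == quote:
                quote = None
        elif ch in "'\"":
            quote = ch
        elif ch in "[{":
            depth += 1
        elif ch in "]}":
            depth -= 1
            if depth == 0:
                return source[start:i + 1]
        i += 1
    raise ValueError("unbalanced brackets")


def _fix_code(segment):
    """Quote bare keys and drop trailing commas in text outside strings."""
    segment = re.sub(r"([{,]\s*)([A-Za-z_$][\w$]*)(\s*:)", r'\1"\2"\3', segment)
    return re.sub(r",(\s*[\]}])", r"\1", segment)


def js_literal_to_json(literal):
    """Rewrite a simple JavaScript data literal as JSON text."""
    out, code, i = [], [], 0
    while i < len(literal):
        ch = literal[i]
        if ch not in "'\"":
            code.append(ch)
            i += 1
            continue
        out.append(_fix_code("".join(code)))   # flush text before the string
        code = []
        quote, i, chars = ch, i + 1, []
        while literal[i] != quote:
            if literal[i] == "\\" and literal[i + 1] == "'":
                chars.append("'")               # \' is not a valid JSON escape
                i += 2
            elif literal[i] == "\\":
                chars.append(literal[i:i + 2])  # keep other escapes unchanged
                i += 2
            elif literal[i] == '"':
                chars.append('\\"')             # bare " in a single-quoted string
                i += 1
            else:
                chars.append(literal[i])
                i += 1
        out.append('"' + "".join(chars) + '"')
        i += 1                                  # skip the closing quote
    out.append(_fix_code("".join(code)))
    return "".join(out)


# Made-up script text standing in for the downloaded mapscript.js.
sample_js = """
var geoJson = [{
    type: 'Feature',
    "geometry": {type: 'Point', coordinates: [-122.27, 37.80]},
    properties: {title: 'Joe\\'s "Lot"', description: 'Mixed use\t',},
}];
// Add custom popups
"""

# strict=False lets json accept raw tab characters inside string values.
features = json.loads(js_literal_to_json(extract_literal(sample_js, "geoJson")),
                      strict=False)
print(features[0]["properties"]["title"])
```

With the real file you would pass `requests.get(url).text` as `source`. Values may still carry stray whitespace such as the trailing tab, so call `.strip()` on them as needed.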

Scraping data from an interactive map
