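I'm new to BeautifulSoup4 and can't work out how to extract the latitude and longitude values from the HTML response returned by the following code.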
import requests
from bs4 import BeautifulSoup
url = 'http://cinematreasures.org/theaters/united-states?page=1'
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
links = soup.find_all("tr")
print(links)
This code prints a response like the following many times, once per table row:
<tr class="even location theater" data="{id: 0, point: {lng: -94.1751038, lat: 36.0848965}
Full response:
<tr>\n
<th id="theater_name"><a href="/theaters/united-states?sort=name&order=desc">\u2191 Name</a>
</th>\n
<th id="theater_location"><a href="/theaters/united-states?sort=location&order=asc">Location</a>
</th>\n
<th id="theater_status"><a href="/theaters/united-states?sort=open&order=desc">Status</a>
</th>\n
<th id="theater_screens"><a href="/theaters/united-states?sort=screens&order=asc">Screens</a>
</th>\n</tr>,
<tr class="even location theater" data="{id: 0, point: {lng: -94.1751038, lat: 36.0848965}, category: 'open'}">\n
<td class="name">\n
<a class="map-link" href="/theaters/8775">
<img alt="112 Drive-In" height="48" src="http://photos.cinematreasures.org/production/photos/22137/1313612883/thumb.JPG?1313612883" width="48" />
</a>\n<a class="map-link" href="/theaters/8775">112 Drive-In</a>\n
<div class="info-box">\n
<div class="photo" style="float: left;">
<a href="/theaters/8775">
<img alt="thumb" height="48" src="http://photos.cinematreasures.org/production/photos/22137/1313612883/thumb.JPG?1313612883" width="48" />
</a>
</div>\n
<p style="min-width: 200px !important;">\n<strong><a href="/theaters/8775">112 Drive-In</a></strong>\n
<br>\n 3352 Highway 112 North
<br>Fayetteville, AR 72702
<br>United States
<br>479.442.4542
<br>\n</br>
</br>
</br>
</br>
</br>
</p>\n</div>\n</td>\n
<td class="location">\n Fayetteville, AR, United States\n</td>\n
<td class="status">\n Open\n</td>\n
<td class="screens">\n 1\n</td>\n</tr>
How can I get the lng and lat values from this response?
Thanks in advance.
Answer 0 (score: 2)
Here is my approach:
import requests
import demjson
from bs4 import BeautifulSoup
url = 'http://cinematreasures.org/theaters/united-states?page=1'
page = requests.get(url)
soup = BeautifulSoup(page.text, "html.parser")
to_plain_coord = lambda d: (d['point']['lng'], d['point']['lat'])
# Grabbing theater coords if `data` attribute exists
coords = [
to_plain_coord(demjson.decode(t.attrs['data']))
for t in soup.select('.theater')
if 'data' in t.attrs]
print(coords)
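I didn't use any string manipulation. Instead, I load the JSON from the data attribute. Unfortunately, the JSON here is not strictly valid (the keys are unquoted and the strings use single quotes), so I use the demjson library, which tolerates this, to parse it.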
pip install demjson
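If installing demjson is not an option, the same idea can be approximated with the standard library alone. The following is only a minimal sketch, not part of the original answer: the helper name parse_loose_object is hypothetical, and it assumes the data attribute always looks like the sample in the question (bare identifier keys, single-quoted strings, no quotes or colons inside the values).

```python
import json
import re

def parse_loose_object(text):
    """Parse a JS-style object literal with bare keys and single-quoted strings.

    Hypothetical helper: it only handles input shaped like the sample
    data attribute from the question.
    """
    # Wrap each bare key in double quotes, e.g. {id: 0  ->  {"id": 0
    quoted_keys = re.sub(r'([{,]\s*)([A-Za-z_]\w*)\s*:', r'\1"\2":', text)
    # Swap single quotes for double quotes (assumes no quotes inside values)
    return json.loads(quoted_keys.replace("'", '"'))

# Sample value of the data attribute, copied from the question
sample = "{id: 0, point: {lng: -94.1751038, lat: 36.0848965}, category: 'open'}"
record = parse_loose_object(sample)
coord = (record["point"]["lng"], record["point"]["lat"])
print(coord)
```

This is more fragile than a tolerant parser such as demjson, so it is probably only worth using when the attribute format is known to be stable.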
Answer 1 (score: 1)
OK, so you are grabbing all the <tr> elements correctly; now we just need to get the data attribute from each of them.
import re
import requests
from bs4 import BeautifulSoup
url = 'http://cinematreasures.org/theaters/united-states?page=1'
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")
theaters = soup.find_all("tr", class_="theater")
# Keep only the rows that actually carry a data attribute
data = [t.get('data') for t in theaters if t.get('data')]
print(data)
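Unfortunately, this gives you a list of strings rather than the dict objects one might hope for. We can run a regular expression over the data strings to turn them into dicts (thanks to RootTwo):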
coords = []
for d in data:
    c = dict(re.findall(r'(lat|lng):\s*(-?\d{1,3}\.\d+)', d))
    coords.append(c)
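To see what the regular expression yields without hitting the site, here is a small offline sketch run against the sample data string from the question. The function name extract_coords and the second, coordinate-free sample string are made up for illustration; converting the matches to float is an added step, since re.findall itself returns strings.

```python
import re

# Same pattern as above: captures the key (lat or lng) and its decimal value
COORD_RE = re.compile(r'(lat|lng):\s*(-?\d{1,3}\.\d+)')

def extract_coords(data_strings):
    """Return one {'lat': float, 'lng': float} dict per string that has both."""
    coords = []
    for d in data_strings:
        pairs = dict(COORD_RE.findall(d))
        # Skip strings where either coordinate is missing
        if "lat" in pairs and "lng" in pairs:
            coords.append({"lat": float(pairs["lat"]), "lng": float(pairs["lng"])})
    return coords

samples = [
    "{id: 0, point: {lng: -94.1751038, lat: 36.0848965}, category: 'open'}",
    "{id: 1}",  # hypothetical row without coordinates, to show it is skipped
]
result = extract_coords(samples)
print(result)
```

Note that the pattern only matches values with a decimal point, so a whole-number coordinate such as lat: 36 would be skipped.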
Answer 2 (score: -1)
If you are only expecting a single response, do the following:
print(links[0])