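I am trying to use BeautifulSoup to scrape the following pages (e.g. 1, 2) and get the list of actions needed to travel from one place in Bangkok to another.
I can already query the page and select the route description as follows.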
url = 'http://www.transitbangkok.com/showBestRoute.php?from=Sutthawat+-+Arun+Amarin+Intersection&to=Sukhumvit&originSelected=true&destinationSelected=true&lang=en'
route_request = requests.get(url)
soup_route = BeautifulSoup(route_request.content, 'lxml')
descriptions = soup_route.find('div', attrs={'id': 'routeDescription'})
The HTML of descriptions looks like this:
<div id="routeDescription">
...
<br/>
<img src="/images/walk_icon_small.PNG" style="vertical-align:middle;padding-right: 10px;margin-right: 0px;"/>Walk by foot to <b>Sanam Luang</b>
<br/>
<img src="/images/bus_icon_semi_small.gif" style="vertical-align:middle;padding-right: 10px;margin-right: 0px;"/>Travel to <b>Khok Wua</b> using the line(s): <b><a href="lines/bangkok-bus-line/2">2</a></b> or <a href="lines/bangkok-bus-line/15">15</a> or <a href="lines/bangkok-bus-line/44">44</a> or <a href="lines/bangkok-bus-line/47">47</a> or <a href="lines/bangkok-bus-line/59">59</a> or <a href="lines/bangkok-bus-line/201">201</a> or <a href="lines/bangkok-bus-line/203">203</a> or <a href="lines/bangkok-bus-line/512">512</a><br/>
...
</div>
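What I want is a list of the actions, each with the bus lines that go to the next location (the question has been updated, but is still unsolved).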
route_descrtions = []
for description in descriptions.find_all('img'):
    action = description.next_sibling
    to_station = action.next_sibling
    n = action.find_next_siblings('a')
    if 'travel' in action.lower():
        lines = [to_station.find_next('b').text] + [a.contents[0] for a in n]
    else:
        lines = []
    desp = {'action': action,
            'to': to_station.text,
            'lines': lines}
    route_descrtions.append(desp)
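However, I don't know how to loop over the links that follow each action (each Travel to action) and append them to my list. I tried find_next('a') and find_next_siblings('a'), but neither did the job.
Output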
[{'action': 'Walk by foot to ', 'lines': [], 'to': 'Wang Lang (Siriraj)'},
{'action': 'Travel to ',
'lines': ['Chao Phraya Express Boat', '40', '48', '501', '508'],
'to': 'Si Phraya'},
{'action': 'Walk by foot to ', 'lines': [], 'to': 'Sheraton Royal Orchid'},
{'action': 'Travel to ',
'lines': ['16', '40', '48', '501', '508'],
'to': 'Siam'},
{'action': 'Travel to ',
'lines': ['BTS - Sukhumvit', '40', '48', '501', '508'],
'to': 'Asok'},
{'action': 'Walk by foot to ', 'lines': [], 'to': 'Sukhumvit'}]
Desired output
[{'action': 'Walk by foot to ', 'lines': [], 'to': 'Wang Lang (Siriraj)'},
{'action': 'Travel to ',
'lines': ['Chao Phraya Express Boat'],
...
Answer 0 (score: 1)
The following should work:
from bs4 import BeautifulSoup
import requests
import pprint

url = 'http://www.transitbangkok.com/showBestRoute.php?from=Sutthawat+-+Arun+Amarin+Intersection&to=Sukhumvit&originSelected=true&destinationSelected=true&lang=en'
route_request = requests.get(url)
soup_route = BeautifulSoup(route_request.content, 'lxml')
routes = soup_route.find('div', attrs={'id': 'routeDescription'})

parsed_routes = list()
for img in routes.find_all('img'):
    action = img.next_sibling
    to_station = action.next_sibling

    # Collect the <a> siblings that follow this image, stopping at the next image.
    links = list()
    for sibling in img.next_siblings:
        if sibling.name == 'a':
            links.append(sibling)
        elif sibling.name == 'img':
            break

    lines = list()
    if 'travel' in action.lower():
        # The first line sits inside the <b> that follows the station name.
        lines.extend([to_station.find_next('b').text])
        lines.extend([link.contents[0] for link in links])

    parsed_route = {'action': action, 'to': to_station.text, 'lines': lines}
    parsed_routes.append(parsed_route)

pprint.pprint(parsed_routes)
Output:
[{'action': 'Walk by foot to ', 'lines': [], 'to': 'Wang Lang (Siriraj)'},
{'action': 'Travel to ',
'lines': ['Chao Phraya Express Boat'],
'to': 'Si Phraya'},
{'action': 'Walk by foot to ', 'lines': [], 'to': 'Sheraton Royal Orchid'},
{'action': 'Travel to ', 'lines': ['16'], 'to': 'Siam'},
{'action': 'Travel to ',
'lines': ['BTS - Sukhumvit', '40', '48', '501', '508'],
'to': 'Asok'},
{'action': 'Walk by foot to ', 'lines': [], 'to': 'Sukhumvit'}]
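Your key problem is n = action.find_next_siblings('a'), because it returns every link at the same level that comes after your "current" image. Since all the images and all the links are on the same level, that is not what you want.
You were probably thinking of each image as the parent node of its links. In reality, however, the images and the links are all siblings of one another.
When you ask for the images, you get img1, img2 and img3 (in this example). When you ask for all next link siblings, you get exactly that. So if you are at img2 and ask for the next link siblings, you get every link that follows it, including the ones that come after img3.
I hope this explains it. The only real change I made is to loop over the siblings until the next image is found and stop there, so that the outer loop over the images carries on from that point. I also cleaned up the code a little for clarity.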
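The difference between the two approaches can be seen on a small snippet. The following is only a sketch: the markup, the stop names and the helper names all_following_links and links_until_next_img are made up for illustration, and html.parser is used instead of lxml so that no extra parser is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup: every <img>, <b> and <a> is a direct
# child of the same <div>, so they are all siblings of one another.
html = (
    '<div id="routeDescription">'
    '<img src="walk.png"/>Walk by foot to <b>Stop A</b><br/>'
    '<img src="bus.png"/>Travel to <b>Stop B</b> using the line(s): '
    '<a href="l/1">1</a> or <a href="l/2">2</a><br/>'
    '<img src="bus.png"/>Travel to <b>Stop C</b> using the line(s): '
    '<a href="l/3">3</a><br/>'
    '</div>'
)
soup = BeautifulSoup(html, 'html.parser')
images = soup.find('div', id='routeDescription').find_all('img')


def all_following_links(img):
    """Every later <a> sibling, regardless of which image it belongs to."""
    return [a.get_text() for a in img.find_next_siblings('a')]


def links_until_next_img(img):
    """Only the <a> siblings that appear before the next <img>."""
    texts = []
    for sibling in img.next_siblings:
        # Text nodes have no tag name, so fall back to None for them.
        name = getattr(sibling, 'name', None)
        if name == 'img':
            break
        if name == 'a':
            texts.append(sibling.get_text())
    return texts


over_collected = [all_following_links(img) for img in images]
grouped = [links_until_next_img(img) for img in images]

print(over_collected)  # [['1', '2', '3'], ['1', '2', '3'], ['3']]
print(grouped)         # [[], ['1', '2'], ['3']]
```

The first list shows the walking step wrongly picking up all three links, while the second list keeps each link with the image that precedes it.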
Answer 1 (score: 0)
You can try find_next_siblings (using Python 2.7):
import bs4
text = '''<img src="/images/bus_icon_semi_small.gif" style="vertical-align:middle;padding-right: 10px;margin-right: 0px;"/>Travel to <b>Khok Wua</b> using the line(s): <b><a href="lines/bangkok-bus-line/2">2</a></b> or <a href="lines/bangkok-bus-line/15">15</a> or <a href="lines/bangkok-bus-line/44">44</a> or <a href="lines/bangkok-bus-line/47">47</a> or <a href="lines/bangkok-bus-line/59">59</a> or <a href="lines/bangkok-bus-line/201">201</a> or <a href="lines/bangkok-bus-line/203">203</a> or <a href="lines/bangkok-bus-line/512">512</a><br/>x`x'''
soup = bs4.BeautifulSoup(text, 'lxml')
img = soup.find('img')
action = img.next_sibling
to_station = action.next_sibling
n = to_station.find_next_siblings('a')
d = {
    'action': action,
    'to': to_station.text,
    'buses': [a.contents[0] for a in n]
}
Result:
{'action': u'Travel to ', 'to': u'Khok Wua', 'buses': [u'15', u'44', u'47', u'59', u'201', u'203', u'512']}
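Note that line 2 is missing from this result: its link is wrapped in a <b> element, so it is not a direct sibling of to_station. One possible way to pick it up as well is to walk the siblings and also look inside any non-link tag. This is a sketch under assumptions, not part of the original answer: it uses a shortened copy of the snippet, Python 3, and html.parser instead of lxml.

```python
from bs4 import BeautifulSoup, Tag

# Shortened copy of the snippet above; the first line is nested in <b>.
text = (
    '<img src="/images/bus_icon_semi_small.gif"/>Travel to <b>Khok Wua</b> '
    'using the line(s): <b><a href="lines/bangkok-bus-line/2">2</a></b> or '
    '<a href="lines/bangkok-bus-line/15">15</a> or '
    '<a href="lines/bangkok-bus-line/44">44</a><br/>'
)
soup = BeautifulSoup(text, 'html.parser')
img = soup.find('img')
action = img.next_sibling
to_station = action.next_sibling

buses = []
for sibling in to_station.next_siblings:
    if not isinstance(sibling, Tag):
        continue  # skip plain text such as " or "
    if sibling.name == 'img':
        break  # the next step of the route starts here
    if sibling.name == 'a':
        buses.append(sibling.get_text())
    else:
        # e.g. <b><a>2</a></b>: collect links nested inside the tag
        buses.extend(a.get_text() for a in sibling.find_all('a'))

d = {'action': str(action), 'to': to_station.get_text(), 'buses': buses}
print(d)  # {'action': 'Travel to ', 'to': 'Khok Wua', 'buses': ['2', '15', '44']}
```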