Question

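I am trying to scrape the following pages (for example 1, 2) with BeautifulSoup to get the list of actions needed to travel from one place in Bangkok to another.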

Basically, I can run the query and select the description of the trip as follows.

import requests
from bs4 import BeautifulSoup

url = 'http://www.transitbangkok.com/showBestRoute.php?from=Sutthawat+-+Arun+Amarin+Intersection&to=Sukhumvit&originSelected=true&destinationSelected=true&lang=en'
route_request = requests.get(url)
soup_route = BeautifulSoup(route_request.content, 'lxml')
descriptions = soup_route.find('div', attrs={'id': 'routeDescription'})

The HTML of descriptions looks like this:

<div id="routeDescription">
...
<br/>
<img src="/images/walk_icon_small.PNG" style="vertical-align:middle;padding-right: 10px;margin-right: 0px;"/>Walk by foot to <b>Sanam Luang</b>
<br/>
<img src="/images/bus_icon_semi_small.gif" style="vertical-align:middle;padding-right: 10px;margin-right: 0px;"/>Travel to <b>Khok Wua</b> using the line(s): <b><a href="lines/bangkok-bus-line/2">2</a></b> or <a href="lines/bangkok-bus-line/15">15</a> or <a href="lines/bangkok-bus-line/44">44</a> or <a href="lines/bangkok-bus-line/47">47</a> or <a href="lines/bangkok-bus-line/59">59</a> or <a href="lines/bangkok-bus-line/201">201</a> or <a href="lines/bangkok-bus-line/203">203</a> or <a href="lines/bangkok-bus-line/512">512</a><br/>
...
</div>

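Basically, I am trying to get the list of actions, together with the bus lines that lead to the next location (the question has been updated but is still unresolved).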

route_descrtions = []
for description in descriptions.find_all('img'):
    action = description.next_sibling
    to_station = action.next_sibling
    n = action.find_next_siblings('a')
    if 'travel' in action.lower():
        lines = [to_station.find_next('b').text] +  [a.contents[0] for a in n]
    else:
        lines = []
    desp = {'action': action,
            'to': to_station.text,
            'lines': lines}
    route_descrtions.append(desp)

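However, I do not know how to loop over the links that follow each action (the "Travel to" actions) and append them to my list. I tried find_next('a') and find_next_siblings('a'), but neither did the job.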

Output

[{'action': 'Walk by foot to ', 'lines': [], 'to': 'Wang Lang (Siriraj)'},
 {'action': 'Travel to ',
  'lines': ['Chao Phraya Express Boat', '40', '48', '501', '508'],
  'to': 'Si Phraya'},
 {'action': 'Walk by foot to ', 'lines': [], 'to': 'Sheraton Royal Orchid'},
 {'action': 'Travel to ',
  'lines': ['16', '40', '48', '501', '508'],
  'to': 'Siam'},
 {'action': 'Travel to ',
  'lines': ['BTS - Sukhumvit', '40', '48', '501', '508'],
  'to': 'Asok'},
 {'action': 'Walk by foot to ', 'lines': [], 'to': 'Sukhumvit'}]

Desired output

[{'action': 'Walk by foot to ', 'lines': [], 'to': 'Wang Lang (Siriraj)'},
 {'action': 'Travel to ',
  'lines': ['Chao Phraya Express Boat'],
 ...

Answer 1

The following should work:

from bs4 import BeautifulSoup
import requests
import pprint

url = 'http://www.transitbangkok.com/showBestRoute.php?from=Sutthawat+-+Arun+Amarin+Intersection&to=Sukhumvit&originSelected=true&destinationSelected=true&lang=en'
route_request = requests.get(url)
soup_route = BeautifulSoup(route_request.content, 'lxml')
routes = soup_route.find('div', attrs={'id': 'routeDescription'})

parsed_routes = list()
for img in routes.find_all('img'):
    action = img.next_sibling
    to_station = action.next_sibling
    links = list()
    for sibling in img.next_siblings:
        if sibling.name == 'a':
            links.append(sibling)
        elif sibling.name == 'img':
            break

    lines = list()
    if 'travel' in action.lower():
        lines.extend([to_station.find_next('b').text])
        lines.extend([link.contents[0] for link in links])

    parsed_route = {'action': action, 'to': to_station.text, 'lines': lines}
    parsed_routes.append(parsed_route)

pprint.pprint(parsed_routes)

Output:

[{'action': 'Walk by foot to ', 'lines': [], 'to': 'Wang Lang (Siriraj)'},
 {'action': 'Travel to ',
  'lines': ['Chao Phraya Express Boat'],
  'to': 'Si Phraya'},
 {'action': 'Walk by foot to ', 'lines': [], 'to': 'Sheraton Royal Orchid'},
 {'action': 'Travel to ', 'lines': ['16'], 'to': 'Siam'},
 {'action': 'Travel to ',
  'lines': ['BTS - Sukhumvit', '40', '48', '501', '508'],
  'to': 'Asok'},
 {'action': 'Walk by foot to ', 'lines': [], 'to': 'Sukhumvit'}]

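Your key problem is n = action.find_next_siblings('a'): it returns every link at the same level that comes after your "current" image. Since all the images and all the links are at the same level, that is not what you want.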
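To make the over-collection concrete, here is a minimal sketch on a simplified, made-up fragment (the markup, the file name bus.gif and the href values are illustrative assumptions, not the real page). It uses the built-in 'html.parser' so it runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup: two "Travel to" steps at the same level.
html = (
    '<div id="routeDescription">'
    '<img src="bus.gif"/>Travel to <b>Siam</b> using the line(s): '
    '<a href="lines/16">16</a><br/>'
    '<img src="bus.gif"/>Travel to <b>Asok</b> using the line(s): '
    '<a href="lines/40">40</a> or <a href="lines/48">48</a><br/>'
    '</div>'
)

soup = BeautifulSoup(html, 'html.parser')
first_img = soup.find('img')

# find_next_siblings('a') does not stop at the next <img>; it keeps going
# to the end of the parent, so links from the second step leak in.
leaked = [a.get_text() for a in first_img.find_next_siblings('a')]
print(leaked)  # ['16', '40', '48']
```

Only '16' belongs to the first step; '40' and '48' belong to the second one.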

You were probably thinking of the images as parent nodes of the links, something like this:

IMG1
- LINK1
IMG2
- LINK2
IMG3
- LINK3
- LINK4
- LINK5

In reality, however, it looks more like this:

IMG1
LINK1
IMG2
LINK2
IMG3
LINK3
LINK4
LINK5

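When you ask for the images, you get IMG1, IMG2 and IMG3 (in this example). When you ask for all the following link siblings, you get exactly that. So if you are at IMG2 and ask for the following link siblings, you get all of them, i.e.: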

IMG1
LINK1
IMG2 < you are here, and you get...
LINK2 < this,
IMG3 - (not this one, because it is not a link)
LINK3 < this,
LINK4 < this, and
LINK5 < this

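I hope that explains it. The only change I made is to loop over the siblings until the next image is found and stop there, so that your outer loop over the images carries on from that point. I also cleaned up the code a bit for clarity.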
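The "stop at the next image" idea can also be written with itertools.takewhile. This is only a sketch of the same technique, not a replacement for the code above: the helper name links_for_step and the sample markup are hypothetical, and 'html.parser' is used so the snippet runs without lxml.

```python
from itertools import takewhile
from bs4 import BeautifulSoup

# Hypothetical, simplified markup: two steps at the same nesting level.
html = (
    '<div id="routeDescription">'
    '<img src="bus.gif"/>Travel to <b>Siam</b> using the line(s): '
    '<a href="lines/16">16</a><br/>'
    '<img src="bus.gif"/>Travel to <b>Asok</b> using the line(s): '
    '<a href="lines/40">40</a> or <a href="lines/48">48</a><br/>'
    '</div>'
)


def links_for_step(img):
    """Return the text of the <a> siblings between img and the next <img>."""
    # getattr guards against text nodes, which are not tags.
    same_step = takewhile(
        lambda node: getattr(node, 'name', None) != 'img',
        img.next_siblings,
    )
    return [node.get_text() for node in same_step
            if getattr(node, 'name', None) == 'a']


soup = BeautifulSoup(html, 'html.parser')
per_step = [links_for_step(img) for img in soup.find_all('img')]
print(per_step)  # [['16'], ['40', '48']]
```

Each image now only collects the links that appear before the next image, which is the grouping the question asks for.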

Answer 2

You can try find_next_siblings (using Python 2.7):

import bs4

text = '''<img src="/images/bus_icon_semi_small.gif" style="vertical-align:middle;padding-right: 10px;margin-right: 0px;"/>Travel to <b>Khok Wua</b> using the line(s): <b><a href="lines/bangkok-bus-line/2">2</a></b> or <a href="lines/bangkok-bus-line/15">15</a> or <a href="lines/bangkok-bus-line/44">44</a> or <a href="lines/bangkok-bus-line/47">47</a> or <a href="lines/bangkok-bus-line/59">59</a> or <a href="lines/bangkok-bus-line/201">201</a> or <a href="lines/bangkok-bus-line/203">203</a> or <a href="lines/bangkok-bus-line/512">512</a><br/>x`x'''

soup = bs4.BeautifulSoup(text, 'lxml')
img = soup.find('img')
action = img.next_sibling
to_station = action.next_sibling
n = to_station.find_next_siblings('a')
d = {
    'action': action,
    'to': to_station.text,
    'buses': [a.contents[0] for a in n]
}

Result:

{'action': u'Travel to ', 'to': u'Khok Wua', 'buses': [u'15', u'44', u'47', u'59', u'201', u'203', u'512']}
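Note that line 2 is missing from the result above: in the sample markup the first link is wrapped in <b>, so it is not a sibling of to_station. A possible way to pick it up as well is sketched below, assuming Python 3 and an abridged copy of the same markup; the parser choice ('html.parser') and the variable names are assumptions made for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Abridged copy of the sample markup; the first link sits inside <b>.
html = (
    '<img src="/images/bus_icon_semi_small.gif"/>Travel to <b>Khok Wua</b> '
    'using the line(s): <b><a href="lines/bangkok-bus-line/2">2</a></b> or '
    '<a href="lines/bangkok-bus-line/15">15</a> or '
    '<a href="lines/bangkok-bus-line/44">44</a><br/>'
)

soup = BeautifulSoup(html, 'html.parser')
img = soup.find('img')
to_station = img.find_next_sibling('b')

buses = []
for node in to_station.next_siblings:
    if not isinstance(node, Tag):
        continue  # skip plain text such as ' or '
    if node.name == 'img':
        break  # the next step starts here
    if node.name == 'a':
        buses.append(node.get_text())
    else:
        # Look inside wrappers such as <b> for nested links.
        buses.extend(a.get_text() for a in node.find_all('a'))

print(buses)  # ['2', '15', '44']
```

The break on img keeps the loop from running into the next step when the fragment is part of a longer route description.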
