BeautifulSoup at a given tag

Time: 2017-04-09 03:57:03

Tags: python html beautifulsoup

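I am trying to use BeautifulSoup to scrape the following pages (e.g. 1, 2) to get the list of actions needed to travel from one place in Bangkok to another.

Basically, I can query the page and select the route description as follows.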

import requests
from bs4 import BeautifulSoup

url = 'http://www.transitbangkok.com/showBestRoute.php?from=Sutthawat+-+Arun+Amarin+Intersection&to=Sukhumvit&originSelected=true&destinationSelected=true&lang=en'
route_request = requests.get(url)
soup_route = BeautifulSoup(route_request.content, 'lxml')
descriptions = soup_route.find('div', attrs={'id': 'routeDescription'})

The HTML of descriptions looks like this:

<div id="routeDescription">
...
<br/>
<img src="/images/walk_icon_small.PNG" style="vertical-align:middle;padding-right: 10px;margin-right: 0px;"/>Walk by foot to <b>Sanam Luang</b>
<br/>
<img src="/images/bus_icon_semi_small.gif" style="vertical-align:middle;padding-right: 10px;margin-right: 0px;"/>Travel to <b>Khok Wua</b> using the line(s): <b><a href="lines/bangkok-bus-line/2">2</a></b> or <a href="lines/bangkok-bus-line/15">15</a> or <a href="lines/bangkok-bus-line/44">44</a> or <a href="lines/bangkok-bus-line/47">47</a> or <a href="lines/bangkok-bus-line/59">59</a> or <a href="lines/bangkok-bus-line/201">201</a> or <a href="lines/bangkok-bus-line/203">203</a> or <a href="lines/bangkok-bus-line/512">512</a><br/>
...
</div>

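Basically, I am trying to get the list of actions, plus the bus lines that go to the next location (the question has been updated but is still unsolved).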

route_descrtions = []
for description in descriptions.find_all('img'):
    action = description.next_sibling
    to_station = action.next_sibling
    n = action.find_next_siblings('a')
    if 'travel' in action.lower():
        lines = [to_station.find_next('b').text] +  [a.contents[0] for a in n]
    else:
        lines = []
    desp = {'action': action,
            'to': to_station.text,
            'lines': lines}
    route_descrtions.append(desp)

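However, I don't know how to loop over the links that follow each action (the "Travel to" actions) and append them to my list. I tried find_next('a') and find_next_siblings('a'), but neither did the job.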

Output

[{'action': 'Walk by foot to ', 'lines': [], 'to': 'Wang Lang (Siriraj)'},
 {'action': 'Travel to ',
  'lines': ['Chao Phraya Express Boat', '40', '48', '501', '508'],
  'to': 'Si Phraya'},
 {'action': 'Walk by foot to ', 'lines': [], 'to': 'Sheraton Royal Orchid'},
 {'action': 'Travel to ',
  'lines': ['16', '40', '48', '501', '508'],
  'to': 'Siam'},
 {'action': 'Travel to ',
  'lines': ['BTS - Sukhumvit', '40', '48', '501', '508'],
  'to': 'Asok'},
 {'action': 'Walk by foot to ', 'lines': [], 'to': 'Sukhumvit'}]

Desired output

[{'action': 'Walk by foot to ', 'lines': [], 'to': 'Wang Lang (Siriraj)'},
 {'action': 'Travel to ',
  'lines': ['Chao Phraya Express Boat'],
 ...

2 answers:

Answer 0: (score: 1)

The following should work:

from bs4 import BeautifulSoup
import requests
import pprint

url = 'http://www.transitbangkok.com/showBestRoute.php?from=Sutthawat+-+Arun+Amarin+Intersection&to=Sukhumvit&originSelected=true&destinationSelected=true&lang=en'
route_request = requests.get(url)
soup_route = BeautifulSoup(route_request.content, 'lxml')
routes = soup_route.find('div', attrs={'id': 'routeDescription'})

parsed_routes = list()
for img in routes.find_all('img'):
    action = img.next_sibling
    to_station = action.next_sibling
    links = list()
    for sibling in img.next_siblings:
        if sibling.name == 'a':
            links.append(sibling)
        elif sibling.name == 'img':
            break

    lines = list()
    if 'travel' in action.lower():
        lines.extend([to_station.find_next('b').text])
        lines.extend([link.contents[0] for link in links])

    parsed_route = {'action': action, 'to': to_station.text, 'lines': lines}
    parsed_routes.append(parsed_route)

pprint.pprint(parsed_routes)

Output:

[{'action': 'Walk by foot to ', 'lines': [], 'to': 'Wang Lang (Siriraj)'},
 {'action': 'Travel to ',
  'lines': ['Chao Phraya Express Boat'],
  'to': 'Si Phraya'},
 {'action': 'Walk by foot to ', 'lines': [], 'to': 'Sheraton Royal Orchid'},
 {'action': 'Travel to ', 'lines': ['16'], 'to': 'Siam'},
 {'action': 'Travel to ',
  'lines': ['BTS - Sukhumvit', '40', '48', '501', '508'],
  'to': 'Asok'},
 {'action': 'Walk by foot to ', 'lines': [], 'to': 'Sukhumvit'}]

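Your key problem is n = action.find_next_siblings('a'), because it returns every link at the same level that comes after your "current" image. Since all the images and all the links sit at the same level, that is not what you want.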

You were probably thinking of the images as parent nodes of the links. Something like:

  • img1
    • link1
  • img2
    • link2
  • img3
    • link3
    • link4
    • link5

In reality, however, it looks more like this:

  • img1
  • link1
  • img2
  • link2
  • img3
  • link3
  • link4
  • link5

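When you asked for the images, you got img1, img2 and img3 (in this example). When you ask for all the following link siblings, that is exactly what you get. So if you are at img2 and ask for the next link siblings, you get all of them, i.e.: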

  • img1
  • link1
  • img2 < you are here, and you get...
  • link2 < this,
  • img3 (not this one, because it is not a link)
  • link3 < this,
  • link4 < this, and
  • link5 < this
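The behaviour walked through above can be reproduced on a small fragment. The sketch below is only an illustration: the markup is made up (the image file names, the stations A/B/C and the line numbers are placeholders, not taken from the real page), and it uses the built-in 'html.parser' instead of lxml so that it has no extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup: three steps whose nodes are all
# direct children (siblings) of one <div>.
html = (
    '<div id="routeDescription">'
    '<img src="walk.png"/>Walk by foot to <b>A</b><br/>'
    '<img src="bus.png"/>Travel to <b>B</b> using the line(s): '
    '<a href="l/1">1</a> or <a href="l/2">2</a><br/>'
    '<img src="bus.png"/>Travel to <b>C</b> using the line(s): '
    '<a href="l/3">3</a><br/>'
    '</div>'
)

soup = BeautifulSoup(html, 'html.parser')
routes = soup.find('div', attrs={'id': 'routeDescription'})

links_after = []
for img in routes.find_all('img'):
    action = img.next_sibling  # the text node right after the image
    # find_next_siblings('a') keeps going past later <img> tags.
    links_after.append([a.get_text() for a in action.find_next_siblings('a')])

print(links_after)
# → [['1', '2', '3'], ['1', '2', '3'], ['3']]
```

Even the walking step picks up every later link, which matches the repeated line numbers in the question's output.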

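I hope that explains it. The only change I made is to loop over the siblings until an image is found and stop there; your outer image loop then carries on from that point. I also cleaned up the code a little for clarity.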
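The same "stop at the next image" idea can also be written with itertools.takewhile. This is a sketch of an alternative formulation, not the code from this answer: the helper names is_tag and links_for_step are hypothetical, and the markup is a made-up stand-in for the real page.

```python
from itertools import takewhile

from bs4 import BeautifulSoup, Tag

# Hypothetical, simplified markup: three steps, all siblings in one <div>.
html = (
    '<div id="routeDescription">'
    '<img src="walk.png"/>Walk by foot to <b>A</b><br/>'
    '<img src="bus.png"/>Travel to <b>B</b> using the line(s): '
    '<a href="l/1">1</a> or <a href="l/2">2</a><br/>'
    '<img src="bus.png"/>Travel to <b>C</b> using the line(s): '
    '<a href="l/3">3</a><br/>'
    '</div>'
)


def is_tag(node, name):
    """True if node is an element (not a text node) with the given tag name."""
    return isinstance(node, Tag) and node.name == name


def links_for_step(img):
    """Text of each <a> sibling between img and the next <img>."""
    # Consume siblings only up to (not including) the next image.
    step_nodes = takewhile(lambda node: not is_tag(node, 'img'), img.next_siblings)
    return [node.get_text() for node in step_nodes if is_tag(node, 'a')]


soup = BeautifulSoup(html, 'html.parser')
steps = [links_for_step(img) for img in soup.find_all('img')]
print(steps)
# → [[], ['1', '2'], ['3']]
```

Note that this only sees links that are direct siblings of the image; a link wrapped in another tag, such as the first line inside <b> on the real page, still needs separate handling, as the answer's code does with find_next('b').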

Answer 1: (score: 0)

You can try find_next_siblings (using Python 2.7):

import bs4

text = '''<img src="/images/bus_icon_semi_small.gif" style="vertical-align:middle;padding-right: 10px;margin-right: 0px;"/>Travel to <b>Khok Wua</b> using the line(s): <b><a href="lines/bangkok-bus-line/2">2</a></b> or <a href="lines/bangkok-bus-line/15">15</a> or <a href="lines/bangkok-bus-line/44">44</a> or <a href="lines/bangkok-bus-line/47">47</a> or <a href="lines/bangkok-bus-line/59">59</a> or <a href="lines/bangkok-bus-line/201">201</a> or <a href="lines/bangkok-bus-line/203">203</a> or <a href="lines/bangkok-bus-line/512">512</a><br/>x`x'''

soup = bs4.BeautifulSoup(text, 'lxml')
img = soup.find('img')
action = img.next_sibling
to_station = action.next_sibling
n = to_station.find_next_siblings('a')
d = {
    'action': action,
    'to': to_station.text,
    'buses': [a.contents[0] for a in n]
}

Result:

{'action': u'Travel to ', 'to': u'Khok Wua', 'buses': [u'15', u'44', u'47', u'59', u'201', u'203', u'512']}
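Note that the result above has no '2': in the markup that first link is wrapped in <b>, so it is not a sibling of to_station and find_next_siblings('a') skips it. One possible way to pick it up as well, shown here as a sketch on a shortened copy of the snippet (Python 3, 'html.parser' instead of lxml, both my assumptions), is to walk the siblings and also look inside any non-link tag:

```python
from bs4 import BeautifulSoup, Tag

# Shortened copy of the snippet above; the first line number is wrapped in <b>.
text = (
    '<img src="/images/bus_icon_semi_small.gif"/>Travel to <b>Khok Wua</b> '
    'using the line(s): <b><a href="lines/bangkok-bus-line/2">2</a></b> or '
    '<a href="lines/bangkok-bus-line/15">15</a> or '
    '<a href="lines/bangkok-bus-line/44">44</a><br/>'
)

soup = BeautifulSoup(text, 'html.parser')
img = soup.find('img')
action = img.next_sibling
to_station = action.next_sibling

buses = []
for node in to_station.next_siblings:
    if not isinstance(node, Tag):
        continue  # skip text nodes such as " or "
    if node.name == 'img':
        break  # the next step starts here
    if node.name == 'a':
        buses.append(node.get_text())
    else:
        # e.g. <b><a>2</a></b>: collect links nested inside the tag
        buses.extend(a.get_text() for a in node.find_all('a'))

d = {'action': str(action), 'to': to_station.get_text(), 'buses': buses}
print(d)
# → {'action': 'Travel to ', 'to': 'Khok Wua', 'buses': ['2', '15', '44']}
```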