解析带有美丽汤的HTML以获得href之后的链接

时间:2016-10-06 16:47:52

标签: python html beautifulsoup html-parsing

这是find_all('a')(非常长)的结果:

</a>, <a class="btn text-default text-dark clear_filters pull-right group-ib" href="#" id="export_dialog_close" title="Cancel"><span class="glyphicon glyphicon-remove"></span><span>Cancel</span></a>, <a href="/ais/index/port_moves/all/include_anchs:no/ship_type:7/_:3525d580eade08cfdb72083b248185a9/in_transit:no/time_interval:1474948800.0_1475035200.00/per_page:50/portname:NOVOROSSIYSK/cb:6651/move_type:1/sort:SHIPNAME/direction:asc">Vessel Name</a>, <a href="/ais/index/port_moves/all/include_anchs:no/ship_type:7/_:3525d580eade08cfdb72083b248185a9/in_transit:no/time_interval:1474948800.0_1475035200.00/per_page:50/portname:NOVOROSSIYSK/cb:6651/move_type:1/sort:TIMESTAMP_UTC/direction:asc">Timestamp</a>, <a href="/ais/index/port_moves/all/include_anchs:no/ship_type:7/_:3525d580eade08cfdb72083b248185a9/in_transit:no/time_interval:1474948800.0_1475035200.00/per_page:50/portname:NOVOROSSIYSK/cb:6651/move_type:1/sort:PORT_NAME/direction:asc">Port</a>, <a href="/ais/index/port_moves/all/include_anchs:no/ship_type:7/_:3525d580eade08cfdb72083b248185a9/in_transit:no/time_interval:1474948800.0_1475035200.00/per_page:50/portname:NOVOROSSIYSK/cb:6651/move_type:1/sort:MOVE_TYPE_NAME/direction:asc">Port Call type</a>, <a href="/ais/index/port_moves/all/include_anchs:no/ship_type:7/_:3525d580eade08cfdb72083b248185a9/in_transit:no/time_interval:1474948800.0_1475035200.00/per_page:50/portname:NOVOROSSIYSK/cb:6651/move_type:1/sort:ELAPSED/direction:asc">Time Elapsed</a>, <a href="/en/ais/details/ships/shipid:465271/imo:9495595/mmsi:373571000/vessel:SIDER%20LUCK/_:3525d580eade08cfdb72083b248185a9" title="View details for: SIDER LUCK">SIDER LUCK</a>, <a href="/en/ais/details/ports/767/port_name:NOVOROSSIYSK/_:3525d580eade08cfdb72083b248185a9" title="View details for: NOVOROSSIYSK">NOVOROSSIYSK</a>, <a href="/en/ais/details/ports/163/port_name:MILAZZO/_:3525d580eade08cfdb72083b248185a9" title="View details for: MILAZZO">MILAZZO</a>, <a href="/en/ais/details/ships/shipid:288753/imo:9389693/mmsi:249474000/vessel:OOCL%20ISTANBUL/_:3525d580eade08cfdb72083b248185a9" title="View details for: OOCL ISTANBUL">OOCL ISTANBUL</a>, <a href="/en/ais/details/ports/767/port_name:NOVOROSSIYSK/_:3525d580eade08cfdb72083b248185a9" title="View details for: NOVOROSSIYSK">NOVOROSSIYSK</a>, <a href="/en/ais/details/ports/17436/port_name:AMBARLI/_:3525d580eade08cfdb72083b248185a9" title="View details for: AMBARLI">AMBARLI</a>, <a href="/en/ais/details/ships/shipid:754480/imo:9045613/mmsi:636014098/vessel:TK%20ROTTERDAM/_:3525d580eade08cfdb72083b248185a9" title="View details for: TK ROTTERDAM">TK ROTTERDAM</a>, <a href="/en/ais/details/ports/767/port_name:NOVOROSSIYSK/_:3525d580eade08cfdb72083b248185a9" title="View details for: NOVOROSSIYSK">NOVOROSSIYSK</a>, <a href="/en/ais/details/ports/3504/port_name:DILISKELESI/_:3525d580eade08cfdb72083b248185a9" title="View details for: DILISKELESI">DILISKELESI</a>, <a href="/en/ais/details/ships/shipid:412277/imo:9039585/mmsi:353430000/vessel:SEA%20AEOLIS/_:3525d580eade08cfdb72083b248185a9" title="View details for: SEA AEOLIS">SEA AEOLIS</a>, <a href="/en/ais/details/ports/767/port_name:NOVOROSSIYSK/_:3525d580eade08cfdb72083b248185a9" title="View details for: NOVOROSSIYSK">NOVOROSSIYSK</a>, <a href="/en/ais/details/ports/1/port_name:PIRAEUS/_:3525d580eade08cfdb72083b248185a9" title="View details for: PIRAEUS">PIRAEUS</a>, <a href="/en/ais/details/ships/shipid:346713/imo:7614599/mmsi:273327300/vessel:SOLIDAT/_:3525d580eade08cfdb72083b248185a9" title="View details for: SOLIDAT">SOLIDAT</a>, <a href="/en/ais/details/ports/767/port_name:NOVOROSSIYSK/_:3525d580eade08cfdb72083b248185a9" title="View details for: NOVOROSSIYSK">NOVOROSSIYSK</a>, <a href="/en/ais/details/ports/883/port_name:SEVASTOPOL/_:3525d580eade08cfdb72083b248185a9" title="View details for: SEVASTOPOL">SEVASTOPOL</a>, <a href="/en/ais/details/ships/shipid:752974/imo:9195298/mmsi:636011072/vessel:OCEANPRINCESS/_:3525d580eade08cfdb72083b248185a9" title="View details for: OCEANPRINCESS">OCEANPRINCESS</a>, <a href="/en/ais/details/ports/767/port_name:NOVOROSSIYSK/_:3525d580eade08cfdb72083b248185a9" title="View details for: NOVOROSSIYSK">NOVOROSSIYSK</a>, <a href="/en/ais/details/ports/21780/port_name:EREGLI/_:3525d580eade08cfdb72083b248185a9" title="View details for: EREGLI">EREGLI</a>, <a href="/en/ais/details/ships/shipid:201260/imo:9385075/mmsi:235102768/vessel:EMERALD%20BAY/_:3525d580eade08cfdb72083b248185a9" title="View details for: EMERALD BAY">EMERALD BAY</a>, <a href="/en/ais/details/ports/767/port_name:NOVOROSSIYSK/_:3525d580eade08cfdb72083b248185a9" title="View details for: NOVOROSSIYSK">NOVOROSSIYSK</a>, <a href="/en/ais/details/ports/767/port_name:NOVOROSSIYSK/_:3525d580eade08cfdb72083b248185a9" title="View details for: NOVOROSSIYSK">NOVOROSSIYSK</a>, <a href="/en/ais/details/ships/shipid:418956/imo:9102746/mmsi:356579000/vessel:MSC%20DON%20GIOVANNI/_:3525d580eade08cfdb72083b248185a9" title="View details for: MSC DON GIOVANNI">MSC DON GIOVANNI</a>, <a href="/en/ais/details/ports/767/port_name:NOVOROSSIYSK/_:3525d580eade08cfdb72083b248185a9" title="View details for: NOVOROSSIYSK">NOVOROSSIYSK</a>, <a href="/en/ais/details/ports/67/port_name:CONSTANTA/_:3525d580eade08cfdb72083b248185a9" title="View details for: CONSTANTA">CONSTANTA</a>, <a href="/en/ais/details/ships/shipid:748395/imo:9460734/mmsi:622121422/vessel:WADI%20SAFAGA/_:3525d580eade08cfdb72083b248185a9" title="View details for: WADI SAFAGA">WADI SAFAGA</a>, <a href="/en/ais/details/ports/767/port_name:NOVOROSSIYSK/_:3525d580eade08cfdb72083b248185a9" title="View details for: NOVOROSSIYSK">NOVOROSSIYSK</a>, <a href="/en/ais/details/ports/997/port_name:DAMIETTA/_:3525d580eade08cfdb72083b248185a9" title="View details for: DAMIETTA">DAMIETTA</a>

我想提取以/en/ais/details/ships/shipid:开头的字符串,例如:

<a href="/en/ais/details/ships/shipid:465271/imo:9495595/mmsi:373571000/vessel:SIDER%20LUCK/_:3525d580eade08cfdb72083b248185a9"

我能够复制这些示例(Find specific link w/ beautifulsoupHow to get Beautiful Soup to get link from href and class?),但我宁愿不使用正则表达式。

到目前为止,我有:

for i in ase: #ase is where the html is sotred
    print(i.get('href')) #prints everysingle href. 

简而言之,我的问题是如何在不使用正则表达式的情况下保留具有我感兴趣的字符串的href

3 个答案:

答案 0 :(得分:3)

尝试以下列表理解:

[h.get('href') for h in ase if 'string' in h.get('href', '')] 

这将为您提供一个列表,其中仅包含包含子字符串'string'的链接。

更新

正如 @PadraicCunningham 在评论中指出的那样'string' in h.get('href')(这是我原来回答的一部分)如果TypeError没有h,则会'href'密钥h - 不太可能,因为<a>将代表.get()标记,但肯定也是非平凡的可能性。为了实现这种可能性,当密钥不存在时,您可以简单地传递''默认参数None而不是Settings

另外,我没有声称我的解决方案是最好的;它可能不是特别有效或优雅。但是,根据我对OP问题的理解,这个解决方案可以工作,很简单,并且易于理解。

答案 1 :(得分:3)

@elethan's answer不是最好的。它会找到所有链接,然后将其过滤掉。为什么我们只是直接获取我们需要的链接而没有额外的过滤 - BeautifulSoup非常有能力:

prefix = "/en/ais/details/ships/shipid"
[a["href"] for a in soup("a", href=lambda x: x and x.startswith(prefix))]

或者,您可以传递function来检查字符串&#34;是否以&#34;开头,而不是regular expression pattern。所需的子字符串:

pattern = re.compile(r"^/en/ais/details/ships/shipid")
[a["href"] for a in soup("a", href=pattern)]

^这里表示字符串的开头。

或者,我们甚至可以使用CSS选择器:

[a["href"] for a in soup.select('a[href^="/en/ais/details/ships/shipid"]')]

^=是&#34;起始&#34;选择器。

答案 2 :(得分:0)

我仍然会建议你使用正则表达式,因为它更简洁,并在列表上保存另一个循环。

import re
find_all('a', href=re.compile("/en/ais/details/ships/shipid:"))

documentation中,你找到了类似的解决方案。