Find the nearest link with BeautifulSoup (Python)

Asked: 2012-08-02 11:06:06

Tags: python beautifulsoup lxml

I'm working on a small project in which I extract occurrences of political leaders in newspapers. Sometimes a politician is mentioned and neither the parent nor any child element has a link (because of semantically poor markup, I think).

So I want to create a function that can find the nearest link and then extract it. In the case below the search string is Rasmussen, and the link I want is /307046:

#-*- coding: utf-8 -*-

from bs4 import BeautifulSoup
import re

tekst = '''
<li>
  <div class="views-field-field-webrubrik-value">
    <h3>
      <a href="/307046">Claus Hjort spiller med mrkede kort</a>
    </h3>
  </div>
  <div class="views-field-field-skribent-uid">
    <div class="byline">Af: <span class="authors">Dennis Kristensen</span></div>
  </div>
  <div class="views-field-field-webteaser-value">
    <div class="webteaser">Claus Hjort Frederiksens argumenter for at afvise
      trepartsforhandlinger har ikke hold i virkeligheden. Hans ærinde er nok
      snarere at forberede det ideologiske grundlag for en Løkke Rasmussens
      genkomst som statsministe
    </div>
  </div>
  <span class="views-field-view-node">
    <span class="actions">
      <a href="/307046">Ls mere</a>
      |
      <a href="/307046/#comments">Kommentarer (4)</a>
    </span>
  </span>
</li>
'''

to_find = "Rasmussen"
soup = BeautifulSoup(tekst)
contexts = soup.find_all(text=re.compile(to_find)) 

def find_nearest(element, url, direction="both"):
    """Find the nearest link, relative to a text string.
    When complete it will search up and down (parent, child),
    and only X levels up down. These features are not implemented yet.
    Will then return the link the fewest steps away from the
    original element. Assumes we have already found an element"""

    # Is the nearest link readily available?
    # If so - this works and extracts the link.
    if element.find_parents('a'):
        for artikel_link in element.find_parents('a'):
            link = artikel_link.get('href')
            # sometimes the link is a relative link - sometimes it is not
            if ("http" or "www") not in link:
                link = url+link
                return link
    # But if the link is not readily available, we will go up
    # This is (I think) where it goes wrong
    # ↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓
    if not element.find_parents('a'):
        element =  element.parent
        # Print for debugging
        print element  # on the 2nd run (i.e. <li>) this finds <a href=/307046>
        # So shouldn't it be caught as readily available above?
        print u"Found: %s" % element.name
        # the recursive call
        find_nearest(element,url)

# run it
if contexts:
    for a in contexts:
        find_nearest( element=a, url="http://information.dk")

The following direct call works:

print contexts[0].parent.parent.parent.a['href'].encode('utf-8')

For reference, the whole sorry script is on Bitbucket: https://bitbucket.org/achristoffersen/politikere-i-medierne

(P.S. I'm using BeautifulSoup 4.)


EDIT: SimonSapin asked me to define nearest: By nearest I mean the link the fewest nesting levels away from the search term, in either direction. In the text above, the a href generated by the Drupal-based newspaper site is neither a direct parent nor a direct child of the tag where the search string is found, so BeautifulSoup cannot find it.

I suspect "fewest characters away" would also work in most cases. If so, it could be hacked together with find and rfind - but I would really like to do this through BS. Since this works: contexts[0].parent.parent.parent.a['href'].encode('utf-8'), it must be possible to generalize it in a script.

EDIT: Perhaps I should emphasize that I am looking for a BeautifulSoup solution. Combining BS with a custom/simple breadth-first search, as @erik85 suggests, would quickly get messy.
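For what it's worth, the working direct call above can be generalized with a plain parent walk: climb through .parents and, at each level, look for the first <a href> among that parent's descendants. A minimal sketch under Python 3 (nearest_link is a made-up name, and the HTML is a trimmed stand-in for the snippet above):

```python
import re
from bs4 import BeautifulSoup

html = '''<li>
  <h3><a href="/307046">Claus Hjort spiller med maerkede kort</a></h3>
  <div class="webteaser">... en Loekke Rasmussens genkomst ...</div>
</li>'''

def nearest_link(element):
    """Walk up through .parents; at each level, return the href of the
    first <a href> found among that parent's descendants.  The first
    hit is the fewest nesting levels away from the starting text node."""
    for parent in element.parents:
        link = parent.find('a', href=True)
        if link is not None:
            return link['href']
    return None

soup = BeautifulSoup(html, 'html.parser')
for hit in soup.find_all(string=re.compile('Rasmussen')):
    print(nearest_link(hit))  # -> /307046
```

Note that when several links share the winning nesting level, this returns the first one in document order, not the one closest sideways.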

2 Answers:

Answer 0 (score: 12):

Someone will probably come up with a working copy-and-paste solution, and you will think it solves your problem. But your problem is not the code - it is your strategy. There is a software design principle called "divide and conquer" that you should apply when redesigning your code: separate the code that interprets the HTML/string as a tree/graph from the code that searches for the nearest node (probably a breadth-first search). Not only will you learn to design better software - your problem will probably simply cease to exist.

I think you are smart enough to solve this yourself, but I also want to provide a skeleton:

def parse_html(txt):
    """ reads a string of html and returns a dict/list/tuple presentation"""
    pass

def breadth_first_search(graph, start, end):
    """ finds the shortest way from start to end
    You can probably customize start and end to work well with the input you want
    to provide. For implementation details see the link in the text above.
    """
    pass

def find_nearest_link(html,name):
    """putting it all together"""
    return breadth_first_search(parse_html(html),name,"link")
PS: Doing it this way also applies another principle, this time from mathematics: assume there is a problem you don't know a solution to (finding a link close to a chosen substring) and a group of problems you do know solutions to (graph traversal); then try to transform your problem to match the group of problems you can solve. That way you can reuse basic solution patterns (quite possibly already implemented in your language/framework of choice), and you are done.
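To make the skeleton concrete, here is a minimal breadth-first search over a plain adjacency mapping. The graph below is made up to mirror the nesting in the question (each element connected to its parent and its children); turning a real parse tree into such a graph is the part left as the exercise:

```python
from collections import deque

def breadth_first_search(graph, start, is_goal):
    """Explore neighbours level by level from `start`; the first node
    satisfying `is_goal` is the one fewest edges away."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if is_goal(node):
            return node
        for neighbour in graph.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return None

# A made-up undirected graph mirroring the question's HTML:
# each element is linked to its parent and its children.
graph = {
    'li':                  ['h3', 'webteaser', 'actions'],
    'h3':                  ['li', 'a:/307046'],
    'a:/307046':           ['h3'],
    'webteaser':           ['li', 'text:Rasmussen'],
    'text:Rasmussen':      ['webteaser'],
    'actions':             ['li', 'a:/307046/#comments'],
    'a:/307046/#comments': ['actions'],
}

print(breadth_first_search(graph, 'text:Rasmussen',
                           lambda n: n.startswith('a:')))  # -> a:/307046
```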

Answer 1 (score: 2):

Here is a solution using lxml. The main idea is to find all preceding and following elements and then iterate through them in roundrobin fashion:

def find_nearest(elt):
    preceding = elt.xpath('preceding::*/@href')[::-1]
    following = elt.xpath('following::*/@href')
    parent = elt.xpath('parent::*/@href')
    for href in roundrobin(parent, preceding, following):
        return href

A similar solution using BeautifulSoup's (or bs4's) next_elements and previous_elements should also be possible.
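That bs4 variant might look roughly like this (a sketch under Python 3; find_nearest_href and interleave are made-up names, with itertools.zip_longest standing in for the roundrobin recipe used below):

```python
import itertools
import re
from bs4 import BeautifulSoup

def interleave(*iterables):
    """Yield one item from each iterable in turn, skipping exhausted ones."""
    sentinel = object()
    for group in itertools.zip_longest(*iterables, fillvalue=sentinel):
        for item in group:
            if item is not sentinel:
                yield item

def find_nearest_href(soup, pattern):
    """Starting from the first text node matching `pattern`, fan out
    through previous_elements and next_elements in alternation and
    return the href of the first <a> encountered."""
    node = soup.find(string=re.compile(pattern))
    if node is None:
        return None
    for elt in interleave(node.previous_elements, node.next_elements):
        if getattr(elt, 'name', None) == 'a' and elt.get('href'):
            return elt['href']
    return None
```

As with the lxml version, "nearest" here means fewest steps in document order, which is not quite the same thing as fewest nesting levels.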


import lxml.html as LH
import itertools

def find_nearest(elt):
    preceding = elt.xpath('preceding::*/@href')[::-1]
    following = elt.xpath('following::*/@href')
    parent = elt.xpath('parent::*/@href')
    for href in roundrobin(parent, preceding, following):
        return href

def roundrobin(*iterables):
    "roundrobin('ABC', 'D', 'EF') --> A D E B F C"
    # http://docs.python.org/library/itertools.html#recipes
    # Author: George Sakkis
    pending = len(iterables)
    nexts = itertools.cycle(iter(it).next for it in iterables)
    while pending:
        try:
            for n in nexts:
                yield n()
        except StopIteration:
            pending -= 1
            nexts = itertools.cycle(itertools.islice(nexts, pending))

tekst = '''
<li>
  <div class="views-field-field-webrubrik-value">
    <h3>
      <a href="/307046">Claus Hjort spiller med mrkede kort</a>
    </h3>
  </div>
  <div class="views-field-field-skribent-uid">
    <div class="byline">Af: <span class="authors">Dennis Kristensen</span></div>
  </div>
  <div class="views-field-field-webteaser-value">
    <div class="webteaser">Claus Hjort Frederiksens argumenter for at afvise
      trepartsforhandlinger har ikke hold i virkeligheden. Hans ærinde er nok
      snarere at forberede det ideologiske grundlag for en Løkke Rasmussens
      genkomst som statsministe
    </div>
  </div>
  <span class="views-field-view-node">
    <span class="actions">
      <a href="/307046">Ls mere</a>
      |
      <a href="/307046/#comments">Kommentarer (4)</a>
    </span>
  </span>
</li>
'''

to_find = "Rasmussen"
doc = LH.fromstring(tekst)

for x in doc.xpath('//*[contains(text(),{s!r})]'.format(s = to_find)):
    print(find_nearest(x))

This yields:

/307046
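One caveat for anyone running this today: the roundrobin recipe above is written for Python 2 (iter(it).next). A Python 3 equivalent, assuming nothing beyond the standard library:

```python
import itertools

def roundrobin(*iterables):
    "roundrobin('ABC', 'D', 'EF') --> A D E B F C"
    # Python 3 port of the itertools recipe used above:
    # iter(it).next became iter(it).__next__.
    pending = len(iterables)
    nexts = itertools.cycle(iter(it).__next__ for it in iterables)
    while pending:
        try:
            for nxt in nexts:
                yield nxt()
        except StopIteration:
            pending -= 1
            nexts = itertools.cycle(itertools.islice(nexts, pending))

print(''.join(roundrobin('ABC', 'D', 'EF')))  # -> ADEBFC
```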