How to get only the visible text from an HTML node in Python

Date: 2014-12-26 15:23:37

Tags: python html html-parsing

How can I get only the visible text from an HTML node in Python?

Suppose I have a node like this:

<span>
   <style>.vAnH{display:none}.vsP6{display:inline}</style>
   <span class="vAnH">34</span>
   <span />
   <span style="display: inline">111</span>
   <span style="display:none">120</span>
   <span class="vAnH">120</span>
   <div style="display:none">120</div>
   <span class="78">.</span>
   <span class="vAnH">100</span>
   <div style="display:none">100</div>
   161
   <span style="display: inline">.</span>
   <span class="174">126</span>
   <span class="vAnH">159</span>
   <div style="display:none">159</div>
   <span />
   <span class="vsP6">.</span>
   <span style="display:none">5</span>
   <span class="vAnH">5</span>
   <div style="display:none">5</div>
   <span style="display:none">73</span>
   <span class="vAnH">73</span>
   <div style="display:none">73</div>
   <span class="221">98</span>
   <span style="display:none">194</span>
   <div style="display:none">194</div>
</span>

Is there a third-party library that can do this, or should I parse it by hand?

3 answers:

Answer 0 (score: 1)

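There are many ways a node can be shown to or hidden from the end user in a browser. BeautifulSoup is an HTML parser; it has no idea whether an element would actually be displayed. That said, attempts to approximate this with BeautifulSoup exist. Such an approach would not work if, for example, an element is hidden by a CSS rule, but it may be good enough for your use case.

The easiest option is to switch to selenium. There, .text returns only the visible text of an element: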

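from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get('http://domain.com')

# find_element_by_id was removed in Selenium 4; use find_element(By.ID, ...)
element = driver.find_element(By.ID, 'id_of_an_element')
print(element.text)  # only the text a user would actually see

driver.quit()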

Answer 1 (score: 1)

If you don't want to go the Selenium route, you can get reasonably far with BeautifulSoup:

from bs4 import BeautifulSoup

def is_visible_span_or_div(tag, is_parent=False):
    """ This function checks if the element is a span or a div,
    and if it is visible. If so, it recursively checks all the parents
    and returns False if one of them is hidden """

    # loads the style attribute of the element
    style = tag.attrs.get('style', False)

    # checks if element is div or span, if it's not a parent
    if not is_parent and tag.name not in ('div', 'span'):
        return False

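    # checks if the element is hidden; spaces are stripped so that both
    # 'display: none' and 'display:none' are recognised
    if style and ('hidden' in style or 'display:none' in style.replace(' ', '')):
        return False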

    # makes a recursive call to check the parent as well
    parent = tag.parent
    if parent and not is_visible_span_or_div(parent, is_parent=True):
        return False

    # neither the element nor its parent(s) are hidden, so return True
    return True

html = """
    <span style="display: none;">I am not visible</span>
    <span style="display: inline">I am visible</span>
    <div style="display: none;">
        <span>I am a visible span inside a hidden div</span>
    </div>
"""

# name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(html, 'html.parser')

visible_elements = soup.find_all(is_visible_span_or_div)

print(visible_elements)

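Note that this will not exactly mirror how a browser shows or hides elements, because other factors can determine an element's visibility (for example width, height, opacity, or absolute positioning outside the window).

Still, for inline styles the script is fairly reliable, since it recursively checks every element's parents and returns False as soon as it finds a hidden one.

The only issue I see with this function is that it carries considerable overhead: it has to check all the parents of every element, even for elements that merely sit off to one side of the DOM tree. It could easily be optimized, though probably at the cost of readability.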
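One possible shape for that optimization is a single top-down pass that remembers whether it is currently inside a hidden subtree, so no element ever re-checks its parents. The sketch below is an illustration, not part of the answer above: it uses the standard library's html.parser instead of BeautifulSoup to stay dependency-free, and the names VisibleTextParser, style_hides and visible_text are made up for this example. Like the function above, it only looks at inline styles, not at CSS rules.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


def style_hides(style):
    """Return True if an inline style string hides the element."""
    compact = (style or "").replace(" ", "").lower()
    return "display:none" in compact or "visibility:hidden" in compact


class VisibleTextParser(HTMLParser):
    """Collects text that is not inside an element hidden by an inline style."""

    def __init__(self):
        super().__init__()
        self.stack = []        # (tag name, hides its subtree?) per open element
        self.hidden_depth = 0  # how many open ancestors hide their subtree
        self.chunks = []       # visible text fragments, in document order

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        hides = tag in ("style", "script") or style_hides(dict(attrs).get("style"))
        self.stack.append((tag, hides))
        self.hidden_depth += int(hides)

    def handle_endtag(self, tag):
        # Ignore stray closing tags that match nothing currently open.
        if tag not in [name for name, _ in self.stack]:
            return
        # Pop until the matching element is closed (tolerates unclosed children).
        while self.stack:
            name, hides = self.stack.pop()
            self.hidden_depth -= int(hides)
            if name == tag:
                break

    def handle_data(self, data):
        # Keep text only when no open ancestor is hidden.
        if self.hidden_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return parser.chunks


sample = """
    <span style="display: none;">I am not visible</span>
    <span style="display: inline">I am visible</span>
    <div style="display: none;">
        <span>I am a visible span inside a hidden div</span>
    </div>
"""

print(visible_text(sample))  # ['I am visible']
```

Each tag is handled once on open and once on close, so the cost grows with the size of the document rather than with its depth times its size. The trade-off is that it returns text fragments rather than tag objects.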

Answer 2 (score: 0)

You need to write a custom filter function. A working example:

from bs4 import BeautifulSoup
import re

data = '''<span>
   <style>.vAnH{display:none}.vsP6{display:inline}</style>
   <span class="vAnH">34</span>
   <span />
   <span style="display: inline">111</span>
   <span style="display:none">120</span>
   <span class="vAnH">120</span>
   <div style="display:none">120</div>
   <span class="78">.</span>
   <span class="vAnH">100</span>
   <div style="display:none">100</div>
   161
   <span style="display: inline">.</span>
   <span class="174">126</span>
   <span class="vAnH">159</span>
   <div style="display:none">159</div>
   <span />
   <span class="vsP6">.</span>
   <span style="display:none">5</span>
   <span class="vAnH">5</span>
   <div style="display:none">5</div>
   <span style="display:none">73</span>
   <span class="vAnH">73</span>
   <div style="display:none">73</div>
   <span class="221">98</span>
   <span style="display:none">194</span>
   <div style="display:none">194</div>
</span>'''

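soup = BeautifulSoup(data, 'html.parser')
# name of the class that the <style> block hides via display:none
no_disp = re.search(r'\.(.+?)\{display:none\}', soup.style.string).group(1)

def find_visible(tag):
    # skip the <style> tag, tags carrying the hidden class, and inline display:none
    return (tag.name != 'style'
            and no_disp not in tag.get('class', '')
            and 'display:none' not in tag.get('style', ''))

# string=True keeps only tags that wrap a single text node, so bare text
# sitting directly inside the outer span (the 161 above) is not printed
for tag in soup.find_all(find_visible, string=True):
    print(tag.string)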
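Because the filter above matches tags rather than text, the bare 161 in the question's node is skipped. One way around that, sketched here under a few assumptions, is to iterate over the text nodes themselves and drop any whose ancestors are hidden. The helper names hidden_classes, is_hidden and visible_text are invented for this example, the regular expression only understands simple rules of the form .name{display:none}, and the sample is an abbreviated copy of the question's node.

```python
import re
from bs4 import BeautifulSoup, Comment, NavigableString

# Abbreviated version of the node from the question.
data = """<span>
   <style>.vAnH{display:none}.vsP6{display:inline}</style>
   <span class="vAnH">34</span>
   <span style="display: inline">111</span>
   <span style="display:none">120</span>
   <span class="78">.</span>
   161
   <span style="display: inline">.</span>
   <span class="174">126</span>
   <div style="display:none">159</div>
   <span class="vsP6">.</span>
   <span class="221">98</span>
</span>"""


def hidden_classes(css):
    """Class names with a simple `.name{display:none}` rule (assumption: no
    combined selectors, media queries, or other declarations in the rule)."""
    return set(re.findall(r"\.([\w-]+)\s*\{\s*display\s*:\s*none\s*;?\s*\}", css))


def is_hidden(tag, hidden):
    """True if this tag on its own hides its content."""
    if tag.name in ("style", "script"):
        return True
    if hidden.intersection(tag.get("class", [])):
        return True
    style = tag.get("style", "").replace(" ", "").lower()
    return "display:none" in style


def visible_text(root, hidden):
    parts = []
    for node in root.descendants:
        # Only real text nodes count; skip tags, comments and pure whitespace.
        if not isinstance(node, NavigableString) or isinstance(node, Comment):
            continue
        if not node.strip():
            continue
        # A text node is visible only if none of its ancestors is hidden.
        if any(is_hidden(tag, hidden) for tag in node.parents):
            continue
        parts.append(node.strip())
    return "".join(parts)


soup = BeautifulSoup(data, "html.parser")
hidden = hidden_classes(soup.style.string)
print(visible_text(soup, hidden))  # 111.161.126.98
```

This still shares the limits noted in the other answers: it knows nothing about external stylesheets, selector specificity, or visibility changed by scripts, so a real browser (for example through Selenium) remains the more faithful option.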