How do I get only the visible text from an HTML node in Python?
Suppose I have a node like this:
<span>
<style>.vAnH{display:none}.vsP6{display:inline}</style>
<span class="vAnH">34</span>
<span />
<span style="display: inline">111</span>
<span style="display:none">120</span>
<span class="vAnH">120</span>
<div style="display:none">120</div>
<span class="78">.</span>
<span class="vAnH">100</span>
<div style="display:none">100</div>
161
<span style="display: inline">.</span>
<span class="174">126</span>
<span class="vAnH">159</span>
<div style="display:none">159</div>
<span />
<span class="vsP6">.</span>
<span style="display:none">5</span>
<span class="vAnH">5</span>
<div style="display:none">5</div>
<span style="display:none">73</span>
<span class="vAnH">73</span>
<div style="display:none">73</div>
<span class="221">98</span>
<span style="display:none">194</span>
<div style="display:none">194</div>
</span>
Is there a third-party library that can do this, or should I parse it manually?
Answer 0 (score: 1)
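There are several ways a node can be shown to or hidden from the end user in a browser. BeautifulSoup is an HTML parser; it does not know whether an element will be displayed. It can still be attempted with BeautifulSoup, though. Such an attempt fails if, for example, an element is hidden by a CSS rule, but it may be enough for your use case.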
The simplest option is to switch to selenium. There, .text returns only the visible text of an element:
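from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get('http://domain.com')
# find_element_by_id was removed in Selenium 4; use find_element(By.ID, ...)
element = driver.find_element(By.ID, 'id_of_an_element')
print(element.text)  # only the text a user would see
driver.quit()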
Answer 1 (score: 1)
If you don't want to go the Selenium way, you can get part of the way there with BeautifulSoup:
from bs4 import BeautifulSoup


def is_visible_span_or_div(tag, is_parent=False):
    """Check whether the element is a span or a div and whether it is visible.

    The parents are checked recursively; returns False if one of them is hidden.
    """
    # load the style attribute, dropping spaces so that both
    # "display: none" and "display:none" are recognized
    style = tag.attrs.get('style', '').replace(' ', '')

    # check that the element is a div or a span, unless it is a parent
    if not is_parent and tag.name not in ('div', 'span'):
        return False

    # check whether the element itself is hidden
    if 'hidden' in style or 'display:none' in style:
        return False

    # recurse to check the parent as well
    parent = tag.parent
    if parent is not None and not is_visible_span_or_div(parent, is_parent=True):
        return False

    # neither the element nor its parent(s) are hidden, so return True
    return True


html = """
<span style="display: none;">I am not visible</span>
<span style="display: inline">I am visible</span>
<div style="display: none;">
<span>I am a visible span inside a hidden div</span>
</div>
"""

# name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(html, 'html.parser')
visible_elements = soup.find_all(is_visible_span_or_div)
print(visible_elements)
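Note that this does not fully reflect how a browser shows or hides elements, because other factors can determine an element's visibility (for example width, height, opacity, absolute positioning outside the window...).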
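To illustrate a few of those other factors, here is a minimal sketch that parses an inline style string into individual declarations instead of relying on substring checks. The helper names `parse_inline_style` and `hidden_by_inline_style` are hypothetical, and the set of properties tested (display, visibility, opacity) is an assumption about what counts as hidden; size and positioning are not covered.

```python
def parse_inline_style(style):
    """Split an inline style string into a {property: value} dict."""
    declarations = {}
    for part in style.split(';'):
        if ':' not in part:
            continue
        name, value = part.split(':', 1)
        # drop "!important" so "none !important" compares equal to "none"
        value = value.lower().replace('!important', '').strip()
        declarations[name.strip().lower()] = value
    return declarations


def hidden_by_inline_style(style):
    """Heuristic: True if the inline style alone hides the element."""
    decl = parse_inline_style(style or '')
    if decl.get('display') == 'none':
        return True
    if decl.get('visibility') in ('hidden', 'collapse'):
        return True
    opacity = decl.get('opacity')
    if opacity is not None:
        try:
            if float(opacity) == 0:
                return True
        except ValueError:
            pass  # percentages, var(...), etc. are ignored in this sketch
    return False


print(hidden_by_inline_style('display: none;'))
print(hidden_by_inline_style('overflow: hidden'))
```

Comparing whole declarations avoids false positives such as `overflow: hidden`, which a plain `'hidden' in style` test would treat as invisible.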
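Nonetheless, this script is fairly robust, because it recursively checks the parents of every element and returns False as soon as it finds a hidden parent.
The only problem I see with this function is its considerable overhead: it has to check all the parents of every element, even when those elements merely sit next to each other in the DOM tree. It can easily be optimized, but probably at the expense of readability.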
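One way to avoid re-checking every ancestor is to walk the document once and carry the hidden state downward. The following is a minimal sketch of that idea, not the answer's own optimization: it swaps BeautifulSoup for the standard library's `html.parser`, assumes well-formed markup, and only looks at inline `display:none`. The names `VisibleTextCollector` and `visible_text` are hypothetical.

```python
from html.parser import HTMLParser

# Elements without a closing tag; they must not be pushed on the stack.
VOID_TAGS = {'area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input',
             'link', 'meta', 'source', 'track', 'wbr'}


class VisibleTextCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        # one flag per open element: is it, or any ancestor, hidden?
        self.hidden_stack = []
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        style = (dict(attrs).get('style') or '').replace(' ', '').lower()
        hidden_here = tag in ('style', 'script') or 'display:none' in style
        inherited = bool(self.hidden_stack) and self.hidden_stack[-1]
        self.hidden_stack.append(inherited or hidden_here)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.hidden_stack:
            self.hidden_stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and not (self.hidden_stack and self.hidden_stack[-1]):
            self.chunks.append(text)


def visible_text(html):
    """Return the stripped text chunks that no inline display:none hides."""
    collector = VisibleTextCollector()
    collector.feed(html)
    collector.close()
    return collector.chunks


html = """
<span style="display: none;">I am not visible</span>
<span style="display: inline">I am visible</span>
<div style="display: none;">
<span>I am a visible span inside a hidden div</span>
</div>
"""
print(visible_text(html))  # ['I am visible']
```

Each element is visited once, and a hidden ancestor is remembered on the stack rather than rediscovered for every descendant.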
Answer 2 (score: 0)
You need to write a custom filter function. A working example:
from bs4 import BeautifulSoup
import re
data = '''<span>
<style>.vAnH{display:none}.vsP6{display:inline}</style>
<span class="vAnH">34</span>
<span />
<span style="display: inline">111</span>
<span style="display:none">120</span>
<span class="vAnH">120</span>
<div style="display:none">120</div>
<span class="78">.</span>
<span class="vAnH">100</span>
<div style="display:none">100</div>
161
<span style="display: inline">.</span>
<span class="174">126</span>
<span class="vAnH">159</span>
<div style="display:none">159</div>
<span />
<span class="vsP6">.</span>
<span style="display:none">5</span>
<span class="vAnH">5</span>
<div style="display:none">5</div>
<span style="display:none">73</span>
<span class="vAnH">73</span>
<div style="display:none">73</div>
<span class="221">98</span>
<span style="display:none">194</span>
<div style="display:none">194</div>
</span>'''
soup = BeautifulSoup(data, 'html.parser')
# name of the class that the <style> block hides
no_disp = re.search(r'\.(.+?)\{display:none\}', soup.style.string).group(1)

def find_visible(tag):
    return (tag.name != 'style'
            and no_disp not in tag.get('class', [])
            and 'display:none' not in tag.get('style', ''))

# string=True keeps only matching tags that contain a single string
for tag in soup.find_all(find_visible, string=True):
    print(tag.string)
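The example above extracts only the first class whose rule is exactly `{display:none}`. If the style block may hide several classes, a sketch along these lines could collect all of them. The helper name `hidden_classes` is hypothetical, and the regular expression assumes simple rules of the form `.name{...}`; it is not a CSS parser.

```python
import re


def hidden_classes(css):
    """Return the set of class names whose rule declares display:none."""
    hidden = set()
    # Matches only simple ".name { ... }" rules; other selectors are ignored.
    for name, body in re.findall(r'\.([\w-]+)\s*\{([^}]*)\}', css):
        if re.search(r'display\s*:\s*none', body):
            hidden.add(name)
    return hidden


css = '.vAnH{display:none}.vsP6{display:inline}'
print(hidden_classes(css))  # {'vAnH'}
```

The resulting set could then stand in for `no_disp`, for instance by testing whether it shares any member with `tag.get('class', [])`.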