Question

Example:

Sometimes the HTML is:

<div id="1">
    <div id="2">
        this is the text i do NOT want
    </div>
    this is the text i want here
</div>

Other times it is just:

<div id="1">
    this is the text i want here
</div>

I want to get only the text that sits directly inside a tag and ignore the text of any child tags. If I use the .text attribute, I get both.

Answer 1

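Updated to use a more generic approach (see the edit history for the original answer):

You can extract the text children of the outer div by testing whether each child is an instance of NavigableString.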

from bs4 import BeautifulSoup, NavigableString

html = '''<div id="1">
    <div id="2">
        this is the text i do NOT want
    </div>
    this is the text i want here
</div>'''

soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly
outer = soup.div
inner_text = [element for element in outer if isinstance(element, NavigableString)]

This produces a list of the strings contained directly in the outer div element.

>>> inner_text
['\n    ', '\n    this is the text i want here\n']
>>> ''.join(inner_text)
'\n    \n    this is the text i want here\n'

For your second example:

html = '''<div id="1">
    this is the text i want here
</div>'''
soup2 = BeautifulSoup(html, 'html.parser')
outer = soup2.div
inner_text = [element for element in outer if isinstance(element, NavigableString)]

>>> inner_text
['\n    this is the text i want here\n']

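This also works in other cases, such as when the outer div's text appears before any child tag, between child tags, as several separate text nodes, or not at all.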
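The cases listed above can be checked with a small sketch. The helper name `direct_text` and the sample markup are made up for illustration, and `html.parser` is assumed as the parser; this is one way to wrap the NavigableString filter, not the only one.

```python
from bs4 import BeautifulSoup, NavigableString


def direct_text(tag):
    """Join the text nodes that are direct children of `tag`, then strip."""
    return ''.join(
        child for child in tag.children if isinstance(child, NavigableString)
    ).strip()


# Hypothetical markup covering the cases mentioned in the answer.
cases = {
    'before': '<div>want<span>skip</span></div>',       # text before a child tag
    'between': '<div><b>skip</b>want<i>skip</i></div>',  # text between child tags
    'multiple': '<div>one <b>skip</b>two</div>',         # several text nodes
    'none': '<div><b>skip</b></div>',                    # no direct text at all
}

results = {
    name: direct_text(BeautifulSoup(markup, 'html.parser').div)
    for name, markup in cases.items()
}

for name, text in results.items():
    print(name, repr(text))
```

Note that the text nodes are joined without a separator, so the "multiple" case yields 'one two' only because the first node already ends with a space.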

Answer 2

Another possible approach (I would put it in a function):

def getText(parent):
    return ''.join(parent.find_all(text=True, recursive=False)).strip()

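recursive=False means you only want direct children, not nested ones. text=True means you only want text nodes (since Beautiful Soup 4.4.0 this argument is also accepted under the name string).

Usage example: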

from bs4 import BeautifulSoup

html = """<div id="1">
    <div id="2">
        this is the text i do NOT want
    </div>
    this is the text i want here
</div>
"""
soup = BeautifulSoup(html, 'html.parser')
print(getText(soup.div))
# this is the text i want here
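As a rough sketch of the same idea on both inputs from the question: the version below uses the `string=True` spelling, which assumes Beautiful Soup 4.4.0 or later, and the name `get_direct_text` is a hypothetical stand-in for the function above.

```python
from bs4 import BeautifulSoup


def get_direct_text(parent):
    """Return the stripped text of parent's direct text nodes only."""
    # string=True matches text nodes; recursive=False limits the search
    # to direct children, so text inside nested tags is skipped.
    return ''.join(parent.find_all(string=True, recursive=False)).strip()


nested = '''<div id="1">
    <div id="2">
        this is the text i do NOT want
    </div>
    this is the text i want here
</div>'''

flat = '''<div id="1">
    this is the text i want here
</div>'''

nested_text = get_direct_text(BeautifulSoup(nested, 'html.parser').div)
flat_text = get_direct_text(BeautifulSoup(flat, 'html.parser').div)

print(nested_text)
print(flat_text)
```

Both calls should give the same string, since the inner div's text is never a direct child of the outer div.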

Get the text of an HTML tag without the text of its inner child tags
