Question

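I need to use Python to extract the text of a parent tag while ignoring whatever child tags it contains. From the markup below, I want to get "Hi, this is parent tag" without getting "Hi, this is child tag". How can I do this?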

<html>
    <div>
        "Hi, this is parent tag"
        <span> "Hi, this is child tag" </span>
    </div>
</html>

Answer 1

from bs4 import BeautifulSoup

txt = """
<html>
    <div>
        "Hi, this is parent tag"
        <span> "Hi, this is child tag" </span>
    </div>
</html>
"""

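# Name the parser explicitly; omitting it makes bs4 emit a warning
soup = BeautifulSoup(txt, "html.parser")

for node in soup.find_all('div'):
    # recursive=False keeps only the text nodes that are direct children of <div>
    parts = node.find_all(string=True, recursive=False)
    # Drop whitespace-only nodes and trim the indentation around the text
    print(' '.join(s.strip() for s in parts if s.strip()))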

Output:

"Hi, this is parent tag"

Answer 2

You can use the lxml package with XPath syntax:

txt = """
<html>
    <div>
        "Hi, this is parent tag"
        <span> "Hi, this is child tag" </span>
    </div>
</html>
"""

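# soupparser needs BeautifulSoup installed alongside lxml
from lxml.html.soupparser import fromstring

tree = fromstring(txt)
# text() selects only the text nodes that are direct children of <div>
print(tree.xpath("//div/text()"))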

A good XPath cheat sheet: https://devhints.io/xpath
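
If neither bs4 nor lxml is available, the same idea can be sketched with only the standard library. This is not taken from either answer above: it swaps in `html.parser.HTMLParser` and tracks the open tags by hand. The names `DirectTextParser`, `direct_text`, and `VOID_TAGS` are made up for this illustration, and the void-tag list is deliberately partial, so treat it as a rough sketch rather than a robust HTML handler.

```python
from html.parser import HTMLParser

# Tags that never get a closing tag (partial list, assumed sufficient for a sketch)
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class DirectTextParser(HTMLParser):
    """Collect text that sits directly inside a target tag, skipping nested tags."""

    def __init__(self, target):
        super().__init__()
        self.target = target  # tag name whose own text we want
        self.stack = []       # currently open tags, innermost last
        self.chunks = []      # collected direct text pieces

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed inner tags
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        # Keep text only when the innermost open tag is the target itself
        text = data.strip()
        if text and self.stack and self.stack[-1] == self.target:
            self.chunks.append(text)


def direct_text(html, target="div"):
    """Return the stripped text nodes that are direct children of `target` tags."""
    parser = DirectTextParser(target)
    parser.feed(html)
    parser.close()
    return parser.chunks


txt = """
<html>
    <div>
        "Hi, this is parent tag"
        <span> "Hi, this is child tag" </span>
    </div>
</html>
"""

print(' '.join(direct_text(txt)))
```

The stack is what separates parent text from child text: while `<span>` is open it is the innermost tag, so its text is skipped, and once it closes `<div>` becomes innermost again.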

How to get data from a parent tag using Python
