Question

Suppose you have a web page:

<html>
<head>
<meta name="description" content="Hello World Test">
</head>
<body>
<h1>Hello World!!!</h1>
<p>How are you today?</p>
<p>What have you been up to?</p>
</body>
</html>

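Is there a way to loop through the nodes on the page and, if a node contains text, extract that text?

I would then like to organize the text by XPath.

So for the page above, the result would be:

/html/body/h1: Hello World!!!

/html/body/p[1]: How are you today?

/html/body/p[2]: What have you been up to?

Many thanks.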

Answer 1

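You can use the XPath support in the lxml library to iterate over all HTML nodes and, whenever the current node contains any text, print its path together with its content: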

from lxml import html

tree = html.fromstring("""
<html>
 <head>
  <meta content="Hello World Test" name="description"/>
 </head>
 <body>
  <h1>Hello World!!!</h1>
  <p>How are you today?</p>
  <p>What have you been up to?</p>
 </body>
</html>
""")

for node in tree.iter():
    if node.text and node.text.strip():
        print(node.getroottree().getpath(node), node.text)

This prints:

/html/body/h1 Hello World!!!
/html/body/p[1] How are you today?
/html/body/p[2] What have you been up to?
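To show what a getpath()-style path consists of, here is a minimal sketch that builds the same paths by hand with the standard library's xml.etree.ElementTree instead of lxml. The function name iter_text_with_paths is made up for this illustration, and the sketch assumes the markup is well-formed XML (as in the snippet above, where the meta tag is self-closed); real-world HTML usually is not, which is where lxml's HTML parser helps. It also emits the "path: text" format asked for in the question.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def iter_text_with_paths(elem, path=None):
    """Yield (xpath, text) for every element whose own text is non-blank.

    Hypothetical helper for illustration; not part of lxml or ElementTree.
    """
    if path is None:
        path = "/" + elem.tag
    if elem.text and elem.text.strip():
        yield path, elem.text.strip()
    # Count same-tag siblings so an index is added only when it is needed,
    # which gives /html/body/h1 but /html/body/p[1] and /html/body/p[2].
    totals = Counter(child.tag for child in elem)
    seen = Counter()
    for child in elem:
        seen[child.tag] += 1
        step = child.tag
        if totals[child.tag] > 1:
            step += "[%d]" % seen[child.tag]
        yield from iter_text_with_paths(child, path + "/" + step)


# Assumption: the input is well-formed XML, so ElementTree can parse it.
doc = ET.fromstring("""<html>
 <head>
  <meta content="Hello World Test" name="description"/>
 </head>
 <body>
  <h1>Hello World!!!</h1>
  <p>How are you today?</p>
  <p>What have you been up to?</p>
 </body>
</html>""")

results = list(iter_text_with_paths(doc))
for xpath, text in results:
    print(xpath + ": " + text)
```

Like the lxml loop, this only looks at each element's own leading text (elem.text), so text that follows a child element (the child's tail) is not reported.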

Answer 2

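If you are using Selenium, here is a solution. Note that it only visits the direct children of body and that the paths it prints carry no positional index, so both paragraphs appear as html/body/p.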

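from selenium.webdriver.common.by import By

# Assumes `driver` is an already-initialized Selenium WebDriver.
# find_elements(By.XPATH, ...) replaces find_elements_by_xpath, which
# recent Selenium releases no longer provide.
nodes = driver.find_elements(By.XPATH, "//body/*")
for node in nodes:
    nodepath = ''
    nodeText = node.text
    # Walk up through the ancestors until the root <html> element is reached.
    while node.tag_name != 'html':
        nodepath = node.tag_name + "/" + nodepath
        node = node.find_element(By.XPATH, "./..")
    # Drop the trailing slash before printing "path:text".
    print('html/' + nodepath[0:-1] + ":" + nodeText)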

How do I extract content and its parent HTML elements from a web page?