Question

I want to walk through my HTML source and extract every tag and every piece of text, but without their children.

For example, take this HTML:

<html>
<head>
<title>title</title>
</head>
<body>
Hello world
</body>
</html>

When I call soup.find_all() or iterate over soup.descendants, what I get back is:

<html><head><title>title</title></head><body>Hello world</body></html>
<head><title>title</title></head>
<title>title</title>
title
<body>Hello world</body>
Hello world

What I want to see is each tag on its own, without its descendants:

<html>
<head>
<title>
title
<body>
Hello world

How can I do this?

Answer 1

The idea is to iterate over all the tags. For the ones that have no child tags, also get the text:

for elm in soup():  # soup() is equivalent to soup.find_all()
    if not elm():  # elm() is equivalent to elm.find_all()
        print(elm.name, elm.get_text(strip=True))
    else:
        print(elm.name)

Prints:

html
head
title title
body Hello world
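
If the goal is the exact listing from the question (each opening tag and each piece of text on its own line), one possible approach is to walk soup.descendants and handle tags and strings separately. The following is only a sketch under a few assumptions: the helper name flatten is made up for illustration, the built-in "html.parser" backend is used, whitespace-only strings are skipped, and tag attributes are dropped because only the tag name is printed.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

html = """<html>
<head>
<title>title</title>
</head>
<body>
Hello world
</body>
</html>"""


def flatten(soup):
    """Return opening tags and non-blank text pieces in document order.

    Hypothetical helper, not part of BeautifulSoup itself.
    """
    items = []
    for node in soup.descendants:
        if isinstance(node, Tag):
            # Emit only the tag name, so none of its children come along.
            items.append(f"<{node.name}>")
        elif isinstance(node, NavigableString):
            # Skip the newline-only strings that sit between tags.
            text = node.strip()
            if text:
                items.append(text)
    return items


soup = BeautifulSoup(html, "html.parser")
for item in flatten(soup):
    print(item)
```

Because .descendants yields both Tag objects and the strings between them, this keeps the text as separate entries instead of attaching it to the parent tag's name as the loop above does.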

How can I get all the tags in HTML code with BeautifulSoup, without their children?
