Question

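I have a set of HTML files, and I want to extract the first tag in each of them. No particular tag is guaranteed to come first in every file, so I'm not sure how to go about it.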

For example, in the following snippet the first tag is <html>.

<html>
 <head>
    <title>
     insert title here
    </title>
 </head>
</html>

Is there any way to do this with BeautifulSoup (or possibly another tool)? Thanks in advance :)

Answer 1

You can use BeautifulSoup here: call find() on the BeautifulSoup object and it returns the first element in the tree. Its .name attribute gives you the tag name:

from bs4 import BeautifulSoup

data = """
<html>
 <head>
    <title>
     insert title here
    </title>
 </head>
</html>
"""

soup = BeautifulSoup(data, "html.parser")
print(soup.find().name)
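
Since the question is about a whole set of files, here is a minimal sketch of how the same idea could be applied to a directory. The helper names first_tag_name and first_tags_in_dir, the *.html glob pattern, and the UTF-8 encoding are assumptions for illustration, not part of the original answer. It uses find(True), which matches any tag but no text strings, and returns None when a file contains no tags, so that .name is never read from a missing result.

```python
from pathlib import Path

from bs4 import BeautifulSoup


def first_tag_name(html):
    """Return the name of the first tag in an HTML string, or None if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    # find(True) matches any tag; doctype declarations, comments and text are skipped.
    first = soup.find(True)
    return first.name if first is not None else None


def first_tags_in_dir(directory):
    """Map each *.html file name in `directory` to the name of its first tag.

    Assumes the files are UTF-8 encoded (hypothetical; adjust as needed).
    """
    results = {}
    for path in sorted(Path(directory).glob("*.html")):
        results[path.name] = first_tag_name(path.read_text(encoding="utf-8"))
    return results


print(first_tag_name("<!DOCTYPE html>\n<!-- note -->\n<div><p>hi</p></div>"))  # div
print(first_tag_name("plain text, no tags"))  # None
```

Calling first_tags_in_dir("pages") on a hypothetical folder named pages would return a dict such as {"index.html": "html"}, with None as the value for any file that has no tags.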
