List of all element names in an HTML document - beautifulsoup

Time: 2016-07-08 16:21:22

Tags: python web-scraping beautifulsoup

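I want to get a list of all the distinct tag names in an HTML document (a list of tag-name strings with no duplicates). I tried calling soup.find_all() with an empty argument, but that gave me the entire document.

Is there a way to do this?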

1 answer:

Answer 0: (score: 5)

soup.find_all() returns a list of every element in the document, and you can iterate over that list. So you can do the following:

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""  # an html sample
soup = BeautifulSoup(html_doc, 'html.parser')

document = soup.html.find_all()

el = ['html',]  # we already include the html tag
for n in document:
    if n.name not in el:
        el.append(n.name)

print(el)


The output of the snippet is:

>>> ['html', 'head', 'title', 'body', 'p', 'b', 'a']
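
The loop above preserves first-seen order. As a minimal sketch of the same idea (not part of the original answer), the names can also be deduplicated with dict.fromkeys, which keeps insertion order on Python 3.7+. The helper name unique_tag_names and the shortened sample markup are assumptions made for illustration. Calling find_all(True) on the soup itself matches every tag, so html does not need to be added by hand.

```python
from bs4 import BeautifulSoup


def unique_tag_names(markup):
    """Return the distinct tag names in `markup`, in first-seen order.

    Hypothetical helper for illustration; not from the original answer.
    """
    soup = BeautifulSoup(markup, "html.parser")
    # find_all(True) matches every tag, including the root <html> tag.
    # dict.fromkeys drops duplicates while keeping insertion order (Python 3.7+).
    return list(dict.fromkeys(tag.name for tag in soup.find_all(True)))


# Shortened sample markup, assumed for this sketch.
sample = (
    "<html><head><title>t</title></head>"
    "<body><p><b>x</b></p><p><a href='#'>y</a></p></body></html>"
)
names = unique_tag_names(sample)
print(names)
```

For large documents this also avoids the repeated `not in` list scan, since the dictionary lookup is constant time on average.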


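Edit

As @PM 2Ring pointed out, if you don't care about the order in which the elements are added (and, as he says, I don't think order matters here), you can use a set. The built-in set type needs no import in Python 3.x, but if you are on an older version you may want to check that the set comprehension syntax is supported (it was added in Python 2.7).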

from bs4 import BeautifulSoup

...

el = {x.name for x in document} # use a set comprehension to generate it easily
el.add("html")  # only if you need to
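
Since the snippet above elides the setup, here is a self-contained sketch of the set approach, assuming a small made-up sample document. It calls find_all(True) on the soup itself, so the html tag is already part of the result, and it sorts the set only to get a stable order for printing.

```python
from bs4 import BeautifulSoup

# Made-up sample markup, assumed for this sketch.
sample = "<html><body><p>one</p><p>two <a href='#'>link</a></p></body></html>"
soup = BeautifulSoup(sample, "html.parser")

# A set comprehension collects each tag name once; sets are unordered.
tag_names = {tag.name for tag in soup.find_all(True)}

# Sort only when a reproducible order is needed, e.g. for display.
sorted_names = sorted(tag_names)
print(sorted_names)
```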