Question

I am trying to extract the h1 (or any other heading) tags from an HTML file.

My Python code is as follows:

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('https://www.le.ac.uk/oerresources/bdra/html/page_09.htm')
# print(html.read())

# using beautifulsoup
bs = BeautifulSoup(html, 'html.parser')
h2 = bs.find('h2', {'id' : 'toc'})
print(bs.find_all(["h1", "h2", "h3", "h4", "h5", "h6"]))
print(h2)

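As you can see from the snippet above, I tried to extract all of the headings, but all I get is an empty list and None. I checked the headings in the HTML file and verified that they exist. I also tried passing a dictionary like this: h2 = bs.find('h2', {'class' : 'toc'})

Can someone tell me what I am doing wrong here?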

Answer 1

When I run the code, I get the following output:

[<h1>Introduction to HTML/XHTML</h1>, <h2><a href="index.htm" id="toc-title">Table of Contents</a></h2>, <h2>Example HTML Document</h2>]

The code I used:

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('https://www.le.ac.uk/oerresources/bdra/html/page_09.htm').read().decode("utf-8")
# using beautifulsoup
bs = BeautifulSoup(html, 'html.parser')
print(bs.find_all(["h1", "h2", "h3", "h4", "h5", "h6"]))

urlopen gives you an http.client.HTTPResponse object; you need to read it and then decode the bytes as UTF-8.
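The read-then-decode step above hardcodes UTF-8. A minimal sketch of a slightly more defensive variant follows, assuming the server may declare a different charset in its Content-Type header. The helper names decode_body and fetch_html are made up for illustration and are not part of urllib or BeautifulSoup.

```python
from urllib.request import urlopen


def decode_body(raw, charset=None):
    """Decode response bytes, falling back to UTF-8 when no charset is given."""
    # errors="replace" keeps going on bad bytes instead of raising.
    return raw.decode(charset or "utf-8", errors="replace")


def fetch_html(url):
    """Download a page and return its body as text (hypothetical helper)."""
    with urlopen(url) as response:
        # response.headers is an email.message.Message subclass;
        # get_content_charset() returns None if no charset is declared.
        charset = response.headers.get_content_charset()
        return decode_body(response.read(), charset)
```

The string returned by fetch_html(url) can then be passed to BeautifulSoup(text, 'html.parser') in the same way as in the answer's code.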

This question may be a duplicate of: BeautifulSoup HTTPResponse has no attribute encode
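A side note on the None result from bs.find('h2', {'id' : 'toc'}): judging only from the output quoted in this answer, no h2 carries an id of "toc"; the only id visible there is "toc-title", and it sits on the a tag inside the heading. The sketch below uses a small inline sample rebuilt from that output (an assumption, since the live page contains more markup) to show the lookup failing and one way to reach the heading through the link.

```python
from bs4 import BeautifulSoup

# Sample markup reconstructed from the output quoted above (an assumption;
# the real page has more content than this).
SAMPLE_HTML = """
<h1>Introduction to HTML/XHTML</h1>
<h2><a href="index.htm" id="toc-title">Table of Contents</a></h2>
<h2>Example HTML Document</h2>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# No <h2> in the sample has id="toc", so find() returns None.
missing = soup.find("h2", {"id": "toc"})

# The id is on the <a> inside the heading: find the link, then take its parent.
link = soup.find("a", id="toc-title")
toc_heading = link.parent if link is not None else None

# get_text() gives the heading text without the surrounding tags.
heading_tags = ["h1", "h2", "h3", "h4", "h5", "h6"]
heading_texts = [h.get_text(strip=True) for h in soup.find_all(heading_tags)]

print(missing)
print(toc_heading)
print(heading_texts)
```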

How do I extract h1 tags from an HTML file using BeautifulSoup?