Question

In Python 3, how would you grab the string between heading tags, for example printing Hello, world! out of <h1>Hello, world!</h1>:

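from urllib.request import urlopen

# Example URL that includes an <h1> tag: http://www.hobo-web.co.uk/headers/
userAddress = input("Enter a website URL: ")

webPage = urlopen(userAddress)

lines = []  # renamed from `list` to avoid shadowing the built-in

# Read the response line by line; extracting the heading text is the open problem
for rawLine in webPage:
    lines.append(rawLine.decode("utf-8", errors="replace"))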

Answer 1

You need an HTML parser, for example BeautifulSoup:

from bs4 import BeautifulSoup

soup = BeautifulSoup(webPage, "html.parser")  # name the parser explicitly to avoid a warning
print(soup.find("h1").get_text(strip=True))

Demo:

>>> from urllib.request import urlopen
>>> from bs4 import BeautifulSoup
>>>
>>> url = "http://www.hobo-web.co.uk/headers/"
>>> webPage = urlopen(url)
>>>
>>> soup = BeautifulSoup(webPage, "html.parser")
>>> print(soup.find("h1").get_text(strip=True))
How To Use H1-H6 HTML Elements Properly

I am not allowed to use any libraries other than what ships with Python. Does Python itself have the ability to parse HTML, even if less efficiently?

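If for some reason you are not allowed to use third-party libraries, you can use the built-in html.parser module. Some people also use regular expressions to parse HTML. That is not always a bad thing, but you have to be very careful; see: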

RegEx match open tags except XHTML self-contained tags
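To make the caution about regular expressions concrete, here is a minimal sketch of the regex route using only the standard library. The names `H1_PATTERN` and `first_h1_regex` are made up for illustration, and the approach assumes simple, well-formed markup: it does not decode entities such as `&amp;`, and it can be fooled by comments, scripts, or attribute values that contain `>`.

```python
import re

# Non-greedy match of the first <h1>...</h1> pair.
# IGNORECASE accepts <H1>; DOTALL lets the heading text span several lines.
H1_PATTERN = re.compile(r"<h1\b[^>]*>(.*?)</h1>", re.IGNORECASE | re.DOTALL)


def first_h1_regex(html):
    """Return the text of the first <h1> element, or None if there is none."""
    match = H1_PATTERN.search(html)
    if match is None:
        return None
    # Drop nested tags such as <em>...</em>, then collapse runs of whitespace.
    inner = re.sub(r"<[^>]+>", "", match.group(1))
    return " ".join(inner.split())


print(first_h1_regex("<h1>Hello, world!</h1>"))
```

This is probably good enough for a page you control, but for arbitrary websites a real parser is the safer choice.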

Answer 2

HTMLParser is definitely your best friend for this problem.
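As a rough sketch of what that could look like with only the standard library, the subclass below collects the text of every heading into a list. `HeadingCollector` and `extract_headings` are hypothetical names chosen for this example, not part of `html.parser`, and nested headings are not handled.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collect the text of every <h1>..<h6> element (illustrative helper)."""

    HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.headings = []   # finished heading strings
        self._parts = None   # text pieces of the heading being read, or None

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased, so <H1> is matched as well.
        if tag in self.HEADING_TAGS:
            self._parts = []

    def handle_data(self, data):
        # Only keep text while we are inside a heading.
        if self._parts is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag in self.HEADING_TAGS and self._parts is not None:
            # Join the pieces and collapse runs of whitespace.
            self.headings.append(" ".join("".join(self._parts).split()))
            self._parts = None


def extract_headings(html):
    """Return a list with the text of each heading found in the HTML string."""
    parser = HeadingCollector()
    parser.feed(html)
    parser.close()
    return parser.headings


print(extract_headings("<h1>Hello, world!</h1><p>body</p><h2>More <em>text</em></h2>"))
```

To use it on a live page, you could pass in `urlopen(url).read().decode("utf-8", errors="replace")`, assuming the page is UTF-8 encoded.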

There is already a related question that covers what you need.

Python list.append between text
