Question

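So this code takes a website and adds all of the header information to a list. How can I modify the list so that, when the program prints it, each item appears on its own line and the header tags are removed?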

from urllib.request import urlopen
address = "http://www.w3schools.com/html/html_head.asp"
webPage = urlopen (address)

encoding = "utf-8"

list = []

for line in webPage:
    findHeader = ('<h1>', '<h2>', '<h3>', '<h4>', '<h5>', '<h6>')
    line = str(line, encoding)
    for startHeader in findHeader:        
        endHeader = '</'+startHeader[1:]
        if (startHeader in line) and (endHeader in line):
            content = line.split(startHeader)[1].split(endHeader)[0]
            list.append(line)
            print (list)

webPage.close()

Answer 1

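If you don't mind using a third-party package, try BeautifulSoup to convert the HTML to plain text. Once you have the list, you can remove print (list) from the loop and do this: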

for e in list:
    # .rstrip() to remove trailing '\r\n'
    print(BeautifulSoup(e.rstrip(), "html.parser").text)

But don't forget to import BeautifulSoup first:

from bs4 import BeautifulSoup

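I'm assuming you installed bs4 (pip3 install beautifulsoup4) before running this example.

Alternatively, you can use regular expressions to strip the HTML tags, but that is likely to be more verbose and more error-prone than using an HTML parser such as bs4.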
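
To make the regular-expression route concrete, here is a minimal sketch that uses only the standard library. The helper name strip_tags and the sample strings are made up for illustration. The pattern is deliberately naive and will misbehave on input such as a '>' inside an attribute value, which is part of why a real parser is usually the safer choice.

```python
import re
from html import unescape

# Naive pattern: anything from '<' up to the next '>' is treated as a tag.
TAG_RE = re.compile(r'<[^>]*>')

def strip_tags(fragment):
    """Remove tags from an HTML fragment and decode entities such as &lt;."""
    text = TAG_RE.sub('', fragment)
    return unescape(text).strip()

# Sample lines in the shape the question's loop collects (hypothetical data).
samples = [
    '<h1>HTML <span class="color_h1">Head</span></h1>\r\n',
    '<h2>The HTML &lt;head&gt; Element</h2>\r\n',
]

for sample in samples:
    print(strip_tags(sample))
```

Tags are removed before entities are decoded, so an escaped &lt;head&gt; in the text survives as a literal <head> instead of being stripped.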

Answer 2

Sorry, I don't quite understand what you are trying to do.

But, for example, you can easily collect all the unique headers in a dict:

from urllib.request import urlopen
import re

address = "http://www.w3schools.com/html/html_head.asp"
webPage = urlopen(address)

# get page content
response = str(webPage.read(), encoding='utf-8')

# leave only <h*> tags content
p = re.compile(r'<(h[0-9])>(.+?)</\1>', re.IGNORECASE | re.DOTALL)
headers = re.findall(p, response)

# headers dict
my_headers = {}

for (tag, value) in headers:
    if tag not in my_headers:
        my_headers[tag] = []

    # remove all tags inside (re.sub returns a new string, so keep the result)
    value = re.sub('<[^>]*>', '', value)

    # replace a few special chars
    value = value.replace('&lt;', '<')
    value = value.replace('&gt;', '>')

    if value not in my_headers[tag]:
        my_headers[tag].append(value)

# output
print(my_headers)

Output:

{'h2': ['The HTML <head> Element', 'Omitting <html> and <body>?', 'Omitting <head>', 'The HTML <title> Element', 'The HTML <style> Element', 'The HTML <link> Element', 'The HTML <meta> Element', 'The HTML <script> Element', 'The HTML <base> Element', 'HTML head Elements', 'Your Suggestion:', 'Thank You For Helping Us!'], 'h4': ['Top 10 Tutorials', 'Top 10 References', 'Top 10 Examples', 'Web Certificates'], 'h1': ['HTML Head'], 'h3': ['Example', 'W3SCHOOLS EXAMS', 'COLOR PICKER', 'SHARE THIS PAGE', 'LEARN MORE:', 'HTML/CSS', 'JavaScript', 'HTML Graphics', 'Server Side', 'Web Building', 'XML Tutorials', 'HTML', 'CSS', 'XML', 'Charsets']}

Answer 3

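You asked for results without the header tags. You already have those values in the content variable, but instead of adding content to the result list, you add line, which is the entire original line.

Next, you asked for each item to be printed on its own line. To do that, first remove the print statement inside the loop; it prints the entire list every time a result is added. Then add new code at the bottom of the program, outside all of the loops: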

for item in list:
    print(item)
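
Putting both changes together, the question's loop might look like the sketch below. It runs on a small hard-coded list of byte strings (invented sample data) instead of the live page so it can be tried offline, and it renames list to results so the built-in is not shadowed. Both of those are assumptions made for illustration, not part of the original program.

```python
encoding = "utf-8"

# Stand-in for the lines returned by urlopen(address); sample data only.
web_page = [
    b"<html><body>\r\n",
    b"<h1>HTML Head</h1>\r\n",
    b"<p>Some paragraph text.</p>\r\n",
    b"<h2>The HTML title Element</h2>\r\n",
]

find_header = ('<h1>', '<h2>', '<h3>', '<h4>', '<h5>', '<h6>')
results = []  # renamed from `list` so the built-in is not shadowed

for line in web_page:
    line = str(line, encoding)
    for start_header in find_header:
        end_header = '</' + start_header[1:]
        if (start_header in line) and (end_header in line):
            content = line.split(start_header)[1].split(end_header)[0]
            results.append(content)  # append the extracted text, not the whole line

# Print once, after all loops, one item per line.
for item in results:
    print(item)
```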

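However, your technique for identifying headers in HTML isn't very robust. It expects the matching start and end tags to be on the same line. It also expects only one header of each type on a line. And it expects every start tag to have a matching end tag. You can't rely on any of these, even in valid HTML.

Vrs's answer is on the right track in suggesting Beautiful Soup, but rather than using it only to strip tags from the results, you can also use it to find the results in the first place. Consider the following code: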
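
A quick way to see the first two limitations is to feed the line-by-line check a header that spans two lines and a line that holds two headers of the same type. The helper headers_by_line and the sample markup below are hypothetical, written only to mirror the question's logic:

```python
def headers_by_line(lines):
    """Mirror the question's approach: look for <hN>...</hN> within a single line."""
    find_header = ('<h1>', '<h2>', '<h3>', '<h4>', '<h5>', '<h6>')
    found = []
    for line in lines:
        for start_header in find_header:
            end_header = '</' + start_header[1:]
            if start_header in line and end_header in line:
                found.append(line.split(start_header)[1].split(end_header)[0])
    return found

# Sample markup: the h2 is split across two lines,
# and the last line holds two h3 headers.
sample = [
    '<h1>First</h1>\n',
    '<h2>Second\n',
    'header</h2>\n',
    '<h3>Third</h3><h3>Fourth</h3>\n',
]

print(headers_by_line(sample))  # ['First', 'Third']
```

The split h2 is never reported because no single line contains both its start and end tag, and 'Fourth' is lost because only the first match on a line is taken.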

from bs4 import BeautifulSoup
from urllib.request import urlopen

address = "http://www.w3schools.com/html/html_head.asp"
webPage = urlopen(address)

# The list of tag names we want to find
# Just the names, not the angle brackets    
findHeader = ('h1', 'h2', 'h3', 'h4', 'h5', 'h6')

soup = BeautifulSoup(webPage, 'html.parser')
headers = soup.find_all(findHeader)
for header in headers:
    print(header.get_text())

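The find_all method accepts a list of tag names and returns Tag objects representing each result, in document order. We store the list in headers and print the text of each one. The get_text method shows only the text portion of a tag, omitting not just the surrounding header tags but also any embedded markup. (For example, the page you're scraping has some embedded span tags.)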
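
To try this without network access, the same calls can be run on a small inline document. The markup below is invented sample data, and the whitespace normalization with split and join is an extra step added here so that a header spanning two lines prints on one line. The sketch assumes bs4 is installed.

```python
from bs4 import BeautifulSoup

# Small inline document (sample data) so no network access is needed.
html_doc = """
<html><body>
<h1>HTML <span class="color_h1">Head</span></h1>
<p>Not a header.</p>
<h2>The HTML &lt;head&gt;
Element</h2>
</body></html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# get_text() drops the embedded span and decodes entities;
# " ".join(...split()) collapses the line break inside the h2.
header_texts = [" ".join(h.get_text().split())
                for h in soup.find_all(['h1', 'h2'])]

for text in header_texts:
    print(text)
```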

How do I omit the <h> tags in this code?
