Question

I'm trying to build a web scraper that extracts the main headline from a news article.

#  -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup

url= input('enter the url \n')

r = requests.get(url)
content = r.content
soup = BeautifulSoup(content, "html.parser")
heading = soup.find_all('h1')
print(heading)
print(str.strip(heading[0].text))

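This only works when the headline is in an h1 tag; it raises an error when the headline is in an h2 or h3 tag instead. How can I modify this code so that it also works for h2 and h3 tags? Thanks in advance!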

Answer 1

BeautifulSoup is quite flexible. Just pass in a list of the tag names you want to find:

soup.find_all(['h1', 'h2', 'h3'])

You can even do this:

import re

soup.find_all(re.compile(r"^h\d$"))  # would match "h" followed by a single digit
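Note that find_all returns an empty list when nothing matches, so indexing it with [0] will still fail on a page that has none of these tags. A minimal sketch of one way to guard against that, assuming you want to prefer h1 over h2 over h3 (the function name main_heading and the tag_names parameter are made up for this illustration, not part of BeautifulSoup):

```python
from bs4 import BeautifulSoup


def main_heading(html, tag_names=("h1", "h2", "h3")):
    """Return the text of the first heading found, trying tag_names in order.

    Returns None when the page contains none of the given tags.
    """
    soup = BeautifulSoup(html, "html.parser")
    for name in tag_names:
        # find() returns the first matching tag, or None if there is no match
        tag = soup.find(name)
        if tag is not None:
            # strip=True trims leading and trailing whitespace from the text
            return tag.get_text(strip=True)
    return None


# Example with an inline HTML string instead of a live request
sample = "<html><body><h3>Sub</h3><h2> Main headline </h2></body></html>"
print(main_heading(sample))
```

With the original script you would call main_heading(r.content). If you would rather take whichever heading appears first in the document, regardless of level, soup.find(['h1', 'h2', 'h3']) does that in a single call.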
