How do I get all headings from a website using BeautifulSoup?

Asked: 2017-07-12 15:55:34

Tags: python web-scraping beautifulsoup python-requests

I am trying to scrape all the headings from a simple website. My attempt:

from bs4 import BeautifulSoup, SoupStrainer
import requests

url = "http://nypost.com/business"
page = requests.get(url)
data = page.text
soup = BeautifulSoup(data)
soup.find_all('h')

soup.find_all('h') returns [], but soup.h1, soup.h2 and so on return the corresponding data. Am I just calling the method incorrectly?

4 answers:

Answer 0 (score: 4)

Filter by regular expression:

import re

soup.find_all(re.compile('^h[1-6]$'))

This regular expression matches every tag whose name starts with h, has a single digit from 1 to 6 after the h, and ends right after the digit.
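
To illustrate the regex filter without depending on a live page, here is a minimal sketch that runs it against a small inline HTML string. The sample markup and the variable names are made up for the example, and the built-in "html.parser" is used only to avoid an extra dependency.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup, used here instead of fetching a real page.
html = """
<html><body>
<h1>Main title</h1>
<header>Not a heading</header>
<h2>Section</h2>
<p>Body text</p>
<h3>Subsection</h3>
<hr>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# A compiled pattern passed to find_all is matched against each tag name.
# The ^ and $ anchors keep names such as "header" and "hr" from matching.
heading_pattern = re.compile(r"^h[1-6]$")
headings = soup.find_all(heading_pattern)

heading_names = [tag.name for tag in headings]
heading_texts = [tag.get_text(strip=True) for tag in headings]

print(heading_names)
print(heading_texts)
```

Without the anchors, a pattern like 'h[1-6]' would still behave here, but a bare 'h' would also pick up header, hr, html and any other tag whose name contains an h.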

Answer 1 (score: 2)

If you would rather not use a regular expression, you could do something like this:

from bs4 import BeautifulSoup
import requests

url = "http://nypost.com/business"

page = BeautifulSoup(requests.get(url).text, "lxml")
for headlines in page.find_all("h3"):
    print(headlines.text.strip())

Result:

The epitome of chic fashion is the latest victim of retail's collapse
Rent-a-Center shares soar after rejecting takeover bid
NFL ad revenue may go limp with loss of erectile-dysfunction ads
'Pharma Bro' talked about sex with men to get my money, investor says

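... and so on.

Answer 2 (score: 2)

When using the find or find_all method, you can pass either a single string or a list of tag names. Either of the following works; the second form also runs on Python versions that lack f-strings: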

soup.find_all([f'h{i}' for i in range(1,7) ])

soup.find_all(['h{}'.format(i) for i in range(1,7)])
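
As a rough sketch of the list form, the snippet below applies it to a short inline HTML string (hypothetical sample markup, not the page from the question). It also shows that matches come back in document order rather than in the order of the list.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup with headings deliberately out of numeric order.
html = "<h2>Second</h2><p>skip</p><h1>First</h1><h6>Last</h6>"
soup = BeautifulSoup(html, "html.parser")

# Build ["h1", "h2", ..., "h6"] and match any tag whose name is in the list.
heading_tags = ["h{}".format(i) for i in range(1, 7)]
found = soup.find_all(heading_tags)

found_names = [tag.name for tag in found]
found_texts = [tag.get_text() for tag in found]

print(found_names)
print(found_texts)
```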

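Answer 3 (score: 0)

You need to call soup.find_all('h1') with the full tag name.

You could do something like:

for a in ["h1", "h2"]:
    # Print the matches for each heading level instead of discarding them
    print(soup.find_all(a))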
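
The empty result in the question comes from the fact that a string passed to find_all is compared against the whole tag name, so 'h' only matches a tag literally named h. The following minimal sketch, using made-up inline markup, shows this next to the attribute shorthand soup.h1, which is equivalent to soup.find('h1').

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup for illustration only.
html = "<h1>Title</h1><h2>Subtitle</h2>"
soup = BeautifulSoup(html, "html.parser")

# A plain string is an exact tag-name match, so there is no <h> tag to find.
no_match = soup.find_all("h")

# soup.h1 is shorthand for soup.find("h1"): the first <h1> in the document.
first_h1 = soup.h1
same_h1 = soup.find("h1")

# The full tag name finds every <h1>.
all_h1 = soup.find_all("h1")

print(len(no_match), first_h1.get_text(), len(all_h1))
```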