Extracting the title with BeautifulSoup

Asked: 2016-03-12 09:43:32

Tags: python-3.x beautifulsoup

I have this code:

from urllib import request
url = "http://www.bbc.co.uk/news/election-us-2016-35791008"
html = request.urlopen(url).read().decode('utf8')
html[:60]

from bs4 import BeautifulSoup
raw = BeautifulSoup(html, 'html.parser').get_text()
raw.find_all('title', limit=1)
print (raw.find_all("title"))
'<!doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN'

I want to extract the page title with BeautifulSoup, but I get this error:

Traceback (most recent call last):
  File "C:\Users\Passanova\AppData\Local\Programs\Python\Python35-32\test.py", line 8, in <module>
    raw.find_all('title', limit=1)
AttributeError: 'str' object has no attribute 'find_all'

Any suggestions would be appreciated.

4 answers:

Answer 0 (score: 2)

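To navigate the soup you need a BeautifulSoup object, not a string. get_text() returns a plain str, which has no find_all method, so remove the get_text() call on the soup.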

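Also, you can replace raw.find_all('title', limit=1) with find('title'), which returns the first matching tag itself (or None if there is no match) instead of a list with at most one element.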

Try this:

from urllib import request
url = "http://www.bbc.co.uk/news/election-us-2016-35791008"
html = request.urlopen(url).read().decode('utf8')
html[:60]

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
title = soup.find('title')

print(title) # Prints the tag
print(title.string) # Prints the tag string content
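
To make the difference concrete, here is a minimal sketch that runs without a network request. The inline html string is a hypothetical stand-in for the downloaded page, not the real BBC markup.

```python
from bs4 import BeautifulSoup

# Hypothetical inline document standing in for the downloaded page.
html = "<html><head><title>Example page</title></head><body><p>Hello</p></body></html>"

soup = BeautifulSoup(html, "html.parser")

text = soup.get_text()                      # plain str: no tree navigation available
matches = soup.find_all("title", limit=1)   # list-like result holding at most one tag
first = soup.find("title")                  # the tag itself, or None if nothing matches

print(type(text).__name__)
print(len(matches), matches[0].string)
print(first.string)
```

Calling text.find_all(...) here would raise the same AttributeError as in the question, because text is an ordinary string.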

Answer 1 (score: 1)

Instead of soup.find_all('title', limit=1) or soup.find('title'), you can use soup.title directly; it gives you the title tag.

from urllib import request
url = "http://www.bbc.co.uk/news/election-us-2016-35791008"
html = request.urlopen(url).read().decode('utf8')
html[:60]

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
title = soup.title
print(title)
print(title.string)

Answer 2 (score: 0)

It is as simple as this:

soup = BeautifulSoup(htmlString, 'html.parser')
title = soup.title.text

Here, soup.title returns a BeautifulSoup Tag object, namely the title element, and .text gives its text content as a string.
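
The answers in this thread use both .text and .string on the title tag. As a rough sketch of how they differ, assuming two hypothetical one-line documents: for a normal title they give the same text, but for an empty title .text is an empty string while .string is None.

```python
from bs4 import BeautifulSoup

# Hypothetical inputs: one page with a normal title, one with an empty title.
full = BeautifulSoup("<title>Example page</title>", "html.parser")
empty = BeautifulSoup("<title></title>", "html.parser")

print(full.title.name)         # the tag name
print(full.title.text)         # text content as a str
print(full.title.string)       # the single child string

print(repr(empty.title.text))  # empty string: there is no text to collect
print(empty.title.string)      # None: the tag has no single child string
```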

Answer 3 (score: 0)

On some pages I ran into a NoneType problem, because soup.title is None when the page has no title tag. One suggestion:

soup = BeautifulSoup(data, 'html.parser')
title = None  # stays None when the page has no <title> tag
if soup.title is not None:
    title = soup.title.string
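
One possible way to package that check is a small helper. This is only a sketch: extract_title and its default parameter are hypothetical names introduced here, not part of BeautifulSoup. It also covers the case where the tag exists but .string is None, for example an empty title.

```python
from bs4 import BeautifulSoup


def extract_title(html, default=None):
    """Return the stripped page title, or `default` if none can be read.

    Hypothetical helper for illustration; not a BeautifulSoup API.
    """
    soup = BeautifulSoup(html, "html.parser")
    # soup.title is None when there is no <title> tag at all;
    # soup.title.string is None when the tag has no single text child.
    if soup.title is None or soup.title.string is None:
        return default
    return soup.title.string.strip()


print(extract_title("<title> Example page </title>"))
print(extract_title("<p>no title here</p>"))
print(extract_title("<title></title>", default=""))
```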