I am trying to extract information from this news page.
First, I parse the page:
import requests
from bs4 import BeautifulSoup
page = requests.get("https://www.theguardian.com/politics/2019/oct/20/boris-johnson-could-be-held-in-contempt-of-court-over-brexit-letter")
soup = BeautifulSoup(page.content, 'html.parser')
Then I start with the title:
title = soup.find('meta', property="og:title")
If I print it, I get:
<meta content="Boris Johnson could be held in contempt of court over Brexit letter" property="og:title"/>
However, when I run title.get_text(), the result is an empty string: ''
Where is my mistake?
Answer 0 (score: 1)
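That is because the tag does not actually contain any text. A <meta> tag has no child text nodes, so get_text() has nothing to return. In this case, the "text" you are after is stored in the content attribute of the <meta> tag, so you need to extract the value of content: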
import requests
from bs4 import BeautifulSoup
page = requests.get("https://www.theguardian.com/politics/2019/oct/20/boris-johnson-could-be-held-in-contempt-of-court-over-brexit-letter")
soup = BeautifulSoup(page.content, 'html.parser')
title = soup.find('meta', property="og:title")['content']
Output:
print(title)
Boris Johnson could be held in contempt of court over Brexit letter
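If the tag might be missing on some pages, indexing with ['content'] directly on the result of find() would raise a TypeError, because find() returns None when nothing matches. Below is a minimal sketch of a more defensive lookup. The helper name meta_content and the inline SAMPLE_HTML string are illustrative assumptions (the sample stands in for the downloaded page so the snippet runs without a network request):

```python
from bs4 import BeautifulSoup

# Inline HTML standing in for the downloaded page (an assumption, so that
# the example runs without a network request).
SAMPLE_HTML = """
<html><head>
<meta property="og:title" content="Boris Johnson could be held in contempt of court over Brexit letter"/>
</head><body></body></html>
"""


def meta_content(soup, prop):
    """Hypothetical helper: return the content attribute of
    <meta property=prop>, or None if the tag or attribute is absent."""
    tag = soup.find("meta", property=prop)
    if tag is None:
        return None
    # Tag.get() returns None instead of raising KeyError when the
    # attribute does not exist.
    return tag.get("content")


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# The <meta> tag has no child text nodes, so get_text() is empty.
empty_text = soup.find("meta", property="og:title").get_text()

title = meta_content(soup, "og:title")
missing = meta_content(soup, "og:nonexistent")

print(repr(empty_text))
print(title)
print(missing)
```

Here title holds the headline, while missing is None rather than an exception, which is usually easier to handle when scraping many pages.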
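You can use .attrs to get all of a tag's attributes and their values. It returns a dictionary (key: value pairs) of the attributes defined on the given tag: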
title = soup.find('meta', property="og:title")
print (title.attrs)
Output:
{'property': 'og:title', 'content': 'Boris Johnson could be held in contempt of court over Brexit letter'}
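The same idea extends to collecting every Open Graph tag on a page at once. The following is only a sketch: the SAMPLE_HTML string is an assumed stand-in for the real page, and the og:type value and the description text in it are made up for illustration. It relies on find_all() accepting a compiled regular expression as an attribute filter:

```python
import re

from bs4 import BeautifulSoup

# Assumed sample markup; only og:title and the URL come from the question,
# the og:type value and the description text are illustrative.
SAMPLE_HTML = """
<html><head>
<meta property="og:title" content="Boris Johnson could be held in contempt of court over Brexit letter"/>
<meta property="og:type" content="article"/>
<meta property="og:url" content="https://www.theguardian.com/politics/2019/oct/20/boris-johnson-could-be-held-in-contempt-of-court-over-brexit-letter"/>
<meta name="description" content="Not an Open Graph tag"/>
</head><body></body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Match every <meta> whose property attribute starts with "og:".
og_tags = soup.find_all("meta", property=re.compile(r"^og:"))

# Map each property name to its content value (empty string if absent).
og_data = {tag["property"]: tag.get("content", "") for tag in og_tags}

for key, value in og_data.items():
    print(key, "=>", value)
```

The <meta name="description"> tag is skipped because it has no property attribute, so og_data contains only the three og: entries.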