BeautifulSoup can't find meta tag information

Time: 2018-08-03 14:57:03

Tags: python beautifulsoup python-requests

All three titles come back as None. However, when I view the page source, I can clearly see that twitter:title, og:title, and og:description are all there.

import requests
from bs4 import BeautifulSoup

url = 'https://www.vox.com/culture/2018/8/3/17644464/christopher-robin-review-pooh-bear-winnie'
response = requests.get(url)

soup = BeautifulSoup(response.text, "lxml")

title = soup.find("meta",  property="twitter:title")
title2 = soup.find("meta",  property="og:title")
title3 = soup.find("meta",  property="og:description")

print("TITLE: "+str(title))
print("TITLE2: "+str(title2))
print("TITLE3: "+str(title3))

2 answers:

Answer 0: (score: 0)

soup.find("meta", property="twitter:title") must be soup.find("meta", {"name": "twitter:title"}), because the page puts twitter:title in the name attribute, not in property. The other two lines work fine for me.
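
To make the name-versus-property difference concrete, here is a minimal sketch on an inline HTML snippet. The markup and the "Example title" value are hypothetical stand-ins, not the real Vox page, and the built-in "html.parser" backend is used only so the snippet runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page's <head>; not the real Vox markup.
html = """
<head>
  <meta name="twitter:title" content="Example title">
  <meta property="og:title" content="Example title">
</head>
"""

soup = BeautifulSoup(html, "html.parser")

# property=... is treated as an attribute filter, so it only matches tags
# that actually carry a property attribute with that value.
by_property = soup.find("meta", property="twitter:title")

# find() already uses the keyword `name` for the tag name, so a filter on
# the HTML name attribute has to be passed in the attrs dict instead.
by_name = soup.find("meta", attrs={"name": "twitter:title"})

og_title = soup.find("meta", property="og:title")

print(by_property)           # no tag has property="twitter:title"
print(by_name["content"])
print(og_title["content"])
```

The first lookup finds nothing because no tag in the snippet has property="twitter:title"; the same tag is found once the filter targets the name attribute.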

Answer 1: (score: 0)

You need to specify a User-Agent in the request headers. Also, twitter:title is in the name attribute:

from bs4 import BeautifulSoup
import requests

url = 'https://www.vox.com/culture/2018/8/3/17644464/christopher-robin-review-pooh-bear-winnie'

headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:61.0) Gecko/20100101 Firefox/61.0'}
response = requests.get(url, headers=headers)

soup = BeautifulSoup(response.text, "lxml")

# Attribute values containing ':' are quoted so the CSS selectors stay valid
title1 = soup.select_one('meta[name="twitter:title"]')['content']
title2 = soup.select_one('meta[property="og:title"]')['content']
title3 = soup.select_one('meta[property="og:description"]')['content']

print("TITLE1: "+str(title1))
print("TITLE2: "+str(title2))
print("TITLE3: "+str(title3))

Prints:

TITLE1: Christopher Robin is a corporate cash-in, but it fakes sincerity better than most
TITLE2: Christopher Robin is a corporate cash-in, but it fakes sincerity better than most
TITLE3: Winnie the Pooh and pals return to give their old friend a pep talk in a movie overshadowed by the company that made it.
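
Since select_one and find return None when nothing matches, indexing ['content'] directly raises a TypeError on pages that lack a tag. A small helper can try both attributes and fall back to None. This is only a sketch: meta_content is a hypothetical name, and the HTML below is made-up sample markup rather than the real page.

```python
from bs4 import BeautifulSoup


def meta_content(soup, key):
    """Return the content of the first <meta> whose name or property
    attribute equals key, or None if no such tag exists."""
    for attr in ("name", "property"):
        tag = soup.find("meta", attrs={attr: key})
        if tag is not None and tag.has_attr("content"):
            return tag["content"]
    return None


# Made-up sample markup for illustration only.
sample = """
<head>
  <meta name="twitter:title" content="Sample headline">
  <meta property="og:description" content="Sample summary">
</head>
"""

sample_soup = BeautifulSoup(sample, "html.parser")

twitter_title = meta_content(sample_soup, "twitter:title")
og_description = meta_content(sample_soup, "og:description")
og_image = meta_content(sample_soup, "og:image")  # not in the sample

print(twitter_title)
print(og_description)
print(og_image)
```

With this approach the caller does not need to know in advance whether a given key lives in name or property, and a missing tag shows up as None instead of an exception.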