I am writing a Python program that extracts and stores metadata from interesting online technical articles: "og:title", "og:description", "og:image", "og:url" and "og:site_name".
This is the code I am using...
import urllib3
from bs4 import BeautifulSoup

# Set up the headers
headers = {}
headers['Accept'] = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
headers['Accept-Charset'] = 'ISO-8859-1,utf-8;q=0.7,*;q=0.3'
headers['Accept-Encoding'] = 'none'
headers['Accept-Language'] = "en-US,en;q=0.8"
headers['Connection'] = 'keep-alive'
headers['User-Agent'] = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36"
# Create the connection pool
http = urllib3.PoolManager()
# Send the request (url is the article address, defined elsewhere);
# headers must be passed by keyword, otherwise they are treated as form fields
response = http.request('GET', url, headers=headers)
# BeautifulSoup - Construct
soup = BeautifulSoup(response.data, 'html.parser')
# Scrape <meta property="og:title" content=" x x x ">, keeping the longest value
title = ""
for tag in soup.find_all("meta"):
    if tag.get("property", None) == "og:title":
        if len(tag.get("content", "")) > len(title):
            title = tag.get("content", "")
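The program works on every site except one. On "forbes.com", I cannot read the articles with Python:
I cannot get past this consent page; it appears to be TrustArc's "Cookie Consent Manager" solution. Basically, you give consent once on your computer... and on every subsequent run you can access the articles.
If I request the "toURL" URL directly: https://www.forbes.com/sites/shermanlee/2018/07/31/privacy-revolution-how-blockchain-is-reshaping-our-economy/#72c3b4e21086
to bypass the "https://www.forbes.com/consent/" page, I am redirected back to that page.
I tried to see whether I could set a cookie in the headers, but I could not find the magic key.
Can anyone help me?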
Answer 0 (score: 0)
There is a required cookie, notice_gdpr_prefs,
that must be sent in order to see the data:
import requests
from bs4 import BeautifulSoup

# Send the consent cookie along with the request
src = requests.get(
    "https://www.forbes.com/sites/shermanlee/2018/07/31/privacy-revolution-how-blockchain-is-reshaping-our-economy/",
    headers={
        "cookie": "notice_gdpr_prefs"
    })
soup = BeautifulSoup(src.content, 'html.parser')
title = soup.find("meta", property="og:title")
print(title["content"])
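The snippet above only reads og:title, while the question asks for five properties. A minimal sketch of collecting all of them from an HTML string might look like the following. It deliberately swaps BeautifulSoup for the standard library's html.parser so that it runs without extra packages; OpenGraphParser and extract_og_metadata are hypothetical names introduced here for illustration, not part of any library.

```python
from html.parser import HTMLParser

# The five Open Graph properties named in the question.
OG_PROPERTIES = ("og:title", "og:description", "og:image", "og:url", "og:site_name")


class OpenGraphParser(HTMLParser):
    """Hypothetical helper: collects the content of <meta property="og:..."> tags."""

    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        prop = attributes.get("property")
        content = attributes.get("content")
        # Keep only the wanted properties, and only their first occurrence.
        if prop in OG_PROPERTIES and content is not None and prop not in self.metadata:
            self.metadata[prop] = content.strip()


def extract_og_metadata(html):
    """Return a dict mapping each og: property found in html to its content."""
    parser = OpenGraphParser()
    parser.feed(html)
    return parser.metadata
```

Assuming the request above succeeded, calling extract_og_metadata(src.text) would return a dictionary with whichever of the five properties the page actually declares; properties that are missing are simply absent from the result.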