Python screen scraping Forbes.com

Time: 2018-08-09 19:15:00

Tags: python redirect web-scraping

I'm writing a Python program that extracts and stores metadata from interesting online technology articles: "og:title", "og:description", "og:image", "og:url", and "og:site_name".

This is the code I'm using...

import urllib3
from bs4 import BeautifulSoup

# Setup Headers
headers = {}
headers['Accept'] = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
headers['Accept-Charset'] = 'ISO-8859-1,utf-8;q=0.7,*;q=0.3'
headers['Accept-Encoding'] = 'none'
headers['Accept-Language'] = "en-US,en;q=0.8"
headers['Connection'] = 'keep-alive'
headers['User-Agent'] = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36"

# Create the connection pool
http = urllib3.PoolManager()

# Create the Response (method is 'GET' with no trailing space; headers go in by keyword)
response = http.request('GET', url, headers=headers)

# BeautifulSoup - Construct
soup = BeautifulSoup(response.data, 'html.parser')

# Scrape <meta property="og:title" content=" x x x ">, keeping the longest title found
title = ""
for tag in soup.find_all("meta"):
    if tag.get("property") == "og:title":
        content = tag.get("content") or ""
        if len(content) > len(title):
            title = content
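
The snippet above only collects "og:title". As a minimal sketch of how all five properties from the question could be gathered in one pass, the example below swaps in the standard library's html.parser in place of BeautifulSoup so that it runs without extra packages. The names OG_KEYS, OpenGraphParser, and extract_og_metadata are hypothetical helpers invented for illustration, not part of the original program.

```python
from html.parser import HTMLParser

# The five Open Graph properties named in the question.
OG_KEYS = ("og:title", "og:description", "og:image", "og:url", "og:site_name")


class OpenGraphParser(HTMLParser):
    """Collects <meta property="og:..." content="..."> values (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        prop = attr.get("property")
        content = (attr.get("content") or "").strip()
        if prop in OG_KEYS and content:
            # Keep the longest value seen, mirroring the title logic above.
            if len(content) > len(self.metadata.get(prop, "")):
                self.metadata[prop] = content


def extract_og_metadata(html):
    """Return a dict of the og:* properties found in an HTML string."""
    parser = OpenGraphParser()
    parser.feed(html)
    return parser.metadata
```

HTMLParser.feed expects a str, so bytes such as response.data would need to be decoded first, for example with response.data.decode("utf-8", errors="replace").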

The program works on every site except one. On "forbes.com", I can't get to the articles using Python:

url = "https://www.forbes.com/consent/?toURL=https://www.forbes.com/sites/shermanlee/2018/07/31/privacy-revolution-how-blockchain-is-reshaping-our-economy/#72c3b4e21086"

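I can't get past this consent page, which appears to be the "Cookie Consent Manager" solution from "TrustArc". Basically, on a computer you give your consent once... and on each subsequent run you are able to access the articles.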

If I reference the "toURL" URL: https://www.forbes.com/sites/shermanlee/2018/07/31/privacy-revolution-how-blockchain-is-reshaping-our-economy/#72c3b4e21086

and bypass the "https://www.forbes.com/consent/" page, I am redirected back to that page.
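
To make the redirect easier to spot in a scraper, here is a small sketch that recognizes a consent URL of the shape shown above and recovers the article address from its "toURL" query parameter. It uses only urllib.parse; unwrap_consent_url is a hypothetical helper name, and the assumption that the consent page always lives at the path /consent/ comes solely from the single URL quoted in this question.

```python
from urllib.parse import urlsplit, parse_qs


def unwrap_consent_url(url):
    """Return the toURL target if url is a Forbes consent URL, else url unchanged."""
    parts = urlsplit(url)
    # Assumption: the consent page is served from /consent/ on forbes.com.
    if parts.netloc.endswith("forbes.com") and parts.path.rstrip("/") == "/consent":
        targets = parse_qs(parts.query).get("toURL")
        if targets:
            return targets[0]
    return url
```

Note that the trailing "#72c3b4e21086" is parsed as the fragment of the outer consent URL, so it is not part of the recovered article address. Fetching the recovered address still triggers the redirect described above unless the server sees the consent cookie.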

I tried to see whether there was a cookie I could set in the headers, but I couldn't find the magic key.

Can anyone help me?

1 answer:

Answer 0 (score: 0)

There is a required cookie, notice_gdpr_prefs, that has to be sent in order to see the data:

import requests
from bs4 import BeautifulSoup

# Send the consent cookie so the article is served instead of the consent page
src = requests.get(
    "https://www.forbes.com/sites/shermanlee/2018/07/31/privacy-revolution-how-blockchain-is-reshaping-our-economy/",
    headers={
        "cookie": "notice_gdpr_prefs"
    })

# Parse the page and print the content of its og:title meta tag
soup = BeautifulSoup(src.content, 'html.parser')
title = soup.find("meta", property="og:title")
print(title["content"])
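
If the cookie is rejected, the code above fails at title["content"] because the consent page is parsed instead of the article. The following is a rough sketch of the same idea with an explicit check on where the request ended up, written against the standard library's urllib.request in place of requests. The helper names build_request, landed_on_consent_page, and fetch_html are hypothetical, the short User-Agent string is a placeholder, and whether the site still accepts the bare notice_gdpr_prefs cookie is an assumption carried over from this answer.

```python
import urllib.request
from urllib.parse import urlsplit

# Cookie string taken from the answer; its continued validity is an assumption.
GDPR_COOKIE = "notice_gdpr_prefs"


def build_request(url, cookie=GDPR_COOKIE):
    """Build a GET request that carries the consent cookie."""
    return urllib.request.Request(
        url, headers={"Cookie": cookie, "User-Agent": "Mozilla/5.0"}
    )


def landed_on_consent_page(final_url):
    """True if the URL reached after redirects is still the consent page."""
    return urlsplit(final_url).path.startswith("/consent")


def fetch_html(url):
    """Fetch url with the cookie; raise if the consent page was served anyway."""
    with urllib.request.urlopen(build_request(url), timeout=10) as resp:
        if landed_on_consent_page(resp.geturl()):
            raise RuntimeError("still redirected to the consent page")
        return resp.read()
```

fetch_html returns bytes, which can be passed straight to BeautifulSoup(..., 'html.parser') as in the answer.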