Python requests (web scraping) - building a cookie to be able to view data on a website

Date: 2018-08-26 15:02:47

Tags: python cookies web-scraping httprequest python-responses

I am trying to scrape a financial website to build an app that can compare the accuracy of financial data across various other sites (Google/Yahoo Finance). This is a personal project I started mainly to learn Python programming and script writing.

The URL I am trying to scrape (specifically, a stock's "Key Data" such as market cap, volume, etc.) is here:

https://www.marketwatch.com/investing/stock/sbux

I have found (with help from others) that a cookie needs to be built up and sent with each request in order for the page to display the data (otherwise the page's HTML response comes back nearly empty).

I used the Opera/Firefox/Chrome browsers to look at the HTTP headers and the requests sent by the browser. I concluded that 3 steps/requests are needed to receive all of the cookie data and build it up piece by piece.

Step/Request 1

Simply visit the URL above.

GET /investing/stock/sbux HTTP/1.1
Host: www.marketwatch.com:443
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.9
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36 OPR/55.0.2994.44

HTTP/1.1 200 OK
Cache-Control: max-age=0, no-cache, no-store
Connection: keep-alive
Content-Length: 579
Content-Type: text/html; charset=utf-8
Date: Sun, 26 Aug 2018 05:12:16 GMT
Expires: Sun, 26 Aug 2018 05:12:16 GMT
Pragma: no-cache
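
For reference, step 1 is easy to reproduce with the requests library. Below is a minimal sketch (header values copied from the capture above); the session object is reused in the later sketches so cookies accumulate across requests:

import requests

session = requests.Session()
session.headers.update({
    'User-Agent': ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/68.0.3440.106 Safari/537.36 OPR/55.0.2994.44'),
    'Accept-Language': 'en-US,en;q=0.9',
})
resp = session.get('https://www.marketwatch.com/investing/stock/sbux')
print(len(resp.text))              # tiny body (Content-Length 579 in the capture)
print(session.cookies.get_dict())  # any cookies handed out by step 1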

Step/Request 2

I am not sure where this "POST" URL comes from. However, using Firefox and viewing the network connections, this URL popped up in the "Stack Trace" tab. Again, I have no idea whether this URL is the same for everyone or created randomly. I also do not know what POST data is being sent, or where the values of X-Hash-Result and X-Token-Value come from. However, this request returns one very important line in the response headers: 'Set-Cookie: ncg_g_id_zeta=701c19ee3f45d07b56b40fb8e313214d'. This cookie is crucial for the next request, so that the full cookie is returned and the data appears on the web page.

POST /149e9513-01fa-4fb0-aad4-566afd725d1b/2d206a39-8ed7-437e-a3be-862e0f06eea3/fingerprint HTTP/1.1
Host: www.marketwatch.com:443
Accept: */*
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.9
Content-Type: application/json; charset=UTF-8
Origin: https://www.marketwatch.com
Referer: https://www.marketwatch.com/investing/stock/sbux
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36 OPR/55.0.2994.44
X-Hash-Result: 701c19ee3f45d07b56b40fb8e313214d
X-Token-Value: 900c4055-ef7a-74a8-e9ec-f78f7edc363b

HTTP/1.1 200 OK
Cache-Control: max-age=0, no-cache, no-store
Connection: keep-alive
Content-Length: 17
Content-Type: application/json; charset=utf-8
Date: Sun, 26 Aug 2018 05:12:16 GMT
Expires: Sun, 26 Aug 2018 05:12:16 GMT
Pragma: no-cache
Set-Cookie: ncg_g_id_zeta=701c19ee3f45d07b56b40fb8e313214d; Path=/; HttpOnly
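
Continuing with the session from the step 1 sketch, step 2 can only be sketched, not reliably reproduced: the POST path and the X-Hash-Result / X-Token-Value headers below are the captured values, and (per the summary questions) it is unknown how to generate them for a fresh session:

fingerprint_url = ('https://www.marketwatch.com'
                   '/149e9513-01fa-4fb0-aad4-566afd725d1b'
                   '/2d206a39-8ed7-437e-a3be-862e0f06eea3/fingerprint')
resp = session.post(
    fingerprint_url,
    headers={
        'Content-Type': 'application/json; charset=UTF-8',
        'Origin': 'https://www.marketwatch.com',
        'Referer': 'https://www.marketwatch.com/investing/stock/sbux',
        'X-Hash-Result': '701c19ee3f45d07b56b40fb8e313214d',      # origin unknown
        'X-Token-Value': '900c4055-ef7a-74a8-e9ec-f78f7edc363b',  # origin unknown
    },
    # The POST body observed in the browser is also unknown (question 2 below),
    # so nothing is sent here.
)
# On success, requests stores the Set-Cookie value in the session automatically:
print(session.cookies.get('ncg_g_id_zeta'))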

Step/Request 3

This request is sent to the original URL, with the cookie picked up in step 2. The full cookie is then returned in the response, which can be used in step 1 to avoid going through steps 2 and 3 again. It also displays the full page of data.

GET /investing/stock/sbux HTTP/1.1
Host: www.marketwatch.com:443
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.9
Cookie: ncg_g_id_zeta=701c19ee3f45d07b56b40fb8e313214d
Referer: https://www.marketwatch.com/investing/stock/sbux
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36 OPR/55.0.2994.44

HTTP/1.1 200 OK
Cache-Control: max-age=0, no-cache, no-store
Connection: keep-alive
Content-Encoding: gzip
Content-Length: 62944
Content-Type: text/html; charset=utf-8
Date: Sun, 26 Aug 2018 05:12:17 GMT
Expires: Sun, 26 Aug 2018 05:12:17 GMT
Pragma: no-cache
Server: Kestrel
Set-Cookie: seenads=0; expires=Sun, 26 Aug 2018 23:59:59 GMT; domain=.marketwatch.com; path=/
Set-Cookie: mw_loc=%7B%22country%22%3A%22CA%22%2C%22region%22%3A%22ON%22%2C%22city%22%3A%22MARKHAM%22%2C%22county%22%3A%5B%22%22%5D%2C%22continent%22%3A%22NA%22%7D; expires=Sat, 01 Sep 2018 23:59:59 GMT; domain=.marketwatch.com; path=/
Vary: Accept-Encoding
x-frame-options: SAMEORIGIN
x-machine: 8cfa9f20bf3eb
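
Step 3 then falls out naturally: the session already carries ncg_g_id_zeta, so repeating the original GET should return the full page, and the accumulated cookies can be persisted so future runs skip steps 2 and 3 (a sketch, continuing from the previous two; the filename is arbitrary):

import json

resp = session.get('https://www.marketwatch.com/investing/stock/sbux')
print(len(resp.text))  # full page this time (~63 KB compressed in the capture)

# Persist the accumulated cookie jar for reuse on later runs (arbitrary filename)
with open('mw_cookies.json', 'w') as f:
    json.dump(session.cookies.get_dict(), f)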

Summary

In summary, step 2 is the most important for getting the remaining pieces of the cookie... but I cannot figure out these three things:

1) Where the POST URL comes from (it is not embedded in the original page; is the URL the same for everyone, or is it randomly generated by the site?).

2) What data is being sent in the POST request?

3) Where do X-Hash-Result and X-Token-Value come from? Do they need to be sent in the request's headers?

This has been a good challenge for me and I have put several hours into it (I am also quite new to Python and HTTP web requests), so I feel someone with more experience might be able to crack this in a more timely manner.

Thanks to anyone who can help.

1 Answer:

Answer 0: (score: 0)

Greetings again, The6ix!

I spent some time tonight trying to get the cookie-string appending to work. MarketWatch has done a pretty good job of protecting their data. To construct the whole cookie you would need a wsj API key (I believe they are the site's financial data supplier) and some hidden variables that are potentially only available to the client's server, strictly withheld depending on the presence, or lack thereof, of your web driver.

For example, if you try hitting the following request: POST https://browser.pipe.aria.microsoft.com/Collector/3.0/?qsp=true&content-type=application/bond-compact-binary&client-id=NO_AUTH&sdk-version=ACT-Web-JS-2.7.1&x-apikey=c34cce5c21da4a91907bc59bce4784fb-42e261e9-5073-49df-a2e1-42415e012bc6-6954

you will receive a 400 unauthorized error.

Keep in mind there is a good chance that the client host server cluster's master and the various APIs it talks to communicate in ways our browsers cannot pick up in the network traffic - some sort of middleware, for example. I believe that could account for the missing X-Hash-Result and X-Token-Value values.

I am not saying it is impossible to build this cookie string, just that it is an inefficient route to take in terms of development time and effort. I also now question how easily this method would scale to tickers other than AAPL. Unless there is an explicit requirement not to use a web driver, and/or the script must be highly portable with no configuration installs allowed beyond pip, I would not choose this method.

This essentially leaves us with either a Scrapy spider or a Selenium scraper (with some extra environment configuration, unfortunately, but these are very important skills to learn if you want to write and deploy web crawlers; requests + bs4 is ideal for easy scrapes or unusual code-portability needs).

I went ahead and wrote a Selenium scraper ETL class for you using the PhantomJS web driver. It accepts a ticker string as a parameter and works on stocks other than AAPL. It was tricky, since marketwatch.com rejects traffic coming from a PhantomJS web driver (I can tell they have spent a lot of resources trying to deter web scrapers - far more than, say, yahoo.com).

Anyway, here is the final Selenium script. It runs on Python 2 and 3.

# Market Watch Test Scraper ETL
# Tested on python 2.7 and 3.5
# IMPORTANT: Ensure PhantomJS Web Driver is configured and installed

import pip  # note: pip.main() was removed in pip >= 10, so this helper assumes an older pip
import sys
import signal
import time


# Package installer function to handle missing packages
def install(package):
    print(package + ' package for Python not found, pip installing now....')
    pip.main(['install', package])
    print(package + ' package has been successfully installed for Python\n Continuing Process...')

# Ensure beautifulsoup4 is installed
try:
    from bs4 import BeautifulSoup
except ImportError:
    install('beautifulsoup4')
    from bs4 import BeautifulSoup

# Ensure selenium is installed
try:
    from selenium import webdriver
    from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
except ImportError:
    install('selenium')
    from selenium import webdriver
    from selenium.webdriver.common.desired_capabilities import DesiredCapabilities


# Class to extract and transform raw marketwatch.com financial data
class MarketWatchETL:

    def __init__(self, ticker):
        self.ticker = ticker.upper()
        # Set up desired capabilities to spoof Firefox since marketwatch.com rejects any PhantomJS Request
        self._dcap = dict(DesiredCapabilities.PHANTOMJS)
        self._dcap["phantomjs.page.settings.userAgent"] = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_4) "
                                                           "AppleWebKit/537.36 (KHTML, like Gecko) "
                                                           "Chrome/29.0.1547.57 Safari/537.36")
        self._base_url = 'https://www.marketwatch.com/investing/stock/'
        self._retries = 10

    # Private Static Method to clean and organize Key Data Extract
    @staticmethod
    def _cleaned_key_data_object(raw_data):
        cleaned_data = {}
        raw_labels = raw_data['labels']
        raw_values = raw_data['values']
        for raw_label, raw_value in zip(raw_labels, raw_values):
            cleaned_data.update({str(raw_label.get_text()): raw_value.get_text()})
        return cleaned_data

    # Private method to scrape data from MarketWatch's web page
    def _scrape_financial_key_data(self):
        raw_data_obj = {}
        try:
            driver = webdriver.PhantomJS(desired_capabilities=self._dcap)
        except Exception:
            print('***SETUP ERROR: The PhantomJS Web Driver is either not configured or incorrectly configured!***')
            sys.exit(1)
        driver.get(self._base_url + self.ticker)
        i = 0
        while i < self._retries:
            try:
                time.sleep(3)
                html = driver.page_source
                soup = BeautifulSoup(html, "html.parser")
                labels = soup.find_all('small', class_="kv__label")
                values = soup.find_all('span', class_="kv__primary")
                if labels and values:
                    raw_data_obj.update({'labels': labels})
                    raw_data_obj.update({'values': values})
                    break
                else:
                    i += 1
            except Exception:
                i += 1
                continue
        if i == self._retries:
            print('Please check your internet connection!\nUnable to connect...')
            sys.exit(1)
        driver.service.process.send_signal(signal.SIGTERM)
        driver.quit()
        return raw_data_obj

    # Public Method to return a Stock's Key Data Object
    def get_stock_key_data(self):
        raw_data = self._scrape_financial_key_data()
        return self._cleaned_key_data_object(raw_data)


# Script's Main Process to test MarketWatchETL('TICKER')
if __name__ == '__main__':

    # Run financial key data extracts for Microsoft, Apple, and Wells Fargo
    msft_key_data = MarketWatchETL('MSFT').get_stock_key_data()
    aapl_key_data = MarketWatchETL('AAPL').get_stock_key_data()
    wfc_key_data = MarketWatchETL('WFC').get_stock_key_data()

    # Print result dictionaries
    print(msft_key_data.items())
    print(aapl_key_data.items())
    print(wfc_key_data.items())

Which outputs:

dict_items([('Rev. per Employee', '$841.03K'), ('Short Interest', '44.63M'), ('Yield', '1.53%'), ('Market Cap', '$831.23B'), ('Open', '$109.27'), ('EPS', '$2.11'), ('Shares Outstanding', '7.68B'), ('Ex-Dividend Date', 'Aug 15, 2018'), ('Day Range', '108.51 - 109.64'), ('Average Volume', '25.43M'), ('Dividend', '$0.42'), ('Public Float', '7.56B'), ('P/E Ratio', '51.94'), ('% of Float Shorted', '0.59%'), ('52 Week Range', '72.05 - 111.15'), ('Beta', '1.21')])
dict_items([('Rev. per Employee', '$2.08M'), ('Short Interest', '42.16M'), ('Yield', '1.34%'), ('Market Cap', '$1.04T'), ('Open', '$217.15'), ('EPS', '$11.03'), ('Shares Outstanding', '4.83B'), ('Ex-Dividend Date', 'Aug 10, 2018'), ('Day Range', '216.33 - 218.74'), ('Average Volume', '24.13M'), ('Dividend', '$0.73'), ('Public Float', '4.82B'), ('P/E Ratio', '19.76'), ('% of Float Shorted', '0.87%'), ('52 Week Range', '149.16 - 219.18'), ('Beta', '1.02')])
dict_items([('Rev. per Employee', '$384.4K'), ('Short Interest', '27.44M'), ('Yield', '2.91%'), ('Market Cap', '$282.66B'), ('Open', '$58.87'), ('EPS', '$3.94'), ('Shares Outstanding', '4.82B'), ('Ex-Dividend Date', 'Aug 9, 2018'), ('Day Range', '58.76 - 59.48'), ('Average Volume', '18.45M'), ('Dividend', '$0.43'), ('Public Float', '4.81B'), ('P/E Ratio', '15.00'), ('% of Float Shorted', '0.57%'), ('52 Week Range', '49.27 - 66.31'), ('Beta', '1.13')])

The only extra step you need before running this is to install and configure the PhantomJS web driver in your deployment environment. If you need to automate the deployment of a web crawler like this, you could write a bash/PowerShell installer script to handle pre-configuring your environment's PhantomJS.
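
A quick pre-flight check along these lines (a sketch; shutil.which requires Python 3.3+) can confirm the binary is actually discoverable before the scraper tries to launch it:

import shutil
import sys

# Fail fast if the PhantomJS binary is not on PATH
if shutil.which('phantomjs') is None:
    sys.exit('PhantomJS was not found on PATH - install and configure it first.')
print('PhantomJS found at: ' + shutil.which('phantomjs'))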

Some resources for installing and configuring PhantomJS:

Windows/Mac PhantomJS Installation Executables

Debian Linux PhantomJS Installation Guide

RHEL PhantomJS Installation Guide

I apologize in advance if you consider this an incomplete answer. I simply doubt the practicality and feasibility of assembling the cookie in the manner I suggested in my earlier post.

I think the other practical option here is to write a Scrapy crawler, which I can try to do for you tomorrow night if you'd like.
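
For a rough idea of the shape such a spider would take, here is a bare-bones sketch (the class name is hypothetical). It reuses the kv__label / kv__primary selectors from the Selenium class above, but note it would run into the same cookie wall from the original question unless paired with cookie handling:

import scrapy


class MarketWatchKeyDataSpider(scrapy.Spider):
    name = 'marketwatch_key_data'
    start_urls = ['https://www.marketwatch.com/investing/stock/sbux']

    def parse(self, response):
        # Same label/value classes the Selenium scraper targets
        labels = response.css('small.kv__label::text').extract()
        values = response.css('span.kv__primary::text').extract()
        yield dict(zip(labels, values))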

Hope this helps!
