Python - How to scrape stock "Key Data" from MarketWatch.com when the data is dynamically generated? How do I find the data request call?

Time: 2018-08-22 15:44:24

Tags: python web-scraping beautifulsoup

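I am working on a sample/personal project that retrieves stock data for a specific stock once a day from a site such as MarketWatch, then compares that data with other sites (such as Google Finance, Yahoo Finance, or Reuters) to test its accuracy.

I am stuck on retrieving the data from MarketWatch. The "Key Data" I am looking for (found at https://www.marketwatch.com/investing/stock/aapl) appears to be dynamically generated: when I fetch the page HTML programmatically, it contains almost none of the data I see when visiting the site in a browser.

I tried opening the developer console in my browser and looking for AJAX calls, but could not find any. I could easily skip collecting data from MarketWatch and move on, but I see this as a challenge to improve my l33t programming skills.

Can anyone point me in the right direction? I would like to find the correct call for the data request, or is the data perhaps only served when specific values are sent in the headers? That is my guess. I am using Python and Beautiful Soup to parse the data.

Thank you for your time.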

1 Answer:

Answer 0: (score: 1)

NEW INFORMATION PER OP'S COMMENT

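My mistake! Try something like this. The key is to obtain the cookie and make sure it is in the headers of your GET request. You can get this cookie manually from the Network tab of your browser's developer tools while on that MarketWatch page. Find the GET request for the page there, open its request headers, and copy/paste the Cookie value into your code. It is a very long string. The server will not return the full page without it.

I am sure there is a way to obtain this cookie from marketwatch.com in code before making the actual GET request that contains your data. I can try to figure that out as well if you need it.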
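Since the pasted cookie is one very long string, it may be easier to keep it in a single variable (or a file) and turn it into a dictionary instead of splitting it by hand across source lines. The sketch below is only an illustration: `cookie_header_to_dict` is a hypothetical helper name, and the sample string is a shortened piece of the cookie used in this answer.

```python
def cookie_header_to_dict(raw_cookie):
    """Split a raw Cookie header value ("a=1; b=2") into a dict."""
    cookies = {}
    for part in raw_cookie.split(";"):
        part = part.strip()
        if not part:
            continue
        # Split on the first "=" only, because cookie values may contain "=".
        name, separator, value = part.partition("=")
        if not separator:
            continue  # skip fragments that are not name=value pairs
        cookies[name.strip()] = value.strip()
    return cookies


# Shortened sample; paste the full value copied from the browser instead.
raw = "refresh=on; letsGetMikey=enabled; seenads=0; __gads=ID=c71aa1564ab44c97:T=1534997138"
cookies = cookie_header_to_dict(raw)
print(cookies["__gads"])

# The dict can then be passed to requests in place of the "Cookie" header:
#   requests.get(url, headers={"User-Agent": "..."}, cookies=cookies)
```

Whether the server accepts a reduced set of cookies is not something this sketch establishes; it only changes how the copied value is stored and passed along.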

import requests
from bs4 import BeautifulSoup

url = 'https://www.marketwatch.com/investing/stock/aapl'
r = requests.get(url, headers={"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0",
                               "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
                               "Cookie": "refresh=on; letsGetMikey=enabled; "
                                         "MicrosoftApplicationsTelemetryDeviceId=46fa0ca5-2561-7fe5-fd62-5b632398b7f4; "
                                         "MicrosoftApplicationsTelemetryFirstLaunchTime=1534997155966; "
                                         "pf_ffm=9bffce74bd493d996d1ae35769695510; "
                                         "mw_loc=%7B%22country%22%3A%22US%22%2C%22region%22%3A%22TX%22%2C%22city%22%3A%22"
                                         "PLANO%22%2C%22county%22%3A%5B%22COLLIN%22%5D%2C%22continent%22%3A%22NA%22%7D; "
                                         "seenads=0; fullcss-quote=quote-85dcea2e5c.min.css; "
                                         "utag_main=v_id:016564f545b40022de054359ac4403044003000900bd0$_sn:1$_ss:0$_st:"
                                         "1534999294146$ses_id:1534997120440%3Bexp-session$_pn:2%3Bexp-session$"
                                         "_prevpage:MW_Quote_Page%3Bexp-1535001094154$vapi_domain:marketwatch.com; "
                                         "AMCV_CB68E4BA55144CAA0A4C98A5%40AdobeOrg=-1891778711%7CMCIDTS%7C17767%7CMCMID"
                                         "%7C01084133064198290912411637324115388504%7CMCAAMLH-1535601937%7C9%7CMCAAMB"
                                         "-1535601937%7CRKhpRz8krg2tLO6pguXWp5olkAcUniQYPHaMWWgdJ3xzPWQmdj0y%7CMCOPTOUT"
                                         "-1535004337s%7CNONE%7CMCSYNCSOP%7C411-17774%7CMCAID%7CNONE%7CvVersion%7C2.4.0"
                                         "; icons-loaded=true; AMCVS_CB68E4BA55144CAA0A4C98A5%40AdobeOrg=1; __gads=ID="
                                         "c71aa1564ab44c97:T=1534997138:S=ALNI_Mbyv41MxhHTThfXFxMGtCFVyzsQaQ; "
                                         "vidoraUserId=agqj4i6ugtd359uhgkfl761k4uu55g; __qca=P0-1349423161-15349971262"
                                         "36; _ncg_sp_ses.f57d=*; _ncg_sp_id.f57d=b8b37d7b-2719-4a9c-baf9-3695f9deb20"
                                         "8.1534997155.1.1534997520.1534997155.9a279294-1ff6-449e-a605-29c39215cfb4;"
                                         " _ncg_id_=b8b37d7b-2719-4a9c-baf9-3695f9deb208; _ncg_g_id_=bd42bc08-4ebd-"
                                         "44a1-8e7a-6e1c3eaac874; _parsely_visitor={%22id%22:%2211d7400c-c4f5-4322-"
                                         "b109-0b01a21a74f2%22%2C%22session_count%22:1%2C%22last_session_"
                                         "ts%22:1534997165510}; _parsely_session={%22sid%22:1%2C%22surl%22:"
                                         "%22https://www.marketwatch.com/investing/stock/aapl%22%2C%22sref%22:"
                                         "%22%22%2C%22sts%22:1534997165510%2C%22slts%22:0}; "
                                         "s_ppvl=MW_Quote_Page%2C27%2C27%2C945%2C1076%2C945%2C1920%2C1080%2C1%2CP;"
                                         " s_ppv=MW_Quote_Page%2C23%2C23%2C945%2C1076%2C945%2C1920%2C1080%2C1%2CP;"
                                         " s_cc=true; cX_P=jl61ojfsquomne4l; usr_bkt=63L1D4y2F9; cX_S=jl61ojgcxgxachax;"
                                         " cX_G=cx%3A12c0heqgxq7ug25eyhsfbg5iro%3A3qlznewunoji0; "
                                         "recentqsmkii=Stock-US-AAPL; __utma=246750488.1666075546.1534997552."
                                         "1534997552.1534997552.1; __utmb=246750488.1.9.1534997559734; "
                                         "__utmc=246750488; __utmz=246750488.1534997552.1.1.utmcsr=(direct)"
                                         "|utmccn=(direct)|utmcmd=(none)"})
print(r)
soup = BeautifulSoup(r.content, "html.parser")
key_data = soup.find_all('li', class_="kv__item")
# Key Data Field Names
print(soup.find_all('small', class_="kv__label"))
# Key Data Field Values
print(soup.find_all('span', class_="kv__primary"))

Response:

<Response [200]>
[<small class="kv__label">Open</small>, <small class="kv__label">Day Range</small>, <small class="kv__label">52 Week Range</small>, <small class="kv__label">Market Cap</small>, <small class="kv__label">Shares Outstanding</small>, <small class="kv__label">Public Float</small>, <small class="kv__label">Beta</small>, <small class="kv__label">Rev. per Employee</small>, <small class="kv__label">P/E Ratio</small>, <small class="kv__label">EPS</small>, <small class="kv__label">Yield</small>, <small class="kv__label">Dividend</small>, <small class="kv__label">Ex-Dividend Date</small>, <small class="kv__label">Short Interest</small>, <small class="kv__label">% of Float Shorted</small>, <small class="kv__label">Average Volume</small>]
[<span class="kv__value kv__primary ">$214.10</span>, <span class="kv__value kv__primary ">213.84 - 216.36</span>, <span class="kv__value kv__primary ">149.16 - 219.18</span>, <span class="kv__value kv__primary ">$1.04T</span>, <span class="kv__value kv__primary ">4.83B</span>, <span class="kv__value kv__primary ">4.82B</span>, <span class="kv__value kv__primary ">1.02</span>, <span class="kv__value kv__primary ">$2.08M</span>, <span class="kv__value kv__primary ">19.50</span>, <span class="kv__value kv__primary ">$11.03</span>, <span class="kv__value kv__primary ">1.36%</span>, <span class="kv__value kv__primary ">$0.73</span>, <span class="kv__value kv__primary ">Aug 10, 2018</span>, <span class="kv__value kv__primary ">37.27M</span>, <span class="kv__value kv__primary ">0.77%</span>, <span class="kv__value kv__primary ">24.1M</span>]
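The values come back as display strings such as "$1.04T" or "0.77%", so comparing them against another data source probably requires converting them to numbers first. Below is a minimal sketch of one way to do that. The function name `parse_key_data_value` and the suffix table are assumptions made for illustration, and ranges and dates (for example "213.84 - 216.36" or "Aug 10, 2018") are deliberately left unparsed.

```python
# Assumed meaning of the suffixes seen in the output above.
SUFFIX_MULTIPLIERS = {"K": 1e3, "M": 1e6, "B": 1e9, "T": 1e12}


def parse_key_data_value(text):
    """Convert strings like '$1.04T', '24.1M' or '1.36%' to a float.

    Percentages are returned as percent values (1.36, not 0.0136).
    Returns None for anything that is not a single number.
    """
    cleaned = text.strip().replace("$", "").replace(",", "")
    if cleaned.endswith("%"):
        cleaned = cleaned[:-1]
    if not cleaned:
        return None
    multiplier = 1.0
    suffix = cleaned[-1].upper()
    if suffix in SUFFIX_MULTIPLIERS:
        multiplier = SUFFIX_MULTIPLIERS[suffix]
        cleaned = cleaned[:-1]
    try:
        return float(cleaned) * multiplier
    except ValueError:
        return None  # ranges, dates and other non-numeric text


for raw in ["$214.10", "$1.04T", "24.1M", "0.77%", "Aug 10, 2018"]:
    print(raw, "->", parse_key_data_value(raw))
```

In the scraping code above, the input strings would come from calling `.get_text()` on each `kv__primary` span.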

END NEW INFORMATION

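If you want the daily stock price data from the chart on that MarketWatch page, a similar approach works. They do have an API route. You may need to update the EntitlementToken for it to work.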

import requests
import json

# You may need to update the EntitlementToken. To do so, go to https://www.marketwatch.com/investing/stock/aapl,
# watch the network connections, find the API call, and parse out the token.
# If the token does not match, the API call will return a 400.

req_url = 'https://api-secure.wsj.net/api/michelangelo/timeseries/history?json={"Step":"PT1M","TimeFrame":"D1",' \
          '"EntitlementToken":"cecc4267a0194af89ca343805a3e57af","IncludeMockTick":true,"FilterNullSlots":false,' \
          '"FilterClosedPoints":true,"IncludeClosedSlots":false,"IncludeOfficialClose":true,"InjectOpen":false,' \
          '"ShowPreMarket":false,"ShowAfterHours":false,"UseExtendedTimeFrame":false,"WantPriorClose":true,' \
          '"IncludeCurrentQuotes":false,"ResetTodaysAfterHoursPercentChange":false,' \
          '"Series":[{"Key":"STOCK/US/XNAS/AAPL","Dialect":"Charting","Kind":"Ticker","SeriesId":"s1",' \
          '"DataTypes":["Last"],"Indicators":[{"Parameters":[{"Name":"ShowOpen"},{"Name":"ShowHigh"},' \
          '{"Name":"ShowLow"},{"Name":"ShowPriorClose","Value":true},{"Name":"Show52WeekHigh"},' \
          '{"Name":"Show52WeekLow"}],"Kind":"OpenHighLowLines","SeriesId":"i2"}]}]}&ckey=cecc4267a0'

r = requests.get(req_url, headers={"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0",
                                   "Content-Type": "application/json, text/javascript, */*; q=0.01",
                                   "Dylan2010.EntitlementToken": "cecc4267a0194af89ca343805a3e57af"})
# Full Return
print(r)

# Stock UNIX Dates
print(json.loads(r.content)['TimeInfo']['Ticks'])

# Stock Prices
print(json.loads(r.content)['Series'][0]['DataPoints'])

This prints the following data (only the first 5 records of the return are shown as a sample):

<Response [200]>
# Unix Datetime Stamps
[1534944600000, 1534944660000, 1534944720000, 1534944780000, 1534944840000]
# AAPL Prices
[[214.9001], [214.81], [214.84], [215.31], [215.2]]
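The timestamps and prices arrive as two parallel lists, so a small helper can join them into readable rows. This is a sketch under two assumptions: that the ticks are milliseconds since the Unix epoch in UTC (the 13-digit values spaced 60000 apart are consistent with the "PT1M" step), and that each data point is a one-element list because only "Last" was requested. The name `pair_ticks_with_prices` is hypothetical.

```python
from datetime import datetime, timezone


def pair_ticks_with_prices(ticks, data_points):
    """Return (UTC datetime, price) tuples, skipping empty slots."""
    rows = []
    for tick_ms, point in zip(ticks, data_points):
        price = point[0] if point else None
        if price is None:
            continue  # slot without a trade
        timestamp = datetime.fromtimestamp(tick_ms / 1000, tz=timezone.utc)
        rows.append((timestamp, price))
    return rows


# Sample values copied from the output above.
ticks = [1534944600000, 1534944660000, 1534944720000, 1534944780000, 1534944840000]
prices = [[214.9001], [214.81], [214.84], [215.31], [215.2]]

for timestamp, price in pair_ticks_with_prices(ticks, prices):
    print(timestamp.isoformat(), price)
```

With the sample values, the first tick converts to 2018-08-22 13:30 UTC, which is 09:30 in New York and matches the market open.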

If you need regular access to free financial data, I strongly recommend YahooFinancials:

https://github.com/JECSand/yahoofinancials

Installation:

$ pip install yahoofinancials

Usage example:

from yahoofinancials import YahooFinancials

tech_stocks = ['AAPL', 'MSFT', 'INTC']
yahoo_financials_tech = YahooFinancials(tech_stocks)
print(yahoo_financials_tech.get_historical_price_data("2018-08-01", "2018-08-10", "weekly"))

Result:

    {
        "AAPL": {
            "currency": "USD", 
            "eventsData": {
                "dividends": {
                    "2018-08-06": {
                        "amount": 0.73, 
                        "date": 1533907800, 
                        "formatted_date": "2018-08-10"
                    }
                }
            }, 
            "firstTradeDate": {
                "date": 345459600, 
                "formatted_date": "1980-12-12"
            }, 
            "instrumentType": "EQUITY", 
            "prices": [
                {
                    "adjclose": 207.2631072998047, 
                    "close": 207.99000549316406, 
                    "date": 1532923200, 
                    "formatted_date": "2018-07-30", 
                    "high": 208.74000549316406, 
                    "low": 197.30999755859375, 
                    "open": 199.1300048828125, 
                    "volume": 163787100
                }, 
                {
                    "adjclose": 206.80471801757812, 
                    "close": 207.52999877929688, 
                    "date": 1533528000, 
                    "formatted_date": "2018-08-06", 
                    "high": 209.77999877929688, 
                    "low": 204.52000427246094, 
                    "open": 208.0, 
                    "volume": 121618700
                }
            ], 
            "timeZone": {
                "gmtOffset": -14400
            }
        }, 
        "INTC": {
            "currency": "USD", 
            "eventsData": {
                "dividends": {
                    "2018-08-06": {
                        "amount": 0.3, 
                        "date": 1533562200, 
                        "formatted_date": "2018-08-06"
                    }
                }
            }, 
            "firstTradeDate": {
                "date": 322131600, 
                "formatted_date": "1980-03-17"
            }, 
            "instrumentType": "EQUITY", 
            "prices": [
                {
                    "adjclose": 49.33000183105469, 
                    "close": 49.630001068115234, 
                    "date": 1532923200, 
                    "formatted_date": "2018-07-30", 
                    "high": 49.779998779296875, 
                    "low": 48.0, 
                    "open": 48.060001373291016, 
                    "volume": 76521400
                }, 
                {
                    "adjclose": 48.55471420288086, 
                    "close": 48.849998474121094, 
                    "date": 1533528000, 
                    "formatted_date": "2018-08-06", 
                    "high": 50.599998474121094, 
                    "low": 48.29999923706055, 
                    "open": 48.77000045776367, 
                    "volume": 129482900
                }
            ], 
            "timeZone": {
                "gmtOffset": -14400
            }
        }, 
        "MSFT": {
            "currency": "USD", 
            "eventsData": {}, 
            "firstTradeDate": {
                "date": 511088400, 
                "formatted_date": "1986-03-13"
            }, 
            "instrumentType": "EQUITY", 
            "prices": [
                {
                    "adjclose": 107.62582397460938, 
                    "close": 108.04000091552734, 
                    "date": 1532923200, 
                    "formatted_date": "2018-07-30", 
                    "high": 108.08999633789062, 
                    "low": 104.83999633789062, 
                    "open": 106.02999877929688, 
                    "volume": 68392600
                }, 
                {
                    "adjclose": 108.58214569091797, 
                    "close": 109.0, 
                    "date": 1533528000, 
                    "formatted_date": "2018-08-06", 
                    "high": 110.16000366210938, 
                    "low": 107.55999755859375, 
                    "open": 108.12000274658203, 
                    "volume": 83677700
                }
            ], 
            "timeZone": {
                "gmtOffset": -14400
            }
        }
    }
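For the cross-site comparison described in the question, the nested result can be flattened into simple rows. The sketch below assumes the dictionary has the shape printed above; `iter_closing_prices` is a hypothetical helper, and the sample dictionary is a trimmed copy of that output rather than a live response.

```python
def iter_closing_prices(history):
    """Yield (ticker, formatted_date, close) for every price bar."""
    for ticker in sorted(history):
        for bar in history[ticker].get("prices", []):
            yield ticker, bar["formatted_date"], bar["close"]


# Trimmed copy of the result shown above.
sample = {
    "AAPL": {
        "prices": [
            {"formatted_date": "2018-07-30", "close": 207.99000549316406},
            {"formatted_date": "2018-08-06", "close": 207.52999877929688},
        ]
    },
    "MSFT": {
        "prices": [
            {"formatted_date": "2018-07-30", "close": 108.04000091552734},
        ]
    },
}

for ticker, day, close in iter_closing_prices(sample):
    print(ticker, day, round(close, 2))
```

Passing the return value of `get_historical_price_data` in place of `sample` should produce one row per ticker and period, provided the library still returns this structure.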