Web scraping an interactive chart

Date: 2020-08-27 20:54:10

Tags: javascript python html web-scraping

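I have seen a few posts on this, but every case is evidently unique. I am trying to get the data behind the chart on this page: https://www.tradingview.com/symbols/NASDAQ-VOLI/

It is a fairly obscure market index that is not available through Yahoo, which is where I would normally look (specifically web.DataReader in Python), and this is one of the few sites that appears to have a full set of daily prices.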

<script nonce="XL1oARYPz8X2tvqk">
    window.__defaultsOverrides = {
        'mainSeriesProperties.style': 3,
        'mainSeriesProperties.areaStyle.priceSource': 'close',
        'scalesProperties.lineColor': 'rgba( 76, 82, 94, 1)',
        'scalesProperties.showSymbolLabels': false,
        'scalesProperties.textColor': 'rgba( 76, 82, 94, 1)',
        'scalesProperties.seriesLastValueMode': 0,
        'paneProperties.topMargin': 13,
        'paneProperties.legendProperties.showStudyArguments': false,
        'paneProperties.legendProperties.showStudyTitles': false,
        'paneProperties.legendProperties.showStudyValues': false,
        'paneProperties.legendProperties.showSeriesTitle': false,
        'paneProperties.legendProperties.showSeriesOHLC': true,
        'paneProperties.legendProperties.showLegend': false,
    };
</script>

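This is the element associated with the chart and, frankly, as far as the web development side goes, it is just a script tag (that is, it is not merely a child of the chart element, it is the chart element). I tried searching the JS files for the nonce value XL1oARYPz8X2tvqk, but could not see anything that looked like it was populating the chart.

I assumed I could find the chart data somewhere in the window object, but I do not see it there. Is there a simple way to track this down? I know I could use an interactive scraper, but it seems there must be an easier way.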

1 answer:

Answer 0: (score: 2)

The data is retrieved over a websocket connection at:

wss://data.tradingview.com/socket.io/websocket?from=symbols%2FNASDAQ-VOLI%2F

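You can get the data by sending commands to this websocket and reading what it sends back. All messages sent and received can be inspected in the Chrome developer console: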

[Screenshot: websocket messages in the Chrome developer console]

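The format is a stream of JSON objects (a single response can contain several objects), each preceded by a prefix such as ~m~23~m~. The number in the middle changes from message to message, so the response has to be split with a regular expression.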
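The splitting step can be sketched as follows. This is a minimal illustration, not the site's own parser: the helper names and the sample frame are made up, and it assumes that payloads starting with ~h~ are heartbeats rather than JSON.

```python
import json
import re

# Every payload is preceded by ~m~<number>~m~; one websocket frame may hold several.
FRAME_PREFIX = re.compile(r"~m~\d+~m~")

def split_frames(raw):
    """Return the payload strings contained in one raw websocket frame."""
    return [part for part in FRAME_PREFIX.split(raw) if part]

def decode_frames(raw):
    """Parse the JSON payloads and skip heartbeats (assumed to start with ~h~)."""
    return [
        json.loads(part)
        for part in split_frames(raw)
        if not part.startswith("~h~")
    ]

# Made-up frame holding two JSON payloads and one heartbeat
sample = '~m~23~m~{"m":"a","p":["hello"]}~m~4~m~~h~1~m~16~m~{"m":"b","p":[]}'
print(decode_frames(sample))
```

Splitting on the prefix is simple but would break if a JSON string ever contained text that looks like a prefix; reading the number and slicing that many characters would be the stricter alternative.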

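In the screenshot above you can see many messages being sent (the green ones), but we are only interested in those that use the "chart session token", i.e. the commands that control the chart, not the ones related to quotes.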

The following messages are sent at the start:

{"m": "set_data_quality", "p": ["low"]},
{"m": "set_auth_token", "p": ["unauthorized_user_token"]},
{"m":"chart_create_session","p":[chartSession,""]},
{"m":"resolve_symbol","p":[chartSession,"symbol_1","={\"symbol\":\"NASDAQ:VOLI\",\"adjustment\":\"splits\",\"session\":\"extended\"}"]},
{"m":"create_series","p":[chartSession,"s1","s1","symbol_1","D",300]},
{"m":"switch_timezone","p":[chartSession,"Etc/UTC"]},
{"m":"resolve_symbol","p":[chartSession,"symbol_2","={\"symbol\":\"NASDAQ:VOLI\",\"adjustment\":\"splits\",\"session\":\"extended\"}"]},
{"m":"modify_series","p":[chartSession,"s1","s2","symbol_2","D,12M"]},
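As a sketch of how this list could be built for an arbitrary symbol and then framed for sending: the function names and parameters below are hypothetical, and the framing assumes that the number in the prefix is the length of the serialized payload, which is consistent with the keep-alive frame ~m~4~m~~h~1.

```python
import json

def frame(message):
    """Serialize a message and prefix it with ~m~<length>~m~ (assumed framing)."""
    payload = json.dumps(message, separators=(",", ":"))
    return f"~m~{len(payload)}~m~{payload}"

def build_init_messages(chart_session, symbol, interval="D", bars=300):
    """Rebuild the initial message list for a given session token and symbol."""
    # resolve_symbol takes "=" followed by a JSON-encoded description of the symbol
    spec = "=" + json.dumps(
        {"symbol": symbol, "adjustment": "splits", "session": "extended"},
        separators=(",", ":"),
    )
    return [
        {"m": "set_data_quality", "p": ["low"]},
        {"m": "set_auth_token", "p": ["unauthorized_user_token"]},
        {"m": "chart_create_session", "p": [chart_session, ""]},
        {"m": "resolve_symbol", "p": [chart_session, "symbol_1", spec]},
        {"m": "create_series", "p": [chart_session, "s1", "s1", "symbol_1", interval, bars]},
        {"m": "switch_timezone", "p": [chart_session, "Etc/UTC"]},
        {"m": "resolve_symbol", "p": [chart_session, "symbol_2", spec]},
        {"m": "modify_series", "p": [chart_session, "s1", "s2", "symbol_2", "D,12M"]},
    ]

messages = build_init_messages("cs_Dj1BV8ochLL0", "NASDAQ:VOLI")
for message in messages:
    print(frame(message))
```

Whether other symbols accept exactly the same parameters has not been verified here; only NASDAQ:VOLI appears in the captured traffic.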

After that, you receive a response containing a message whose "m" value is timescale_update, which carries the chart data among other things.
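Pulling the series out of such a message might look like the sketch below. The sample payload is invented for illustration (the numbers are not real VOLI data), the function name is hypothetical, and it assumes, as the indices used in this answer suggest, that each bar's "v" array starts with a Unix timestamp and holds the plotted value at index 4.

```python
from datetime import datetime, timezone

def extract_values(message, series_id="s1"):
    """Return (datetime, value) pairs from a timescale_update message, else []."""
    if message.get("m") != "timescale_update":
        return []
    bars = message["p"][1].get(series_id, {}).get("s", [])
    return [
        (datetime.fromtimestamp(bar["v"][0], tz=timezone.utc), bar["v"][4])
        for bar in bars
    ]

# Hypothetical payload with made-up numbers, shaped like the messages described above
sample = {
    "m": "timescale_update",
    "p": [
        "cs_example",
        {"s1": {"s": [
            {"v": [1598486400.0, 10.0, 11.0, 9.5, 10.5, 0.0]},
            {"v": [1598572800.0, 10.5, 12.0, 10.0, 11.5, 0.0]},
        ]}},
    ],
}

for when, value in extract_values(sample):
    print(when.date(), value)
```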

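The following script opens the websocket connection, sends the initial messages needed to get the chart data, and uses matplotlib to build a chart that is saved as a PNG: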

import asyncio
import json
import re
import urllib.parse
from datetime import datetime

import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import websockets

wsParams = {
    "from": "symbols/NASDAQ-VOLI/"
}
websocketUri = f"wss://data.tradingview.com/socket.io/websocket?{urllib.parse.urlencode(wsParams)}"

# Hardcoded chart session token; any value seems to be accepted
chartSession = "cs_Dj1BV8ochLL0"

initMessages = [
    {"m": "set_data_quality", "p": ["low"]},
    {"m": "set_auth_token", "p": ["unauthorized_user_token"]},
    {"m":"chart_create_session","p":[chartSession,""]},
    {"m":"resolve_symbol","p":[chartSession,"symbol_1","={\"symbol\":\"NASDAQ:VOLI\",\"adjustment\":\"splits\",\"session\":\"extended\"}"]},
    {"m":"create_series","p":[chartSession,"s1","s1","symbol_1","D",300]},
    {"m":"switch_timezone","p":[chartSession,"Etc/UTC"]},
    {"m":"resolve_symbol","p":[chartSession,"symbol_2","={\"symbol\":\"NASDAQ:VOLI\",\"adjustment\":\"splits\",\"session\":\"extended\"}"]},
    {"m":"modify_series","p":[chartSession,"s1","s2","symbol_2","D,12M"]},
]

def strip(text):
    # Split a raw frame on the ~m~<number>~m~ prefixes and parse the JSON payloads.
    # Heartbeat payloads (~h~<number>) carry no data and are skipped.
    parts = re.split(r'~m~\d+~m~', text)
    return [json.loads(t) for t in parts if t and not t.startswith('~h~')]

def unstrip(message):
    # Serialize first: the prefix must hold the length of the JSON text,
    # not the number of keys in the dict.
    payload = json.dumps(message, separators=(',', ':'))
    return f"~m~{len(payload)}~m~{payload}"

async def init(websocket):
    # Send the initial commands that create the chart session and request the series
    for m in initMessages:
        await websocket.send(unstrip(m))

async def startReceiving(websocket):
    # The server speaks first; print its greeting, then send the init messages
    data = await websocket.recv()
    print(strip(data))
    await init(websocket)
    while True:
        data = await websocket.recv()
        payloads = strip(data)
        for p in payloads:
            # Not every payload has an "m" key, so use .get()
            if p.get("m") == "timescale_update":
                bars = p["p"][1]["s1"]["s"]
                # v[0] is the bar timestamp, v[4] the value that gets plotted
                dates = [datetime.fromtimestamp(t["v"][0]) for t in bars]
                values = [t["v"][4] for t in bars]
                plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%d/%m/%Y'))
                plt.gca().xaxis.set_major_locator(mdates.DayLocator(interval=25))
                plt.plot(dates, values)
                plt.gcf().autofmt_xdate()
                plt.ylabel('VOLI Index Chart')
                plt.xlabel('Date')
                plt.savefig("voli.png")
        print(payloads)

async def websocketConnect():
    # The Origin header is required; without it the server answers 403
    async with websockets.connect(websocketUri, origin="https://www.tradingview.com") as websocket:
        print('started websocket')
        await startReceiving(websocket)

asyncio.run(websocketConnect())


The generated chart:

[Image: chart generated with matplotlib]

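Some notes:

  • To connect to the websocket server successfully, you need to send an Origin header with the correct value; otherwise the server returns 403

  • The chart session token is hardcoded here, but it can be anything; on the website it appears to be generated randomly (apparently to match a pattern)

  • I have removed all the websocket messages related to quotes. You need to add such messages (to the init messages) to receive notifications of "real-time" value changes: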

    {"m":"quote_create_session","p":["qs_QrddDPrS65gC"]}
    
    {"m":"quote_add_symbols","p":["qs_QrddDPrS65gC","NASDAQ:VOLI",{"flags":["force_permission"]}]}
    

Note that quote_create_session needs a new session token (different from the chart session token). You will then receive the notifications through the websocket
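A sketch of generating separate tokens for the two sessions. The token shape (a cs_ or qs_ prefix followed by 12 alphanumeric characters) is only inferred from the two examples in this answer, and the helper name is hypothetical; since any value seems to be accepted, the exact shape may not matter.

```python
import random
import string

def make_session_token(prefix, length=12):
    """Build a token such as cs_Dj1BV8ochLL0 (shape inferred from the examples)."""
    alphabet = string.ascii_letters + string.digits
    return prefix + "_" + "".join(random.choices(alphabet, k=length))

chart_session = make_session_token("cs")
quote_session = make_session_token("qs")  # must differ from the chart session token

# Quote messages to append to the init messages
quote_messages = [
    {"m": "quote_create_session", "p": [quote_session]},
    {"m": "quote_add_symbols",
     "p": [quote_session, "NASDAQ:VOLI", {"flags": ["force_permission"]}]},
]
print(chart_session, quote_session)
```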

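  • If you want to receive notifications, be aware that there is a keep-alive mechanism that automatically closes the websocket if you have not sent any message for some amount of time. You just need to send the following command periodically: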

    ~m~4~m~~h~1
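The periodic send could be sketched as a background task like the one below. The 10-second default is an arbitrary assumption, since the actual timeout is not known, and the demo uses a stand-in send function instead of a real connection.

```python
import asyncio

HEARTBEAT = "~m~4~m~~h~1"

async def keep_alive(send, interval=10.0):
    """Call send(HEARTBEAT) every `interval` seconds until the task is cancelled."""
    while True:
        await asyncio.sleep(interval)
        await send(HEARTBEAT)

async def demo():
    # Stand-in for websocket.send that records what would have been sent
    sent = []

    async def fake_send(frame):
        sent.append(frame)

    task = asyncio.create_task(keep_alive(fake_send, interval=0.01))
    await asyncio.sleep(0.05)
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return sent

sent_frames = asyncio.run(demo())
print(sent_frames)
```

With a real connection, the task would be started next to the receive loop, for example with asyncio.create_task(keep_alive(websocket.send)), and cancelled when the connection closes.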