可以将抓取应用于正在重新计算的此页面吗?

时间:2015-12-25 01:23:10

标签: python web-scraping screen-scraping

我想从下面的页面中获取卫星位置,但我不确定抓取是否合适,因为页面似乎每秒使用一些内部代码更新自身(它在我之后不断更新)断开互联网)。背景信息可以在Space Stackexchange的问题中找到:A nicer way to download the positions of the Orbcomm-2 satellites

我需要一个"快照"四个项目同时

  • UTC时间
  • 纬度
  • 经度
  • 高度

现在我使用屏幕截图和手动输入。由于页面正在更新这些值 - 传统的网络抓取是否会在这里发挥作用?我找到了一个"屏幕抓取"标签,我应该尝试去了解它吗?

我正在寻找获得这四个值的最简单的解决方案,我想知道我是否可以使用urlliburllib2并避免安装新的内容?

示例页面:http://www.satview.org/?sat_id=41186U我需要通过41189U进行41179U(SpaceX刚刚进入轨道的11颗Orbcomm-2卫星)

highlighted screen shot of satview.org

3 个答案:

答案 0 :(得分:3)

一种选择是启动真实浏览器并持续轮询该位置:

import time
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("http://www.satview.org/?sat_id=41186U")


while True:
    location = driver.find_element_by_css_selector("#sat_latlon .texto_track2").text
    latitude, longitude = location.split("\n")[:2]

    print(latitude, longitude)

    time.sleep(1)

示例输出:

(u'-16.57', u'66.63')
(u'-16.61', u'66.67')
...

我们正在使用selenium和Firefox - 针对不同的浏览器有多个驱动程序,包括无头,如PhantomJS

答案 1 :(得分:1)

无需刮伤。只需查看该页面的源html并复制/粘贴javascript代码即可。没有任何位置是远程获取的...它们都是在页面中动态计算的。所以只需抓住代码并自己运行即可!

答案 2 :(得分:1)

Space-Track.org的REST API似乎是为处理这种类型的请求而构建的。在那里拥有帐户后,您甚至可以下载示例脚本(在此处更新)以下载TLES:

# STTest.py
#
# Simple Python app to extract resident space object history data from www.space-track.org into a spreadsheet
# (prior to executing, register for a free personal account at https://www.space-track.org/auth/createAccount)
#
# This program is free software: you can redistribute it and/or modify it
# under the terms of the GNU General Public License as published by the Free Software Foundation,
# either version 3 of the License, or (at your option) any later version.
#
# This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY;
# without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
# See the GNU General Public License for more details.
#
# For full licencing terms, please refer to the GNU General Public License (gpl-3_0.txt) distributed with this release,
# or see http://www.gnu.org/licenses/gpl-3.0.html.

import requests
import json
import xlsxwriter
import time
import datetime
import getpass
import sys

class MyError(Exception):
    def __init___(self, args):
        Exception.__init__(self, "my exception was raised with arguments {0}".format(args))
        self.args = args

# See https://www.space-track.org/documentation for details on REST queries
# "Find Starlinks" query finds all satellites w/ NORAD_CAT_ID > 40000 & OBJECT_NAME matching STARLINK*, 1 line per sat;
# the "OMM Starlink" query gets all Orbital Mean-Elements Messages (OMM) for a specific NORAD_CAT_ID in JSON format.

uriBase = "https://www.space-track.org"
requestLogin = "/ajaxauth/login"
requestCmdAction = "/basicspacedata/query"
requestFindStarlinks = "/class/tle_latest/NORAD_CAT_ID/>40000/ORDINAL/1/OBJECT_NAME/STARLINK~~/format/json/orderby/NORAD_CAT_ID%20asc"
requestOMMStarlink1 = "/class/omm/NORAD_CAT_ID/"
requestOMMStarlink2 = "/orderby/EPOCH%20asc/format/json"

# Parameters to derive apoapsis and periapsis from mean motion (see https://en.wikipedia.org/wiki/Mean_motion)
GM = 398600441800000.0
GM13 = GM ** (1.0 / 3.0)
MRAD = 6378.137
PI = 3.14159265358979
TPI86 = 2.0 * PI / 86400.0

# Log in to personal account obtained by registering for free at https://www.space-track.org/auth/createAccount
print('\nEnter your personal Space-Track.org username (usually your email address for registration):  ')
configUsr = input()
print('Username capture complete.\n')
configPwd = getpass.getpass(prompt='Securely enter your Space-Track.org password (minimum of 15 characters):  ')
# Excel Output file name - e.g. starlink-track.xlsx (note: make it an .xlsx file)
configOut = 'STText.xslx'
siteCred = {'identity': configUsr, 'password': configPwd}

# User xlsxwriter package to write the .xlsx file
print('Creating Microsoft Excel (.xlsx) file to contain outputs...')
workbook = xlsxwriter.Workbook(configOut)
worksheet = workbook.add_worksheet()
z0_format = workbook.add_format({'num_format': '#,##0'})
z1_format = workbook.add_format({'num_format': '#,##0.0'})
z2_format = workbook.add_format({'num_format': '#,##0.00'})
z3_format = workbook.add_format({'num_format': '#,##0.000'})

# write the headers on the spreadsheet
print('Starting to write outputs to Excel file create...')
now = datetime.datetime.now()
nowStr = now.strftime("%m/%d/%Y %H:%M:%S")
worksheet.write('A1', 'Starlink data from' + uriBase + " on " + nowStr)
worksheet.write('A3', 'NORAD_CAT_ID')
worksheet.write('B3', 'SATNAME')
worksheet.write('C3', 'EPOCH')
worksheet.write('D3', 'Orb')
worksheet.write('E3', 'Inc')
worksheet.write('F3', 'Ecc')
worksheet.write('G3', 'MnM')
worksheet.write('H3', 'ApA')
worksheet.write('I3', 'PeA')
worksheet.write('J3', 'AvA')
worksheet.write('K3', 'LAN')
worksheet.write('L3', 'AgP')
worksheet.write('M3', 'MnA')
worksheet.write('N3', 'SMa')
worksheet.write('O3', 'T')
worksheet.write('P3', 'Vel')
wsline = 3


def countdown(t, step=1, msg='Sleeping...'):  # in seconds
    pad_str = ' ' * len('%d' % step)
    for i in range(t, 0, -step):
        sys.stdout.write('{} for the next {} seconds {}\r'.format(msg, i, pad_str))
        sys.stdout.flush()
        time.sleep(step)
    print('Done {} for {} seconds!  {}'.format(msg, t, pad_str))

# use requests package to drive the RESTful session with space-track.org
print('Interfacing with SpaceTrack.org to obtain data...')
with requests.Session() as session:

    # Need to log in first. NOTE:  we get a 200 to say the web site got the data, not that we are logged in.
    resp = session.post(uriBase + requestLogin, data=siteCred)
    if resp.status_code != 200:
        raise MyError(resp, "POST fail on login.")

    # This query picks up all Starlink satellites from the catalog. NOTE: a 401 failure shows you have bad credentials.
    resp = session.get(uriBase + requestCmdAction + requestFindStarlinks)
    if resp.status_code != 200:
        print(resp)
        raise MyError(resp, "GET fail on request for resident space objects.")

    # Use json package to break json-formatted response into a Python structure (a list of dictionaries)
    retData = json.loads(resp.text)
    satCount = len(retData)
    satIds = []
    for e in retData:
        # each e describes the latest elements for one resident space object. We just need the NORAD_CAT_ID...
        catId = e['NORAD_CAT_ID']
        satIds.append(catId)

    # Using our new list of resident space object NORAD_CAT_IDs, we can now get the OMM message
    maxs = 1 # counter for number of sessions we have established without a pause in querying space-track.org
    for s in satIds:
        resp = session.get(uriBase + requestCmdAction + requestOMMStarlink1 + s + requestOMMStarlink2)
        if resp.status_code != 200:
            # If you are getting error 500's here, its probably the rate throttle on the site (20/min and 200/hr)
            # wait a while and retry
            print(resp)
            raise MyError(resp, "GET fail on request for resident space object number " + s + '.')

        # the data here can be quite large, as it's all the elements for every entry for one resident space object
        retData = json.loads(resp.text)
        for e in retData:
            # each element is one reading of the orbital elements for one resident space object
            print("Scanning satellite " + e['OBJECT_NAME'] + " at epoch " + e['EPOCH'] + '...')
            mmoti = float(e['MEAN_MOTION'])
            ecc = float(e['ECCENTRICITY'])
            worksheet.write(wsline, 0, int(e['NORAD_CAT_ID']))
            worksheet.write(wsline, 1, e['OBJECT_NAME'])
            worksheet.write(wsline, 2, e['EPOCH'])
            worksheet.write(wsline, 3, float(e['REV_AT_EPOCH']))
            worksheet.write(wsline, 4, float(e['INCLINATION']), z1_format)
            worksheet.write(wsline, 5, ecc, z3_format)
            worksheet.write(wsline, 6, mmoti, z1_format)
            # do some ninja-fu to flip Mean Motion into Apoapsis and Periapsis, and to get orbital period and velocity
            sma = GM13 / ((TPI86 * mmoti) ** (2.0 / 3.0)) / 1000.0
            apo = sma * (1.0 + ecc) - MRAD
            per = sma * (1.0 - ecc) - MRAD
            smak = sma * 1000.0
            orbT = 2.0 * PI * ((smak ** 3.0) / GM) ** (0.5)
            orbV = (GM / smak) ** (0.5)
            worksheet.write(wsline, 7, apo, z1_format)
            worksheet.write(wsline, 8, per, z1_format)
            worksheet.write(wsline, 9, (apo + per) / 2.0, z1_format)
            worksheet.write(wsline, 10, float(e['RA_OF_ASC_NODE']), z1_format)
            worksheet.write(wsline, 11, float(e['ARG_OF_PERICENTER']), z1_format)
            worksheet.write(wsline, 12, float(e['MEAN_ANOMALY']), z1_format)
            worksheet.write(wsline, 13, sma, z1_format)
            worksheet.write(wsline, 14, orbT, z0_format)
            worksheet.write(wsline, 15, orbV, z0_format)
            wsline = wsline + 1
        maxs = maxs + 1
        print(str(maxs))
        if maxs > 18:
            print('\nSnoozing for 60 secs for rate limit reasons (max 20/min and 200/hr).')
            countdown(60)
            maxs = 1
    session.close()
workbook.close()
print('\nCompleted session.')