Scraping blocks of data from a web page API

Time: 2018-08-11 16:11:14

Tags: python json regex beautifulsoup

I am trying to collect the blocks of data that make up a small table on a web page. Please see my code below.

    import requests
    import re
    import json
    import sys
    import os
    import time
    from lxml import html, etree
    from bs4 import BeautifulSoup
    import pandas as pd

    url = 'https://www.investing.com/instruments/OptionsDataAjax'
    params = {'pair_id': 525,  # SPX
              'date': 1536555600,  # Unix timestamp: 2018-09-10 05:00 UTC
              'strike': 'all',  # all prices
              'callspots': 'calls',  # 'call_andputs',
              'type': 'analysis',  # webpage viewer
              'bringData': 'true',
              }
    headers = {'User-Agent': 'Chrome/39.0.2171.95 Safari/537.36'}

    # Print text in red / green using ANSI escape codes.
    def R(text, end='\n'): print('\033[0;31m{}\033[0m'.format(text), end=end)
    def G(text, end='\n'): print('\033[0;32m{}\033[0m'.format(text), end=end)

    page = requests.get(url, params=params, headers=headers)
    if page.status_code != 200:
        R('ERROR CODE:{}'.format(page.status_code))
        G('Problem in connection!')
        sys.exit(1)  # stop here; the rest of the script needs a valid response
    else:
        G('OK')
    soup = BeautifulSoup(page.content, 'lxml')
    spdata = json.loads(soup.text)
    print(spdata['data'])
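The `date` value in `params` looks like a Unix timestamp (seconds since 1970-01-01 UTC). Assuming the site reads it that way, this small sketch shows how to check which day a value denotes and how to build one for another date. The variable names are only illustrative.

```python
from datetime import datetime, timezone

# Value of params['date'] from the question.
stamp = 1536555600

# Interpret the timestamp in UTC to see which day it refers to.
day = datetime.fromtimestamp(stamp, tz=timezone.utc)
print(day.isoformat())  # 2018-09-10T05:00:00+00:00

# Going the other way: build a timestamp from a chosen UTC date and time.
rebuilt = int(datetime(2018, 9, 10, 5, 0, tzinfo=timezone.utc).timestamp())
print(rebuilt == stamp)  # True
```

The resulting day, 2018-09-10, agrees with the `180910` part of the option symbols in the returned data.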

The result, spdata['data'], is a str, and I only want to extract blocks like the one below from it. The str contains many data blocks in this same format.

    SymbolSPY180910C00250000
    Delta0.9656
    Imp Vol0.2431
    Bid33.26
    Gamma0.0039
    Theoretical33.06
    Ask33.41
    Theta-0.0381
    Intrinsic Value33.13
    Volume0
    Vega0.0617
    Time Value-33.13
    Open Interest0
    Rho0.1969
    Delta / Theta-25.3172

I am using json and BeautifulSoup here. Regular expressions might help, but I do not know re well. Any approach that gets the result is appreciated. Thanks.

1 answer:

Answer 0 (score: 1)

Add this code after your code:

    regex = r"((SymbolSPY[1-9]*):?\s*)(.*?)\n[^\S\n]*\n[^\S\n]*"
    for match in re.finditer(regex, spdata['data'], re.MULTILINE | re.DOTALL):
        for line in match.group().splitlines():
            print(line.strip())

Output:

OK
SymbolSPY180910C00245000
Delta0.9682
Imp Vol0.2779
Bid38.26
Gamma0.0032
Theoretical38.05
Ask38.42
Theta-0.0397
Intrinsic Value38.13
Volume0
Vega0.0579
Time Value-38.13
Open Interest0
Rho0.1934
Delta / Theta-24.3966


SymbolSPY180910P00245000
Delta-0.0262
Imp Vol0.2652
...
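If the goal is a table rather than printed lines, each matched block can be turned into a dict. The sketch below reuses the pattern from the answer on `sample`, a hypothetical stand-in for `spdata['data']` built from a few of the values shown above, since the real string may use different whitespace. `parse_block` and `FIELD` are illustrative names, and the split assumes every line ends with its number and that labels contain no digits.

```python
import re

# Hypothetical stand-in for spdata['data']; values copied from the output above.
sample = (
    "SymbolSPY180910C00245000\n"
    "Delta0.9682\n"
    "Imp Vol0.2779\n"
    "Delta / Theta-24.3966\n"
    "\n"
    "SymbolSPY180910P00245000\n"
    "Delta-0.0262\n"
    "Imp Vol0.2652\n"
    "\n"
)

# Same pattern as in the answer: one match per block, ending at a blank line.
BLOCK = re.compile(
    r"((SymbolSPY[1-9]*):?\s*)(.*?)\n[^\S\n]*\n[^\S\n]*",
    re.MULTILINE | re.DOTALL,
)

# A label followed by a trailing number, e.g. "Time Value-33.13".
FIELD = re.compile(r"^(?P<label>.*?)(?P<value>-?\d+(?:\.\d+)?)$")


def parse_block(block_text):
    """Turn one block of 'LabelValue' lines into a dict."""
    record = {}
    for line in block_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip the blank separator lines
        if line.startswith("Symbol"):
            # The symbol itself contains digits, so handle it separately.
            record["Symbol"] = line[len("Symbol"):]
            continue
        found = FIELD.match(line)
        if found:
            record[found.group("label").strip()] = float(found.group("value"))
    return record


records = [parse_block(m.group()) for m in BLOCK.finditer(sample)]
for record in records:
    print(record)
```

With pandas installed, `pd.DataFrame(records)` should then give one row per option contract.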