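Reading data from the OECD API into Python (and pandas)

Time: 2016-11-12 17:50:24

Tags: python api pandas

I am trying to download data from the OECD API (https://data.oecd.org/api/sdmx-json-documentation/) into Python.

So far I have managed to download the data in SDMX-JSON format (and convert it to JSON):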

import requests as rq

OECD_ROOT_URL = "http://stats.oecd.org/SDMX-JSON/data"

def make_OECD_request(dsname, dimensions, params = None, root_dir = OECD_ROOT_URL):
    """Make URL for the OECD API and return a response.

    4 dimensions: location, subject, measure, frequency
    """

    if not params:
        params = {}

    dim_args = ['+'.join(d) for d in dimensions]
    dim_str = '.'.join(dim_args)

    url = root_dir + '/' + dsname + '/' + dim_str + '/all'

    print('Requesting URL ' + url)
    return rq.get(url = url, params = params)

response = make_OECD_request('MEI'
    , [['USA', 'CZE'], [], [], ['M']]
    , {'startTime': '2009-Q1', 'endTime': '2010-Q1'})


if response.status_code == 200:
    data = response.json()  # named "data" to avoid shadowing the json module

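How can I convert the dataset into a pandas.DataFrame? I have tried pandas.read_json() and the pandasdmx library, but I could not get either to work.

4 answers:

Answer 0 (score: 3)

Update

A function that downloads data from the OECD API automatically is now available in my Python library CIF (short for Composite Indicators Framework, installable via pip):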

from cif import cif
data, subjects, measures = cif.createDataFrameFromOECD(countries = ['USA'], dsname = 'MEI', frequency = 'M')

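Original answer:

If you need the data as a pandas DataFrame, it is best to send the request to the OECD with the additional parameter 'dimensionAtObservation': 'AllDimensions'. The resulting JSON is flat: every observation is keyed by the positions of all its dimensions, which makes it easier to convert.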
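To make the flat layout concrete, here is a minimal sketch that decodes the observation keys of such a response. The sample dictionary is hand-made for illustration (an assumption, with only two dimensions), and the helper name flat_observations_to_frame is hypothetical; a real OECD response carries more dimensions and attributes.

```python
import pandas as pd

# Hand-made miniature of a flat SDMX-JSON response (assumption for
# illustration only; real responses have more dimensions and attributes).
sample = {
    "structure": {"dimensions": {"observation": [
        {"id": "LOCATION", "values": [{"id": "USA"}, {"id": "CZE"}]},
        {"id": "TIME_PERIOD", "values": [{"id": "2009-01"}, {"id": "2009-02"}]},
    ]}},
    "dataSets": [{"observations": {
        "0:0": [1.5], "0:1": [1.7], "1:0": [2.5], "1:1": [2.9],
    }}],
}

def flat_observations_to_frame(response_json):
    """Turn flat observations into one row per observation (hypothetical helper)."""
    dims = response_json["structure"]["dimensions"]["observation"]
    rows = []
    for key, obs in response_json["dataSets"][0]["observations"].items():
        # A key such as "1:0" holds one position per dimension, in order.
        positions = [int(p) for p in key.split(":")]
        row = {dim["id"]: dim["values"][pos]["id"] for dim, pos in zip(dims, positions)}
        row["value"] = obs[0]  # the first element is the observed value
        rows.append(row)
    return pd.DataFrame(rows)

frame = flat_observations_to_frame(sample)
wide = frame.pivot(index="TIME_PERIOD", columns="LOCATION", values="value")
print(wide)
```

Splitting the key on ":" avoids relying on a regular expression and keeps each position aligned with its dimension.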

Use the following functions to download the data:

import requests as rq
import pandas as pd
import re

OECD_ROOT_URL = "http://stats.oecd.org/SDMX-JSON/data"

def make_OECD_request(dsname, dimensions, params = None, root_dir = OECD_ROOT_URL):
    # Make URL for the OECD API and return a response
    # 4 dimensions: location, subject, measure, frequency
    # OECD API: https://data.oecd.org/api/sdmx-json-documentation/#d.en.330346

    if not params:
        params = {}

    dim_args = ['+'.join(d) for d in dimensions]
    dim_str = '.'.join(dim_args)

    url = root_dir + '/' + dsname + '/' + dim_str + '/all'

    print('Requesting URL ' + url)
    return rq.get(url = url, params = params)


def create_DataFrame_from_OECD(country = 'CZE', subject = [], measure = [], frequency = 'M',  startDate = None, endDate = None):     
    # Request data from OECD API and return pandas DataFrame

    # country: country code (max 1)
    # subject: list of subjects, empty list for all
    # measure: list of measures, empty list for all
    # frequency: 'M' for monthly and 'Q' for quarterly time series
    # startDate: date in YYYY-MM (2000-01) or YYYY-QQ (2000-Q1) format, None for all observations
    # endDate: date in YYYY-MM (2000-01) or YYYY-QQ (2000-Q1) format, None for all observations

    # Data download

    response = make_OECD_request('MEI'
                                 , [[country], subject, measure, [frequency]]
                                 , {'startTime': startDate, 'endTime': endDate, 'dimensionAtObservation': 'AllDimensions'})

    # Data transformation

    if (response.status_code == 200):

        responseJson = response.json()

        obsList = responseJson.get('dataSets')[0].get('observations')

        if (len(obsList) > 0):

            print('Data downloaded from %s' % response.url)

            timeList = [item for item in responseJson.get('structure').get('dimensions').get('observation') if item['id'] == 'TIME_PERIOD'][0]['values']
            subjectList = [item for item in responseJson.get('structure').get('dimensions').get('observation') if item['id'] == 'SUBJECT'][0]['values']
            measureList = [item for item in responseJson.get('structure').get('dimensions').get('observation') if item['id'] == 'MEASURE'][0]['values']

            obs = pd.DataFrame(obsList).transpose()
            obs.rename(columns = {0: 'series'}, inplace = True)
            obs['id'] = obs.index
            obs = obs[['id', 'series']]
            obs['dimensions'] = obs.apply(lambda x: re.findall(r'\d+', x['id']), axis = 1)
            obs['subject'] = obs.apply(lambda x: subjectList[int(x['dimensions'][1])]['id'], axis = 1)
            obs['measure'] = obs.apply(lambda x: measureList[int(x['dimensions'][2])]['id'], axis = 1)
            obs['time'] = obs.apply(lambda x: timeList[int(x['dimensions'][4])]['id'], axis = 1)
            obs['names'] = obs['subject'] + '_' + obs['measure']

            data = obs.pivot_table(index = 'time', columns = ['names'], values = 'series')

            return data

        else:

            print('Error: No available records, please change parameters')

    else:

        print('Error: %s' % response.status_code)

You can then make requests such as the following:

data = create_DataFrame_from_OECD(country = 'CZE', subject = ['LOCOPCNO'])
data = create_DataFrame_from_OECD(country = 'USA', frequency = 'Q', startDate = '2009-Q1', endDate = '2010-Q1')
data = create_DataFrame_from_OECD(country = 'USA', frequency = 'M', startDate = '2009-01', endDate = '2010-12')
data = create_DataFrame_from_OECD(country = 'USA', frequency = 'M', subject = ['B6DBSI01'])
data = create_DataFrame_from_OECD(country = 'USA', frequency = 'Q', subject = ['B6DBSI01'])

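Answer 1 (score: 2)

The latest version of pandasdmx (pandasdmx.readthedocs.io) fixes the earlier problems with accessing OECD data in sdmx-json.

Answer 2 (score: 1)

You can use code like this to retrieve the data from the source.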

from urllib.request import urlopen
import json

URL = 'http://stats.oecd.org/SDMX-JSON/data/MEI/USA+CZE...M/all'

response = urlopen(URL).read()
responseDict = json.loads(str(response)[2:-1])
print (responseDict.keys())
print (len(responseDict['dataSets']))

Here is the output of this code.

dict_keys(['header', 'structure', 'dataSets'])
1

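If you are curious about the [2:-1] (I would be): calling str on a bytes object returns its printable representation, b'...', so the slice strips the leading b' and the trailing '. For the record, json.loads needs the text itself as input. Note that the representation also escapes backslashes and non-ASCII bytes, so response.decode('utf-8') is the more robust way to get the string.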
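A small sketch of the difference, using a hand-made bytes payload (an assumption for illustration, not actual OECD output) that contains escaped quotes:

```python
import json

# Hand-made payload (assumption): valid JSON whose string contains escaped quotes.
raw = b'{"text": "say \\"hi\\"", "n": 1}'

# Decoding the bytes yields the original text, which json.loads parses fine.
# (Since Python 3.6, json.loads also accepts bytes directly.)
parsed = json.loads(raw.decode("utf-8"))

# Slicing the repr doubles every backslash, so the same payload no longer parses.
try:
    json.loads(str(raw)[2:-1])
    slicing_worked = True
except ValueError:  # json.JSONDecodeError is a subclass of ValueError
    slicing_worked = False

print(parsed["text"], slicing_worked)
```

The slicing approach only works while the payload contains no backslashes, single quotes, or non-ASCII bytes.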

This is the code I used to get to this point.

>>> from urllib.request import urlopen
>>> import json
>>> URL = 'http://stats.oecd.org/SDMX-JSON/data/MEI/USA+CZE...M/all'
>>> response = urlopen(URL).read()
>>> len(response)
9886387
>>> response[:50]
b'{"header":{"id":"1975590b-346a-47ee-8d99-6562ccc11'
>>> str(response[:50])
'b\'{"header":{"id":"1975590b-346a-47ee-8d99-6562ccc11\''
>>> str(response[-50:])
'b\'"uri":"http://www.oecd.org/contact/","text":""}]}}\''

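I know this is not a complete solution, because you still have to work through the dataSets structure to get the data into pandas. It is a list, but you can start exploring it from this sketch.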
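As a starting point for that exploration, here is a minimal sketch that assumes the series-level SDMX-JSON layout, where each entry under dataSets[0]['series'] is keyed by dimension positions and holds its own observations keyed by time position. The sample dictionary is hand-made and the helper name series_to_frame is hypothetical.

```python
import pandas as pd

# Hand-made miniature of a series-level SDMX-JSON response (assumption).
sample = {
    "structure": {"dimensions": {
        "series": [
            {"id": "LOCATION", "values": [{"id": "USA"}, {"id": "CZE"}]},
            {"id": "FREQUENCY", "values": [{"id": "M"}]},
        ],
        "observation": [
            {"id": "TIME_PERIOD", "values": [{"id": "2009-01"}, {"id": "2009-02"}]},
        ],
    }},
    "dataSets": [{"series": {
        "0:0": {"observations": {"0": [1.5], "1": [1.7]}},
        "1:0": {"observations": {"0": [2.5]}},
    }}],
}

def series_to_frame(response_json):
    """Build one column per series, indexed by time period (hypothetical helper)."""
    dims = response_json["structure"]["dimensions"]
    periods = dims["observation"][0]["values"]
    columns = {}
    for key, series in response_json["dataSets"][0]["series"].items():
        # Map each position in the series key to the id of its dimension value.
        ids = [dim["values"][int(pos)]["id"]
               for dim, pos in zip(dims["series"], key.split(":"))]
        columns["_".join(ids)] = {
            periods[int(t)]["id"]: obs[0]
            for t, obs in series["observations"].items()
        }
    # Periods missing from a series become NaN in its column.
    return pd.DataFrame(columns).sort_index()

df = series_to_frame(sample)
print(df)
```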

Answer 3 (score: 0)

The documentation the original question points to does not mention that the API accepts a contentType parameter, which can be set to csv. With that, using the data with pandas becomes trivial.

import pandas as pd

def get_from_oecd(sdmx_query):
    return pd.read_csv(
        f"https://stats.oecd.org/SDMX-JSON/data/{sdmx_query}?contentType=csv"
    )

print(get_from_oecd("MEI_FIN/IRLT.AUS.M/OECD").head())
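If you also want to restrict the time range, one option is to build the query string with urllib. This is only a sketch: the helper name build_oecd_csv_url is hypothetical, and it assumes that the startTime and endTime parameters used in the question can be combined with contentType=csv, which I have not verified.

```python
from urllib.parse import urlencode

OECD_CSV_ROOT = "https://stats.oecd.org/SDMX-JSON/data"

def build_oecd_csv_url(sdmx_query, start_time=None, end_time=None):
    """Build a CSV request URL (hypothetical helper; time filters are an assumption)."""
    params = {"contentType": "csv"}
    if start_time is not None:
        params["startTime"] = start_time
    if end_time is not None:
        params["endTime"] = end_time
    return f"{OECD_CSV_ROOT}/{sdmx_query}?{urlencode(params)}"

url = build_oecd_csv_url("MEI_FIN/IRLT.AUS.M/OECD",
                         start_time="2009-01", end_time="2010-12")
print(url)
```

The resulting URL can be passed straight to pd.read_csv, as in get_from_oecd above.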