Loading a CKAN dataset

Date: 2017-07-31 14:30:21

Tags: python python-3.x ckan

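I have some questions about CKAN.

How do I:

  1. Load a CKAN dataset from the web
  2. Convert this dataset into a pandas DataFrame
  3. Do I need to register on the CKAN site to query data?

I am using Python 3.6.1.

Edit 2: I tried the following code: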

    import urllib.request
    url = 'http://dados.cvm.gov.br/api/action/datastore_search?resource_id=92741280-58fc-446b-b436-931faaca4fb4&q=CNPJ_FUNDO:11.286.399/0001-35'
    fileobj = urllib.request.urlopen(url)
    print(fileobj.read())
    

But the result looks like this:

      

    b'{"help": "http://dados.cvm.gov.br/api/3/action/help_show?name=datastore_search", "success": true, "result": {"resource_id": "92741280-58fc-446b-b436-931faaca4fb4", "fields": [{"type": "int4", "id": "_id"}, {"type": "text", "id": "CNPJ_FUNDO"}, {"type": "timestamp", "id": "DT_COMPTC"}, {"type": "numeric", "id": "VL_TOTAL"}, {"type": "numeric", "id": "VL_QUOTA"}, {"type": "numeric", "id": "VL_PATRIM_LIQ"}, {"type": "numeric", "id": "CAPTC_DIA"}, {"type": "numeric", "id": "RESG_DIA"}, {"type": "numeric", "id": "NR_COTST"}, {"type": "int8", "id": "_full_count"}, {"type": "float4", "id": "rank"}], "q": "CNPJ_FUNDO:11.286.399/0001-35", "records": [], "_links": {"start": "/api/action/datastore_search?q=CNPJ_FUNDO%3A11.286.399%2F0001-35&resource_id=92741280-58fc-446b-b436-931faaca4fb4", "next": "/api/action/datastore_search?q=CNPJ_FUNDO%3A11.286.399%2F0001-35&offset=100&resource_id=92741280-58fc-446b-b436-931faaca4fb4"}}}'

I need a result like this image.

1 answer:

Answer 0 (score: 1)

  1. Load a CKAN dataset from the web

The site you linked has a Python example under the "API de Dados" link:

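# The site's example uses Python 2 syntax (urllib.urlopen, print statement).
# The Python 3 equivalent is:
import urllib.request
url = 'http://dados.cvm.gov.br/api/action/datastore_search?resource_id=92741280-58fc-446b-b436-931faaca4fb4&limit=5&q=title:jones'
fileobj = urllib.request.urlopen(url)
print(fileobj.read())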

  1. Convert this dataset into a pandas DataFrame

As with any JSON dataset, parse it and load it into a DataFrame (nothing here is specific to CKAN):

>>> import pandas as pd
>>> import json
>>> # fileobj must be a fresh, unread response: a second .read() returns b''
>>> response = json.loads(fileobj.read())
>>> pd.DataFrame(response['result']['records'])

  CAPTC_DIA          CNPJ_FUNDO            DT_COMPTC NR_COTST RESG_DIA  \
0      0.00  00.017.024/0001-53  2017-07-03T00:00:00        1     0.00   
1      0.00  00.017.024/0001-53  2017-07-04T00:00:00        1     0.00   
2      0.00  00.017.024/0001-53  2017-07-05T00:00:00        1     0.00   
3      0.00  00.017.024/0001-53  2017-07-06T00:00:00        1     0.00   
4      0.00  00.017.024/0001-53  2017-07-07T00:00:00        1     0.00   

  VL_PATRIM_LIQ         VL_QUOTA    VL_TOTAL  _id  
0    1111752.99  25.249352000000  1111831.24    1  
1    1112087.29  25.256944400000  1112268.26    2  
2    1112415.28  25.264393500000  1112716.06    3  
3    1112754.06  25.272087600000  1113165.75    4  
4    1113096.62  25.279867600000  1113293.06    5  
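
Before building the DataFrame it may be worth checking the `success` flag that the response carries (it is visible in the raw output in the question). This is a minimal sketch using only the standard library. The helper name `extract_records` and the sample payload are made up for illustration, and the `error` key read on failure is an assumption about how CKAN reports errors.

```python
import json


def extract_records(raw):
    """Parse a datastore_search response body and return its list of records.

    `raw` may be bytes (as returned by fileobj.read()) or str.
    """
    payload = json.loads(raw)
    # The response carries a boolean "success" flag next to "result".
    if not payload.get("success"):
        # Assumption: failed calls describe the problem under an "error" key.
        raise ValueError("CKAN request failed: %r" % (payload.get("error"),))
    return payload["result"]["records"]


# Illustrative payload shaped like the API output in the question (made-up values).
sample = b'{"success": true, "result": {"records": [{"_id": 1, "VL_QUOTA": "25.249352000000"}]}}'
records = extract_records(sample)
print(records)
```

The returned list can be passed straight to `pd.DataFrame(records)`.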
  

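Do I need to register on the CKAN site to query data?

You don't need to register on the site you linked; I retrieved the data without registering. I prefer to use the requests library: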

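import requests
import pandas as pd

# Query parameters for datastore_search: which resource to read and how many rows
params = {
    'resource_id': '92741280-58fc-446b-b436-931faaca4fb4',
    'limit': 5,
}
url = 'http://dados.cvm.gov.br/api/action/datastore_search'
# requests builds the query string from params; .json() parses the response body
r = requests.get(url, params=params).json()

df = pd.DataFrame(r['result']['records'])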

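It looks like the limit and offset parameters probably behave like they do in SQL. You may have to convert the columns to numeric/date types; this is not specific to CKAN either, and the pandas documentation covers how to do it.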
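
If `limit` and `offset` do behave as in SQL, reading more than one page could be sketched as below. The names `fetch_page` and `iter_records` are hypothetical helpers, not part of any CKAN client, and the stopping rule (a page shorter than `limit` means the end) is an assumption. The demonstration at the bottom runs against a fake in-memory table, so nothing is sent over the network.

```python
import json
import urllib.parse
import urllib.request

API_URL = 'http://dados.cvm.gov.br/api/action/datastore_search'


def fetch_page(resource_id, limit, offset):
    """Fetch one page of records from datastore_search (performs a network request)."""
    query = urllib.parse.urlencode(
        {'resource_id': resource_id, 'limit': limit, 'offset': offset})
    with urllib.request.urlopen(API_URL + '?' + query) as response:
        return json.loads(response.read())['result']['records']


def iter_records(get_page, limit=100, max_records=None):
    """Yield records page by page until a short page or max_records is reached.

    get_page(limit, offset) must return a list of at most `limit` records.
    """
    offset = 0
    yielded = 0
    while max_records is None or yielded < max_records:
        page = get_page(limit, offset)
        for record in page:
            yield record
            yielded += 1
            if max_records is not None and yielded >= max_records:
                return
        if len(page) < limit:
            return  # a short (or empty) page is taken to be the last one
        offset += limit


# Offline demonstration with a fake 7-row table instead of the real API.
fake_table = [{'_id': i} for i in range(1, 8)]
demo = list(iter_records(lambda limit, offset: fake_table[offset:offset + limit], limit=3))
print(len(demo))  # 7
```

Against the real API, the page function would be something like `lambda limit, offset: fetch_page('92741280-58fc-446b-b436-931faaca4fb4', limit, offset)`, ideally with `max_records` set so a large resource is not downloaded by accident.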

>>> df.describe()
            _id
count  5.000000
mean   3.000000
std    1.581139
min    1.000000
25%    2.000000
50%    3.000000
75%    4.000000
max    5.000000

The conversion is easy:

>>> for col in ('CAPTC_DIA', 'NR_COTST', 'RESG_DIA', 'VL_PATRIM_LIQ', 'VL_QUOTA', 'VL_TOTAL'):
...    df[col] = pd.to_numeric(df[col])

>>> df['DT_COMPTC'] = pd.to_datetime(df['DT_COMPTC'])

>>> df.describe()
       CAPTC_DIA  NR_COTST  RESG_DIA  VL_PATRIM_LIQ   VL_QUOTA      VL_TOTAL  \
count        5.0       5.0       5.0   5.000000e+00   5.000000  5.000000e+00
mean         0.0       1.0       0.0   1.112421e+06  25.264529  1.112655e+06
std          0.0       0.0       0.0   5.303356e+02   0.012045  6.123444e+02
min          0.0       1.0       0.0   1.111753e+06  25.249352  1.111831e+06
25%          0.0       1.0       0.0   1.112087e+06  25.256944  1.112268e+06
50%          0.0       1.0       0.0   1.112415e+06  25.264394  1.112716e+06
75%          0.0       1.0       0.0   1.112754e+06  25.272088  1.113166e+06
max          0.0       1.0       0.0   1.113097e+06  25.279868  1.113293e+06

            _id  
count  5.000000  
mean   3.000000  
std    1.581139  
min    1.000000  
25%    2.000000  
50%    3.000000  
75%    4.000000  
max    5.000000
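
Coming back to the query in the question: `q` is the full-text search parameter of datastore_search, and the request there returned an empty `records` list. The same action also accepts a `filters` parameter holding exact column/value matches, sent as a JSON object. Whether this returns rows for that particular CNPJ on this site has not been verified here, so treat it as something to try. The helper name `build_search_url` is made up for illustration, and the sketch only builds the URL without sending it.

```python
import json
import urllib.parse

API_URL = 'http://dados.cvm.gov.br/api/action/datastore_search'


def build_search_url(resource_id, filters, limit=5):
    """Return a datastore_search URL that filters on exact column values."""
    params = {
        'resource_id': resource_id,
        # In a GET request the filters dict is sent as a JSON-encoded string.
        'filters': json.dumps(filters),
        'limit': limit,
    }
    # urlencode percent-escapes the JSON and the "/" inside the CNPJ.
    return API_URL + '?' + urllib.parse.urlencode(params)


url = build_search_url(
    '92741280-58fc-446b-b436-931faaca4fb4',
    {'CNPJ_FUNDO': '11.286.399/0001-35'},
)
print(url)
```

With requests, the same idea would be to add `'filters': json.dumps({...})` to the `params` dict.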