How to import complex JSON data into Hive

Date: 2019-07-05 09:41:37

Tags: json dictionary hive

As input, I have a JSON file that I want to import into Hive:

I tried to capture this information with the following complex type:

[
    {
        "code": "ACPBC3P",
        "libelle": "Bon de commande Prime de satisfaction ACP",
        "libelleCourt": "Bon de commande Prime de satisfaction ACP",
        "libelleLong": "Bon de commande Prime de satisfaction ACP",
        "dureeStockage": 24,
        "dureeArchivage": 96,
        "dureeEpuration": 120,
        "dureeStockageReelle": 24,
        "dureeArchivageReelle": 96,
        "dureeEpurationReelle": 120,
        "typologie": {
            "code": "ACP",
            "libelle": "ACP - Activ'projet"
        },
        "sousTypologie": {
            "code": "ACPBC3P",
            "libelle": "BC3P - Bon de commande Prime de satisfaction"
        }
    },
    {
        "code": "ACPC1",
        "libelle": "C1 - Demande d'avoir",
        "libelleCourt": "C1 - Demande d'avoir",
        "libelleLong": "C1 - Demande d'avoir",
        "dureeStockage": 36,
        "dureeArchivage": 84,
        "dureeEpuration": 120,
        "dureeStockageReelle": 36,
        "dureeArchivageReelle": 84,
        "dureeEpurationReelle": 120,
        "typologie": {
            "code": "ACP",
            "libelle": "ACP - Activ'projet"
        },
        "sousTypologie": {
            "code": "ACPC1",
            "libelle": "C1 - Demande d'avoir"
        }
    },
    {
        "code": "ACPC2",
        "libelle": "C2 - Relance fournisseur",
        "libelleCourt": "C2 - Relance fournisseur",
        "libelleLong": "C2 - Relance fournisseur",
        "dureeStockage": 36,
        "dureeArchivage": 84,
        "dureeEpuration": 120,
        "dureeStockageReelle": 36,
        "dureeArchivageReelle": 84,
        "dureeEpurationReelle": 120,
        "typologie": {
            "code": "ACP",
            "libelle": "ACP - Activ'projet"
        },

1 answer:

Answer 0 (score: 1)

You did not mention any error you ran into. In general, there are two things to watch for when using the JSON SerDe.

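  1. org.apache.hadoop.hive.serde2.JsonSerDe does not support JSON data that starts with a square bracket '['.

  2. JsonSerDe is based on the text SerDe, so every newline is treated as the start of a new record.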
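
The two points above can be checked on a file before loading it. The following is a minimal sketch, not part of the original answer: the function name `find_serde_problems` is hypothetical, and tolerating a trailing comma at the end of a line is an assumption that simply mirrors the valid-format example given in this answer.

```python
import json


def find_serde_problems(text):
    """Return a list of reasons why `text` is not one JSON object per line."""
    problems = []

    # Point 1: the data must not start with an opening square bracket.
    if text.lstrip().startswith("["):
        problems.append("data starts with '['")

    # Point 2: every newline starts a new record, so each non-empty line
    # has to be a complete JSON object on its own.
    for number, line in enumerate(text.splitlines(), start=1):
        # Assumption: a trailing comma is ignored, as in the answer's example.
        candidate = line.strip().rstrip(",")
        if not candidate:
            continue
        try:
            value = json.loads(candidate)
        except json.JSONDecodeError:
            problems.append(f"line {number} is not a complete JSON value")
            continue
        if not isinstance(value, dict):
            problems.append(f"line {number} is not a JSON object")

    return problems
```

An empty list means the text has one object per line; a pretty-printed object or a bracketed array produces one entry per offending line.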

Valid format:

{"world_rank": "1","country": "China","population": "1388232694","World": "0.185"},
{"world_rank": "2","country": "India","population": "1342512706","World": "0.179"},
{"world_rank": "3","country": "U.S.","population": "326474013","World": "0.043"},
{"world_rank": "4","country": "Indonesia","population": "263510146","World": "0.035"}

Invalid format 1:

[
{"world_rank": "1","country": "China","population": "1388232694","World": "0.185"},
{"world_rank": "2","country": "India","population": "1342512706","World": "0.179"},
{"world_rank": "3","country": "U.S.","population": "326474013","World": "0.043"},
{"world_rank": "4","country": "Indonesia","population": "263510146","World": "0.035"}
]

Invalid format 2:

  {
    "world_rank": "1",
    "country": "China",
    "population": "1388232694",
    "World": "0.185"
  },
  {
    "world_rank": "2",
    "country": "India",
    "population": "1342512706",
    "World": "0.179"
  },
  {
    "world_rank": "3",
    "country": "U.S.",
    "population": "326474013",
    "World": "0.043"
  },
  {
    "world_rank": "4",
    "country": "Indonesia",
    "population": "263510146",
    "World": "0.035"
  }

The input data should first be preprocessed into the following format and then loaded into the Hive table:

{"code":"ACPBC3P","libelle":"Bon de commande Prime de satisfaction ACP","libelleCourt":"Bon de commande Prime de satisfaction ACP","libelleLong":"Bon de commande Prime de satisfaction ACP","dureeStockage":24,"dureeArchivage":96,"dureeEpuration":120,"dureeStockageReelle":24,"dureeArchivageReelle":96,"dureeEpurationReelle":120,"typologie":{"code":"ACP","libelle":"ACP - Activ'projet"},"sousTypologie":{"code":"ACPBC3P","libelle":"BC3P - Bon de commande Prime de satisfaction"}}
{"code":"ACPC1","libelle":"C1 - Demande d'avoir","libelleCourt":"C1 - Demande d'avoir","libelleLong":"C1 - Demande d'avoir","dureeStockage":36,"dureeArchivage":84,"dureeEpuration":120,"dureeStockageReelle":36,"dureeArchivageReelle":84,"dureeEpurationReelle":120,"typologie":{"code":"ACP","libelle":"ACP - Activ'projet"},"sousTypologie":{"code":"ACPC1","libelle":"C1 - Demande d'avoir"}}
{"code":"ACPC2","libelle":"C2 - Relance fournisseur","libelleCourt":"C2 - Relance fournisseur","libelleLong":"C2 - Relance fournisseur","dureeStockage":36,"dureeArchivage":84,"dureeEpuration":120,"dureeStockageReelle":36,"dureeArchivageReelle":84,"dureeEpurationReelle":120,"typologie":{"code":"ACP","libelle":"ACP - Activ'projet"}}
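
One possible way to do this preprocessing is sketched below in Python. It is not part of the original answer: the function name `json_array_to_lines` is hypothetical, and the sketch assumes the whole input file is a single well-formed JSON array that fits in memory. It writes each element as one compact line, with no enclosing brackets and no trailing commas.

```python
import json


def json_array_to_lines(array_text):
    """Convert the text of a JSON array into one compact JSON object per line."""
    records = json.loads(array_text)
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON array")

    # ensure_ascii=False keeps accented characters readable;
    # separators=(",", ":") drops the optional spaces, and json.dumps never
    # emits raw newlines, so each record stays on exactly one line.
    return "".join(
        json.dumps(record, ensure_ascii=False, separators=(",", ":")) + "\n"
        for record in records
    )
```

The returned text can be written to a file and loaded into the table. Nested objects such as `typologie` are kept as nested JSON objects, which is what the struct columns in the DDL expect.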

DDL:

CREATE TABLE data (
code STRING,
libelle STRING,
libelleCourt STRING,
libelleLong STRING,
dureeStockage INT,
dureeArchivage INT,
dureeEpuration INT,
dureeStockageReelle INT,
dureeArchivageReelle INT,
dureeEpurationReelle INT,
typologie struct<code: STRING, libelle: STRING>,
sousTypologie struct<code: STRING, libelle: STRING>
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.JsonSerDe'
STORED AS TEXTFILE;

Queries to select the data:

select soustypologie.code from data;
select typologie.libelle from data;