json_normalize a very nested JSON data structure

Asked: 2018-03-06 23:30:16

Tags: python json pandas

I have the JSON data structure below, and I would like to turn it into a Pandas DataFrame.

json_normalize from pandas.io.json works fine except for the 'tunnels-in' and 'tunnels-out' sections. Those are lists that contain nested dictionaries. I have tried just about every json_normalize example format I have seen, with no success. The only thing I can get working is the following.

json_normalize(json_dict['data']['viptela-oper-vpn']['dpi']['flows'])

As soon as I add arguments to describe the deeper structure, I can't get past the errors. I looked into another way of doing this, documented at the link below (sketched just after it), which seems to work but doesn't appear to handle any notion of vertical structure. Here we have a list of flows, and I want each flow flattened into separate columns, with each flow's values on its own row of those columns.

https://towardsdatascience.com/flattening-json-objects-in-python-f5343c794b10
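Roughly, the per-flow flattening from that article looks something like the sketch below (my own paraphrase of the idea, not the article's exact code, and it assumes json_dict already holds the parsed structure). It produces one wide row per flow, with columns such as tunnels-in_0_local-tloc_ip, rather than the per-tunnel rows I am after:

import pandas as pd

# recursive flatten in the spirit of the linked article (paraphrased sketch)
def flatten_json(nested, prefix=''):
    flat = {}
    if isinstance(nested, dict):
        for key, value in nested.items():
            flat.update(flatten_json(value, prefix + key + '_'))
    elif isinstance(nested, list):
        for i, value in enumerate(nested):
            flat.update(flatten_json(value, prefix + str(i) + '_'))
    else:
        flat[prefix[:-1]] = nested
    return flat

flows = json_dict['data']['viptela-oper-vpn']['dpi']['flows']
df = pd.DataFrame([flatten_json(flow) for flow in flows])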

Does anyone know a way to use the normalize function while preserving the nested lists of dictionaries? As you can see, not every flow has tunnels-in/tunnels-out, which is another complication I have hit while trying to flatten this myself.

Any ideas are much appreciated.

Many thanks,

Data structure

{
  "data": {
    "viptela-oper-vpn": {
      "dpi": {
        "flows": [
          {
            "vpn-id": 1,
            "src-ip": "1.1.0.200",
            "dst-ip": "1.3.0.200",
            "src-port": 65369,
            "dst-port": 1967,
            "proto": "udp",
            "application": "udp",
            "family": "Network Service",
            "active-since": "2018-02-28T22:51:54+00:00",
            "packets": 2,
            "octets": 132,
            "tunnels-in": [
              {
                "index_me": 1,
                "local-tloc": {
                  "ip": "1.1.1.104",
                  "color": "private2",
                  "encap": "ipsec"
                },
                "remote-tloc": {
                  "ip": "1.1.1.103",
                  "color": "private2",
                  "encap": "ipsec"
                },
                "packets": 1,
                "octets": 80,
                "start-time": "2018-02-28T22:51:54+00:00"
              }
            ],
            "tunnels-out": [
              {
                "index_me": 1,
                "local-tloc": {
                  "ip": "1.1.1.104",
                  "color": "private2",
                  "encap": "ipsec"
                },
                "remote-tloc": {
                  "ip": "1.1.1.103",
                  "color": "mpls",
                  "encap": "ipsec"
                },
                "packets": 1,
                "octets": 52,
                "start-time": "2018-02-28T22:51:54+00:00"
              }
            ]
          },
          {
            "vpn-id": 1,
            "src-ip": "1.1.0.200",
            "dst-ip": "1.3.0.200",
            "src-port": 65529,
            "dst-port": 1967,
            "proto": "udp",
            "application": "udp",
            "family": "Network Service",
            "active-since": "2018-02-28T22:52:03+00:00",
            "packets": 2,
            "octets": 132,
            "tunnels-in": [
              {
                "index_me": 1,
                "local-tloc": {
                  "ip": "1.1.1.104",
                  "color": "private2",
                  "encap": "ipsec"
                },
                "remote-tloc": {
                  "ip": "1.1.1.103",
                  "color": "private2",
                  "encap": "ipsec"
                },
                "packets": 1,
                "octets": 80,
                "start-time": "2018-02-28T22:52:03+00:00"
              }
            ],
            "tunnels-out": [
              {
                "index_me": 1,
                "local-tloc": {
                  "ip": "1.1.1.104",
                  "color": "private2",
                  "encap": "ipsec"
                },
                "remote-tloc": {
                  "ip": "1.1.1.103",
                  "color": "mpls",
                  "encap": "ipsec"
                },
                "packets": 1,
                "octets": 52,
                "start-time": "2018-02-28T22:52:03+00:00"
              }
            ]
          },
          {
            "vpn-id": 512,
            "src-ip": "69.26.45.133",
            "dst-ip": "198.19.200.2",
            "src-port": 11895,
            "dst-port": 22,
            "proto": "tcp",
            "application": "ssh",
            "family": "Encrypted",
            "active-since": "2018-02-28T22:42:15+00:00",
            "packets": 1498,
            "octets": 797954
          },
          {
            "vpn-id": 512,
            "src-ip": "198.19.200.2",
            "dst-ip": "69.26.45.139",
            "src-port": 514,
            "dst-port": 514,
            "proto": "udp",
            "application": "syslog",
            "family": "Application Service",
            "active-since": "2018-02-28T22:50:59+00:00",
            "packets": 8,
            "octets": 2820
          }
        ]
      }
    }
  }
}

Function so far

import json
import re
from pandas.io.json import json_normalize

def myprint(file):
    with open(file) as f:
        file_var = f.read()
    # pull the outermost {...} block out of the surrounding CLI output
    extract_json_dict = re.compile('(\\n{\\n)(.*)(\\n}\\n)', re.DOTALL)
    json_string = extract_json_dict.search(file_var).group(0)
    json_dict = json.loads(json_string)
    # 'tunnels-in' / 'tunnels-out' come through as raw lists of dicts
    df = json_normalize(json_dict['data']['viptela-oper-vpn']['dpi']['flows'])
    return df

Columns that show up right now

['active-since', 'application', 'dst-ip', 'dst-port', 'family', 'octets', 'packets', 'proto', 'src-ip', 'src-port', 'tunnels-in', 'tunnels-out', 'vpn-id']

Columns I would like to add in addition to what is shown above

Essentially, flatten the two columns whose values are lists into additional columns, and put each flow's values on its own row (a rough sketch follows the column list below).

['tunnels-in_index_me', 'tunnels-in_remote-tloc_ip', 'tunnels-in_remote-tloc_color', 'tunnels-in_remote-tloc_encap', 'tunnels-out_remote-tloc_ip']
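Something along these lines is what I am imagining with json_normalize itself (untested sketch; the 'flow-index' key is something I would add myself, it is not in the data, and the setdefault calls make sure flows without tunnels simply contribute no tunnel rows):

from pandas.io.json import json_normalize

flows = json_dict['data']['viptela-oper-vpn']['dpi']['flows']

# tag each flow and make sure the tunnel lists exist
for i, flow in enumerate(flows):
    flow['flow-index'] = i               # synthetic key, not in the original data
    flow.setdefault('tunnels-in', [])
    flow.setdefault('tunnels-out', [])

flows_df = json_normalize(flows, sep='_')
tunnels_in_df = json_normalize(flows, record_path='tunnels-in',
                               meta=['flow-index'],
                               record_prefix='tunnels-in_', sep='_')
tunnels_in_df['flow-index'] = tunnels_in_df['flow-index'].astype(int)  # meta comes back as object

# one row per flow/tunnel pair; flows without tunnels keep NaN tunnel columns
combined = (flows_df.drop(columns=['tunnels-in', 'tunnels-out'])
                    .merge(tunnels_in_df, on='flow-index', how='left'))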

Update 3/8/2018

For the columns that contain a list of dictionaries, this seems to be what I want to do, but it needs the [0] index for the flow number. I don't know whether anyone has an idea of how to make this approach work across all the flows rather than one at a time. If that can be done, I should be able to concat or merge based on the index number. It would be nicer to do the whole thing with a single json_normalize line, but besides the [0] problem, that also seems to hit the additional issue that not every flow has the nested lists of dictionaries. I will keep trying, but any ideas are appreciated.

json_normalize(json_dict['data']['viptela-oper-vpn']['dpi']['flows'][0]['tunnels-in'])
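One direction I am considering for running that over every flow while keeping the flow number is the sketch below (untested; it skips flows that have no 'tunnels-in' at all and assumes the imports from the question):

import pandas as pd
from pandas.io.json import json_normalize

flows = json_dict['data']['viptela-oper-vpn']['dpi']['flows']

# normalize each flow's tunnels-in on its own, keyed by the flow number
per_flow = {i: json_normalize(flow['tunnels-in'])
            for i, flow in enumerate(flows) if 'tunnels-in' in flow}

# stack them with the flow number as an index level, then expose it as a column
tunnels_in_df = (pd.concat(per_flow, names=['flow', 'tunnel'])
                   .reset_index(level='flow'))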

1 Answer:

Answer 0 (score: 0)

I'm struggling with this right now as well. I got around the selection by index [0] by pulling the column into another pandas DataFrame. So:

df = json_normalize(pd.DataFrame(list(json_dict['data']['viptela-oper-vpn']['dpi']['flows']))['tunnels-in'])
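Along the same lines, a rough sketch (untested; assumes import pandas as pd and the json_normalize import from the question) that first drops the flows without a 'tunnels-in' entry, since those show up as NaN in the column:

flows_df = pd.DataFrame(json_dict['data']['viptela-oper-vpn']['dpi']['flows'])
# flows without tunnels come through as NaN in this column; drop them, then
# concatenate the remaining per-flow lists into one flat list of tunnel dicts
tunnels_in = flows_df['tunnels-in'].dropna().sum()
df = json_normalize(tunnels_in)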