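I have the following JSON data structure, which I want to turn into a pandas DataFrame.
json_normalize from pandas.io.json works fine except for the 'tunnels-in' and 'tunnels-out' sections. Those are lists that contain nested dictionaries. I have tried the format of nearly every json_normalize example I have seen, without success. About all I can get to work is the following.
json_normalize(json_dict['data']['viptela-oper-vpn']['dpi']['flows'])
As soon as I add arguments to describe the other structures, I get errors I cannot resolve. I looked into another way of doing this, documented at the link below, which seems to work, but it does not appear to handle any notion of vertical structure. Here we have a list of flows, and I want each flow flattened into separate columns, with each flow's values in a different row of the same column.
https://towardsdatascience.com/flattening-json-objects-in-python-f5343c794b10
Does anyone know a way to use the normalize function while preserving the nested lists of dictionaries? As you can see, not every flow has tunnels-in/tunnels-out, which is a further complication when I try to flatten this myself.
Any ideas are much appreciated.
Many thanks,
Data structure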
{
"data": {
"viptela-oper-vpn": {
"dpi": {
"flows": [
{
"vpn-id": 1,
"src-ip": "1.1.0.200",
"dst-ip": "1.3.0.200",
"src-port": 65369,
"dst-port": 1967,
"proto": "udp",
"application": "udp",
"family": "Network Service",
"active-since": "2018-02-28T22:51:54+00:00",
"packets": 2,
"octets": 132,
"tunnels-in": [
{
"index_me": 1,
"local-tloc": {
"ip": "1.1.1.104",
"color": "private2",
"encap": "ipsec"
},
"remote-tloc": {
"ip": "1.1.1.103",
"color": "private2",
"encap": "ipsec"
},
"packets": 1,
"octets": 80,
"start-time": "2018-02-28T22:51:54+00:00"
}
],
"tunnels-out": [
{
"index_me": 1,
"local-tloc": {
"ip": "1.1.1.104",
"color": "private2",
"encap": "ipsec"
},
"remote-tloc": {
"ip": "1.1.1.103",
"color": "mpls",
"encap": "ipsec"
},
"packets": 1,
"octets": 52,
"start-time": "2018-02-28T22:51:54+00:00"
}
]
},
{
"vpn-id": 1,
"src-ip": "1.1.0.200",
"dst-ip": "1.3.0.200",
"src-port": 65529,
"dst-port": 1967,
"proto": "udp",
"application": "udp",
"family": "Network Service",
"active-since": "2018-02-28T22:52:03+00:00",
"packets": 2,
"octets": 132,
"tunnels-in": [
{
"index_me": 1,
"local-tloc": {
"ip": "1.1.1.104",
"color": "private2",
"encap": "ipsec"
},
"remote-tloc": {
"ip": "1.1.1.103",
"color": "private2",
"encap": "ipsec"
},
"packets": 1,
"octets": 80,
"start-time": "2018-02-28T22:52:03+00:00"
}
],
"tunnels-out": [
{
"index_me": 1,
"local-tloc": {
"ip": "1.1.1.104",
"color": "private2",
"encap": "ipsec"
},
"remote-tloc": {
"ip": "1.1.1.103",
"color": "mpls",
"encap": "ipsec"
},
"packets": 1,
"octets": 52,
"start-time": "2018-02-28T22:52:03+00:00"
}
]
},
{
"vpn-id": 512,
"src-ip": "69.26.45.133",
"dst-ip": "198.19.200.2",
"src-port": 11895,
"dst-port": 22,
"proto": "tcp",
"application": "ssh",
"family": "Encrypted",
"active-since": "2018-02-28T22:42:15+00:00",
"packets": 1498,
"octets": 797954
},
{
"vpn-id": 512,
"src-ip": "198.19.200.2",
"dst-ip": "69.26.45.139",
"src-port": 514,
"dst-port": 514,
"proto": "udp",
"application": "syslog",
"family": "Application Service",
"active-since": "2018-02-28T22:50:59+00:00",
"packets": 8,
"octets": 2820
}
]
}
}
}
}
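Function so far
import json
import re
from pandas.io.json import json_normalize  # on pandas >= 1.0 use pd.json_normalize instead

def myprint(file):
    # Read the whole file; the JSON document is embedded in other text.
    with open(file) as f:
        file_var = f.read()
    # Match from the first line holding only "{" to the last line holding only "}".
    extract_json_dict = re.compile(r'(\n\{\n)(.*)(\n\}\n)', re.DOTALL)
    json_string = extract_json_dict.search(file_var).group(0)
    json_dict = json.loads(json_string)
    # Flatten the list of flows; list-valued fields stay as single columns.
    df = json_normalize(json_dict['data']['viptela-oper-vpn']['dpi']['flows'])
    return df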
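Columns currently displayed
['active-since', 'application', 'dst-ip', 'dst-port', 'family', 'octets', 'packets', 'proto', 'src-ip', 'src-port', 'tunnels-in', 'tunnels-out', 'vpn-id']
Columns I would like to add to those shown above
Essentially, "flatten" the two columns whose values are lists into additional columns, and put each flow's values in its own row.
['tunnels-in_index_me', 'tunnels-in_remote-tloc_ip', 'tunnels-in_remote-tloc_color', 'tunnels-in_remote-tloc_encap', 'tunnels-out_remote-tloc_ip']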
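One way to get columns named like the ones listed above is json_normalize's record_path, meta, record_prefix and sep arguments. The following is only a sketch under some assumptions: it uses pd.json_normalize (pandas 1.0 or later), a trimmed stand-in for the flows list, and two hypothetical names, tunnels_frame and flow_no, that are not part of the original data. Because record_path expects the key on every element, flows without tunnels get an empty list first, and a left merge brings them back as rows with NaN in the tunnel columns.

```python
import pandas as pd

# Trimmed stand-in for json_dict['data']['viptela-oper-vpn']['dpi']['flows'].
flows = [
    {
        "vpn-id": 1,
        "src-port": 65369,
        "tunnels-in": [{
            "index_me": 1,
            "remote-tloc": {"ip": "1.1.1.103", "color": "private2", "encap": "ipsec"},
        }],
        "tunnels-out": [{
            "index_me": 1,
            "remote-tloc": {"ip": "1.1.1.103", "color": "mpls", "encap": "ipsec"},
        }],
    },
    {"vpn-id": 512, "src-port": 11895},  # flow without any tunnels
]


def tunnels_frame(flows, key):
    """Return one row per tunnel under `key`, tagged with its flow number.

    `tunnels_frame` and `flow_no` are hypothetical helper names for this sketch.
    """
    # record_path needs the key on every element, so default it to an empty list.
    prepared = [
        {"flow_no": i, key: flow.get(key, [])} for i, flow in enumerate(flows)
    ]
    frame = pd.json_normalize(
        prepared, record_path=key, meta=["flow_no"],
        record_prefix=key + "_", sep="_",
    )
    # Meta columns come back with object dtype; cast so the merge keys match.
    frame["flow_no"] = frame["flow_no"].astype(int)
    return frame


# Flow-level columns, minus the two list-valued columns.
base = pd.json_normalize(flows).drop(
    columns=["tunnels-in", "tunnels-out"], errors="ignore"
)
base["flow_no"] = range(len(base))

# Left merges keep flows that have no tunnels (their tunnel columns are NaN).
result = (
    base.merge(tunnels_frame(flows, "tunnels-in"), on="flow_no", how="left")
        .merge(tunnels_frame(flows, "tunnels-out"), on="flow_no", how="left")
)
print(result.columns.tolist())
```

Note that if a flow had several tunnels-in and several tunnels-out entries, merging both frames on flow_no would pair every inbound tunnel with every outbound one; in that case it may be cleaner to keep the two tunnel frames separate.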
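Update 3/8/2018
For a column that contains a list of dictionaries, the call below seems to do what I want, but it requires the [0] index for the flow number. Does anyone know how to make this approach work for all flows rather than one at a time? If that can be done, I should be able to join or merge on the index number. Doing the whole thing with a single json_normalize call would be better, but besides the [0] issue there is the additional problem that not all flows have the nested list of dictionaries. I will keep trying, but any ideas would be appreciated.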
json_normalize(json_dict['data']['viptela-oper-vpn']['dpi']['flows'][0]['tunnels-in'])
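The per-flow call can be repeated for every flow and the pieces stacked with pd.concat, which keeps the flow number as an index level to join or merge on later. This is a minimal sketch, assuming pd.json_normalize (pandas 1.0 or later) and a trimmed stand-in for the flows list. The third flow with two tunnels is made up for illustration, and per_flow, flow_no and tunnel_no are hypothetical names.

```python
import pandas as pd

# Trimmed stand-in for the flows list; the two-tunnel flow is invented
# purely to show how several tunnels per flow are indexed.
flows = [
    {"src-port": 65369,
     "tunnels-in": [
         {"index_me": 1, "remote-tloc": {"ip": "1.1.1.103", "color": "private2"}},
     ]},
    {"src-port": 11895},  # no 'tunnels-in' key at all
    {"src-port": 65529,
     "tunnels-in": [
         {"index_me": 1, "remote-tloc": {"ip": "1.1.1.103", "color": "private2"}},
         {"index_me": 2, "remote-tloc": {"ip": "1.1.1.105", "color": "mpls"}},
     ]},
]

# Normalize each flow's tunnel list separately, keyed by the flow's position.
# Flows with a missing or empty 'tunnels-in' list are skipped.
per_flow = {
    flow_no: pd.json_normalize(flow["tunnels-in"], sep="_")
    for flow_no, flow in enumerate(flows)
    if flow.get("tunnels-in")
}

# The dict keys become the outer index level, so each row keeps its flow number.
# pd.concat raises ValueError if per_flow is empty (no flow has tunnels).
tunnels_in = pd.concat(per_flow, names=["flow_no", "tunnel_no"])
print(tunnels_in)
```

Calling tunnels_in.reset_index() turns flow_no into an ordinary column that can be merged against the flow-level DataFrame.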
Answer 0 (score: 0)
I am struggling with this right now as well. I worked around the selection by index [0] by pulling the index into another pandas.DataFrame, like so:
df = json_normalize(pd.DataFrame(list(json_dict['data']['viptela-oper-vpn']['dpi']['flows']))['tunnels-in'])
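A related variation, offered only as a sketch and not as the method used above: explode the list column so each tunnel dictionary gets its own row, normalize those dictionaries, and join them back on the original flow index. It assumes pandas 1.0 or later (Series.explode and pd.json_normalize), a trimmed stand-in for the flows list, and that at least one flow has a 'tunnels-in' key so that the column exists.

```python
import pandas as pd

# Trimmed stand-in for json_dict['data']['viptela-oper-vpn']['dpi']['flows'].
flows = [
    {"vpn-id": 1, "src-port": 65369,
     "tunnels-in": [
         {"index_me": 1,
          "remote-tloc": {"ip": "1.1.1.103", "color": "private2", "encap": "ipsec"}},
     ]},
    {"vpn-id": 512, "src-port": 11895},  # flow without tunnels
]

df = pd.DataFrame(flows)

# One tunnel dict per row; the index still holds the flow's row number.
# Flows without tunnels become NaN after explode and are dropped here.
exploded = df["tunnels-in"].explode().dropna()

# Flatten the tunnel dicts and restore the flow index for the join.
flat = pd.json_normalize(exploded.tolist(), sep="_").add_prefix("tunnels-in_")
flat.index = exploded.index

# Default left join on the index: flows without tunnels keep a row with NaN values.
result = df.drop(columns=["tunnels-in"]).join(flat)
print(result)
```

The same steps can be repeated for 'tunnels-out' with its own prefix.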