Python - Adding fields and labels to a nested JSON file

Time: 2016-12-18 15:55:19

Tags: python json pandas dictionary dataframe

My dataframe looks like this:

Name_ID | URL                    | Count | Rating
------------------------------------------------
ABC     | www.example.com/ABC    | 10    | 5
123     | www.example.com/123    | 9     | 4
XYZ     | www.example.com/XYZ    | 5     | 2
ABC111  | www.example.com/ABC111 | 5     | 2
ABC121  | www.example.com/ABC121 | 5     | 2
222     | www.example.com/222    | 5     | 3
abc222  | www.example.com/abc222 | 4     | 2
ABCaaa  | www.example.com/ABCaaa | 4     | 2

I am trying to create JSON like the following:

{
    "name": "sampledata",
    "children": [
        {
            "name": 9,
            "children": [
                {
                    "name": 4,
                    "children": [
                        {
                            "name": "123",
                            "size": 100
                        }
                    ]
                }
            ]
        },
        {
            "name": 10,
            "children": [
                {
                    "name": 5,
                    "children": [
                        {
                            "name": "ABC",
                            "size": 100
                        }
                    ]
                }
            ]
        },
        {
            "name": 4,
            "children": [
                {
                    "name": 2,
                    "children": [
                        {
                            "name": "abc222",
                            "size": 50
                        },
                        {
                            "name": "ABCaaa",
                            "size": 50
                        }
                    ]
                }
            ]
        },
        {
            "name": 5,
            "children": [
                {
                    "name": 2,
                    "children": [
                        {
                            "name": "ABC",
                            "size": 16
                        },
                        {
                            "name": "ABC111",
                            "size": 16
                        },
                        {
                            "name": "ABC121",
                            "size": 16
                        }
                    ]
                },
                {
                    "name": 3,
                    "children": [
                        {
                            "name": "222",
                            "size": 50
                        }
                    ]
                }
            ]
        }
    ]
}

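To do this:

  • I am trying to add labels such as "name" and "children" to the JSON while creating it.

I tried something like

results = [{"name": i, "children": j} for i,j in results.items()]

but I do not believe it labels the data properly.

  • I also want to add another field labeled "size", which I plan to compute with the formula: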

    (Rating*Count*10000)/number_of_children_to_the_immediate_parent
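
As a quick check of the formula, here is a minimal sketch in plain Python. The helper name `leaf_size` and its `scale` parameter are hypothetical and only illustrate the arithmetic; "siblings" stands for the number of children of the immediate parent.

```python
def leaf_size(rating, count, siblings, scale=10000):
    """Size of one leaf: (Rating * Count * scale) / number of siblings."""
    if siblings <= 0:
        raise ValueError("siblings must be positive")
    return rating * count * scale / siblings


# Count == 5 and Rating == 2 has three rows in the sample data
# (XYZ, ABC111, ABC121), so each of them shares the total three ways.
print(leaf_size(rating=2, count=5, siblings=3))

# Count == 10 and Rating == 5 has a single row (ABC).
print(leaf_size(rating=5, count=10, siblings=1))
```

On Python 3 the `/` operator returns a float; use `//` instead if integer sizes are wanted.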
    

Here is my messy code:

import pandas as pd
from collections import defaultdict
import json

data =[('ABC', 'www.example.com/ABC', 10   , 5), ('123', 'www.example.com/123', 9, 4), ('XYZ', 'www.example.com/XYZ', 5, 2), ('ABC111', 'www.example.com/ABC111', 5, 2), ('ABC121', 'www.example.com/ABC121', 5, 2), ('222', 'www.example.com/222', 5, 3), ('abc222', 'www.example.com/abc222', 4, 2), ('ABCaaa', 'www.example.com/ABCaaa', 4, 2)]

df = pd.DataFrame(data, columns=['Name', 'URL', 'Count', 'Rating'])

gp = df.groupby(['Count'])

dict_json = {"name": "flare"}
children = []

for name, group in gp:
    temp = {}
    temp["name"] = name
    temp["children"] = []

    rgp = group.groupby(['Rating'])
    for n, g in rgp:
        temp2 = {}
        temp2["name"] = n
        temp2["children"] = g.reset_index().T.to_dict().values()
        for t in temp2["children"]:
            t["size"] = (t["Rating"] * t["Count"] * 10000) / len(temp2["children"])
            t["name"] = t["Name"]
            del t["Count"]
            del t["Rating"]
            del t["URL"]
            del t["Name"]
            del t["index"]
        temp["children"].append(temp2)
    children.append(temp)

dict_json["children"] = children

print(json.dumps(dict_json, indent=4))

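Although the code above does print what I need, I am looking for a more efficient and cleaner way to do the same thing, mainly because the actual dataset may be more deeply nested and more complex. Any help or suggestions would be appreciated.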

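3 Answers:

Answer 0 (score: 9)

Quite an interesting problem and a great question!

You can improve your approach by reorganizing the code inside the loop and using list comprehensions. There is no need to delete keys or introduce temporary variables inside the loop: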

dict_json = {"name": "flare"}

children = []
for name, group in gp:
    temp = {"name": name, "children": []}

    rgp = group.groupby(['Rating'])
    for n, g in rgp:
        temp["children"].append({
            "name": n,
            "children": [
                {"name": row["Name"],
                 "size": row["Rating"] * row["Count"] * 10000 / len(g)}
                for _, row in g.iterrows()
            ]
        })

    children.append(temp)

dict_json["children"] = children

Or, the "wrapped" version:

dict_json = {
    "name": "flare", 
    "children": [
        {
            "name": name, 
            "children": [
                {
                    "name": n,
                    "children": [
                        {
                            "name": row["Name"],
                            "size": row["Rating"] * row["Count"] * 10000 / len(g)
                        } for _, row in g.iterrows()
                    ]
                } for n, g in group.groupby(['Rating'])
            ]
        } for name, group in gp
    ]
}

For your sample input dataframe, this prints the following dictionary:

{
    "name": "flare", 
    "children": [
        {
            "name": 4, 
            "children": [
                {
                    "name": 2, 
                    "children": [
                        {
                            "name": "abc222", 
                            "size": 40000
                        }, 
                        {
                            "name": "ABCaaa", 
                            "size": 40000
                        }
                    ]
                }
            ]
        }, 
        {
            "name": 5, 
            "children": [
                {
                    "name": 2, 
                    "children": [
                        {
                            "name": "XYZ", 
                            "size": 33333
                        }, 
                        {
                            "name": "ABC111", 
                            "size": 33333
                        }, 
                        {
                            "name": "ABC121", 
                            "size": 33333
                        }
                    ]
                }, 
                {
                    "name": 3, 
                    "children": [
                        {
                            "name": "222", 
                            "size": 150000
                        }
                    ]
                }
            ]
        }, 
        {
            "name": 9, 
            "children": [
                {
                    "name": 4, 
                    "children": [
                        {
                            "name": "123", 
                            "size": 360000
                        }
                    ]
                }
            ]
        }, 
        {
            "name": 10, 
            "children": [
                {
                    "name": 5, 
                    "children": [
                        {
                            "name": "ABC", 
                            "size": 500000
                        }
                    ]
                }
            ]
        }
    ]
}
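
Since the real dataset may be nested more deeply, the same idea can be written recursively so that the list of grouping columns decides the depth. This is only a sketch under a few assumptions: the helper name `nest` and its parameters are made up for illustration, the grouping columns are assumed to hold integers, and `//` is used so that the sizes stay integers on Python 3 (where `/` returns floats), matching the output above.

```python
import json

import pandas as pd


def nest(frame, levels, leaf_name="Name", scale=10000):
    """Group `frame` by each column in `levels`, one nesting level per column."""
    if not levels:
        # Leaf level: every row in this group shares the same parent.
        n = len(frame)
        return [
            {"name": str(row[leaf_name]),
             "size": int(row["Rating"]) * int(row["Count"]) * scale // n}
            for _, row in frame.iterrows()
        ]
    head, rest = levels[0], levels[1:]
    return [
        # int(key): numpy integers are not JSON serializable on Python 3.
        {"name": int(key), "children": nest(group, rest, leaf_name, scale)}
        for key, group in frame.groupby(head)
    ]


data = [
    ("ABC", "www.example.com/ABC", 10, 5),
    ("123", "www.example.com/123", 9, 4),
    ("XYZ", "www.example.com/XYZ", 5, 2),
    ("ABC111", "www.example.com/ABC111", 5, 2),
    ("ABC121", "www.example.com/ABC121", 5, 2),
    ("222", "www.example.com/222", 5, 3),
    ("abc222", "www.example.com/abc222", 4, 2),
    ("ABCaaa", "www.example.com/ABCaaa", 4, 2),
]
df = pd.DataFrame(data, columns=["Name", "URL", "Count", "Rating"])

tree = {"name": "flare", "children": nest(df, ["Count", "Rating"])}
print(json.dumps(tree, indent=4))
```

Adding another level would then only mean adding a column name to the list passed to `nest`.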

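Answer 1 (score: 3)

If I understand correctly, what you want to do is turn a groupby into nested JSON. If that is the case, you can use a pandas groupby and convert the result into a nested list of lists, like this: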

lol = pd.DataFrame(df.groupby(['Count','Rating'])\
               .apply(lambda x: list(x['Name_ID']))).reset_index().values.tolist()

lol should look like this:

[['10', '5', ['ABC']],
['4', '2', ['abc222', 'ABCaaa']],
['5', '2', ['XYZ ', 'ABC111', 'ABC121']],
['5', '3', ['222 ']],
['9', '4', ['123 ']]]

After that you can loop over lol to put it into a dict, but since you want to set nested items, you will have to use autovivification (look it up):

class autovividict(dict):
   def __missing__(self, key):
      value = self[key] = type(self)()
      return value

d = autovividict()
for l in lol:
    d[l[0]][l[1]] = l[2]

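Now you can use the json package to print and export:

print(json.dumps(d, indent=2))

If you need multiple groupbys, you can concatenate your groups with pandas, convert to lol, remove any NaNs, and then loop. Let me know if a complete example would help.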
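
For reference, here is a rough Python 3 sketch of the same steps, using `collections.defaultdict` in place of the custom autovivifying class and then adding the "name"/"children" labels the question asks for. The `label` helper is hypothetical, the sample rows are retyped from the question's table, and the grouping columns are assumed to be integers.

```python
import json
from collections import defaultdict

import pandas as pd

rows = [
    ("ABC", 10, 5), ("123", 9, 4), ("XYZ", 5, 2), ("ABC111", 5, 2),
    ("ABC121", 5, 2), ("222", 5, 3), ("abc222", 4, 2), ("ABCaaa", 4, 2),
]
df = pd.DataFrame(rows, columns=["Name_ID", "Count", "Rating"])

# One list of names per (Count, Rating) pair.
grouped = df.groupby(["Count", "Rating"])["Name_ID"].apply(list)

# defaultdict(dict) creates the inner dict on first access (two levels only).
nested = defaultdict(dict)
for (count, rating), names in grouped.items():
    nested[int(count)][int(rating)] = names


def label(node):
    """Turn {key: subtree} dicts and [leaf, ...] lists into name/children records."""
    if isinstance(node, dict):
        return [{"name": k, "children": label(v)} for k, v in node.items()]
    return [{"name": leaf} for leaf in node]


tree = {"name": "flare", "children": label(nested)}
print(json.dumps(tree, indent=2))
```

This sketch leaves out the "size" field; it only shows the grouping and labeling.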

Answer 2 (score: 1)

Setup

from io import StringIO
import pandas as pd

txt = """Name_ID,URL,Count,Rating
ABC,www.example.com/ABC,10,5
123,www.example.com/123,9,4
XYZ,www.example.com/XYZ,5,2
ABC111,www.example.com/ABC111,5,2
ABC121,www.example.com/ABC121,5,2
222,www.example.com/222,5,3
abc222,www.example.com/abc222,4,2
ABCaaa,www.example.com/ABCaaa,4,2"""

df = pd.read_csv(StringIO(txt))

size
Pre-calculate it

df['size'] = df.Count.mul(df.Rating) \
                     .mul(10000) \
                     .div(df.groupby(
                        ['Count', 'Rating']).Name_ID.transform('count')
                     ).astype(int)

Solution
Create a recursive function

def h(d):
    if isinstance(d, pd.Series): d = d.to_frame().T
    rec_cond = d.index.nlevels > 1 or d.index.nunique() > 1
    return {'name': str(d.index[0]), 'size': str(d['size'].iloc[0])} if not rec_cond else \
        [dict(name=str(n), children=h(g.xs(n))) for n, g in d.groupby(level=0)]

Demo

import json

my_dict = dict(name='flare', children=h(df.set_index(['Count', 'Rating', 'Name_ID'])))

json.dumps(my_dict)
  

'{"name": "flare", "children": [{"name": "4", "children": [{"name": "2", "children": [{"name": "ABCaaa", "children": {"name": "ABCaaa", "size": "40000"}}, {"name": "abc222", "children": {"name": "abc222", "size": "40000"}}]}]}, {"name": "5", "children": [{"name": "2", "children": [{"name": "ABC111", "children": {"name": "ABC111", "size": "33333"}}, {"name": "ABC121", "children": {"name": "ABC121", "size": "33333"}}, {"name": "XYZ", "children": {"name": "XYZ", "size": "33333"}}]}, {"name": "3", "children": {"name": "222", "size": "150000"}}]}, {"name": "9", "children": [{"name": "4", "children": {"name": "123", "size": "360000"}}]}, {"name": "10", "children": [{"name": "5", "children": {"name": "ABC", "size": "500000"}}]}]}'

my_dict

{'children': [{'children': [{'children': [{'children': {'name': 'ABCaaa',
        'size': '40000'},
       'name': 'ABCaaa'},
      {'children': {'name': 'abc222', 'size': '40000'}, 'name': 'abc222'}],
     'name': '2'}],
   'name': '4'},
  {'children': [{'children': [{'children': {'name': 'ABC111', 'size': '33333'},
       'name': 'ABC111'},
      {'children': {'name': 'ABC121', 'size': '33333'}, 'name': 'ABC121'},
      {'children': {'name': 'XYZ', 'size': '33333'}, 'name': 'XYZ'}],
     'name': '2'},
    {'children': {'name': '222', 'size': '150000'}, 'name': '3'}],
   'name': '5'},
  {'children': [{'children': {'name': '123', 'size': '360000'}, 'name': '4'}],
   'name': '9'},
  {'children': [{'children': {'name': 'ABC', 'size': '500000'}, 'name': '5'}],
   'name': '10'}],
 'name': 'flare'}