Question

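I often use pandas groupby to generate stacked tables. But then I often want to output the resulting nested relations to JSON. Is there any way to extract nested JSON from the stacked table it produces?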

Let's say I have a df like:

year office candidate  amount
2010 mayor  joe smith  100.00
2010 mayor  jay gould   12.00
2010 govnr  pati mara  500.00
2010 govnr  jess rapp   50.00
2010 govnr  jess rapp   30.00

I can do:

grouped = df.groupby(['year', 'office', 'candidate']).sum()

print(grouped)
                       amount
year office candidate 
2010 mayor  joe smith   100
            jay gould    12
     govnr  pati mara   500
            jess rapp    80

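Beautiful! Of course, what I'd really like to do is get nested JSON via a command along the lines of grouped.to_json. But that feature is not available. Any workarounds?

So, what I really want is something like: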

{"2010": {"mayor": [
                    {"joe smith": 100},
                    {"jay gould": 12}
                   ]
         }, 
          {"govnr": [
                     {"pati mara":500}, 
                     {"jess rapp": 80}
                    ]
          }
}

Don

Answer 1

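I don't think there is anything built into pandas to create a nested dictionary of the data. Below is some code that should work in general for a Series with a MultiIndex, using a defaultdict.

The nesting code iterates through each level of the MultiIndex, adding layers to the dictionary until the deepest layer is assigned the Series value.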

In  [99]: from collections import defaultdict

In [100]: results = defaultdict(lambda: defaultdict(dict))

In [101]: for index, value in grouped.itertuples():
     ...:     for i, key in enumerate(index):
     ...:         if i == 0:
     ...:             nested = results[key]
     ...:         elif i == len(index) - 1:
     ...:             nested[key] = value
     ...:         else:
     ...:             nested = nested[key]

In [102]: results
Out[102]: defaultdict(<function <lambda> at 0x7ff17c76d1b8>, {2010: defaultdict(<type 'dict'>, {'govnr': {'pati mara': 500.0, 'jess rapp': 80.0}, 'mayor': {'joe smith': 100.0, 'jay gould': 12.0}})})

In [105]: import json

In [106]: print(json.dumps(results, indent=4))
{
    "2010": {
        "govnr": {
            "pati mara": 500.0, 
            "jess rapp": 80.0
        }, 
        "mayor": {
            "joe smith": 100.0, 
            "jay gould": 12.0
        }
    }
}
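
The result above nests plain dictionaries, while the question asked for a list of single-key objects under each office. A minimal sketch of a post-processing step that could produce that shape is shown below. The helper name to_record_lists is made up for illustration, and the input dictionary is copied by hand from Out[102] rather than computed with pandas.

```python
import json

# Nested result in the shape shown in Out[102] (values copied by hand).
results = {
    2010: {
        "govnr": {"pati mara": 500.0, "jess rapp": 80.0},
        "mayor": {"joe smith": 100.0, "jay gould": 12.0},
    }
}


def to_record_lists(nested):
    """Hypothetical helper: reshape {year: {office: {name: amount}}}
    into {year: {office: [{name: amount}, ...]}}."""
    return {
        str(year): {
            office: [{name: amount} for name, amount in candidates.items()]
            for office, candidates in offices.items()
        }
        for year, offices in nested.items()
    }


reshaped = to_record_lists(results)
print(json.dumps(reshaped, indent=4))
```

Years are converted to strings here because JSON object keys are always strings.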

Answer 2

I had a look at the solution above and found that it only works for 3 levels of nesting. This solution works for any number of levels.
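
As an illustration of nesting to any depth, here is a minimal sketch and not necessarily the approach this answer had in mind. It assumes the input is an iterable of (key_tuple, value) pairs, which is the shape yielded by grouped.itertuples() on a one-column frame or by Series.items() on a MultiIndex Series. The function name nest and the hand-written rows are assumptions for the example.

```python
import json


def nest(pairs):
    """Build a nested dict from (key_tuple, value) pairs of any depth."""
    root = {}
    for keys, value in pairs:
        node = root
        # Walk down (creating as needed) one dict per level except the last.
        for key in keys[:-1]:
            node = node.setdefault(str(key), {})
        # The deepest key holds the value itself.
        node[str(keys[-1])] = value
    return root


# Hand-written stand-in for the (index, value) pairs of the grouped data.
rows = [
    ((2010, "mayor", "joe smith"), 100.0),
    ((2010, "mayor", "jay gould"), 12.0),
    ((2010, "govnr", "pati mara"), 500.0),
    ((2010, "govnr", "jess rapp"), 80.0),
]

nested = nest(rows)
print(json.dumps(nested, indent=4))
```

Because each level is created with setdefault on a plain dict, the depth is driven by the length of each key tuple rather than by a fixed defaultdict structure.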


Answer 3

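I know this is an old question, but I came across the same issue recently. Here's my solution. I borrowed a lot from chrisb's example (thank you!).

This has the advantage that you can pass a lambda to get the final value from whatever enumerable you want, as well as for each group.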

Here is the function, which you would call on the data from the question:

from collections import defaultdict

def dict_from_enumerable(enumerable, final_value, *groups):
    d = defaultdict(lambda: defaultdict(dict))
    group_count = len(groups)
    for item in enumerable:
        nested = d
        item_result = final_value(item) if callable(final_value) else item.get(final_value)
        for i, group in enumerate(groups, start=1):
            group_val = str(group(item) if callable(group) else item.get(group))
            if i == group_count:
                nested[group_val] = item_result
            else:
                nested = nested[group_val]
    return d

The first argument can also be a plain array of data, without even requiring pandas.
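
To illustrate the pandas-free case, here is a small usage sketch. The function body is repeated from above so the snippet runs on its own. The records list is made up for the example and holds already-aggregated rows, since the function overwrites rather than sums values that share the same keys.

```python
import json
from collections import defaultdict


# Copied from the answer above so this snippet is self-contained.
def dict_from_enumerable(enumerable, final_value, *groups):
    d = defaultdict(lambda: defaultdict(dict))
    group_count = len(groups)
    for item in enumerable:
        nested = d
        item_result = final_value(item) if callable(final_value) else item.get(final_value)
        for i, group in enumerate(groups, start=1):
            group_val = str(group(item) if callable(group) else item.get(group))
            if i == group_count:
                nested[group_val] = item_result
            else:
                nested = nested[group_val]
    return d


# Illustrative, already-aggregated rows as plain dicts (no pandas involved).
records = [
    {"year": 2010, "office": "mayor", "candidate": "joe smith", "amount": 100.0},
    {"year": 2010, "office": "mayor", "candidate": "jay gould", "amount": 12.0},
    {"year": 2010, "office": "govnr", "candidate": "pati mara", "amount": 500.0},
    {"year": 2010, "office": "govnr", "candidate": "jess rapp", "amount": 80.0},
]

# Field names as strings: the value field first, then one group per level.
result = dict_from_enumerable(records, "amount", "year", "office", "candidate")

# A callable can compute the final value instead of looking up a field.
doubled = dict_from_enumerable(
    records, lambda r: r["amount"] * 2, "year", "office", "candidate"
)

print(json.dumps(result, indent=4))
```

Note that string arguments rely on item.get, so this form expects dict-like items. With the fixed defaultdict structure it supports at most three group levels.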

Answer 4

Here is a generic recursive solution for this problem:

def df_to_dict(df):
    if df.ndim == 1:
        return df.to_dict()

    ret = {}
    for key in df.index.get_level_values(0).unique():
        sub_df = df.xs(key)
        ret[key] = df_to_dict(sub_df)
    return ret

pandas groupby nested json
