我想要字典中的多索引熊猫DataFrame。内部字典包含相同长度的列表/ numpy数组。
x = {'a': {'x': [0, 1, 2], 'y': [1 ,2 ,3]},
'b': {'x': [4, 6, 8], 'y': [9, 8, 7]}}
some_function(x)
=>
x y <- first index
0 1 2 0 1 2 <- second index
a 0 1 2 1 2 3
b 4 6 8 9 8 7
这是我已经尝试过的方法,但是有没有更有效的方法?只喜欢大熊猫吗?还是有熊猫功能可以解决这个问题?
def dict_of_dicts_of_collections_to_multiindex_df(dict_of_dicts_of_collections):
x = dict_of_dicts_of_collections
result = {}
for outer_key, intermediate_dict in x.items():
result[outer_key] = {}
for intermediate_key, collection in intermediate_dict.items():
try:
for i, e in enumerate(collection):
result[outer_key][(intermediate_key, i)] = e
except TypeError:
pass
return pd.DataFrame(result).T
答案 0 :(得分:1)
您是否考虑过进行逆向工程?我的意思是建立一个多索引数据框并检查当您print(df.to_dict())
时看起来是什么样子。使用此answer的第一部分,我们可以将输出作为您想要的输出,
import pandas as pd
data = [[0, 1, 2, 1 ,2 ,3],
[4, 6, 8, 9, 8, 7]]
df = pd.DataFrame(data)
df.colums = pd.MultiIndex.from_product([['x','y'], [0,1,2]])
df.index = ['a', 'b']
print(df.to_dict())
{('x', 0): {'a': 0, 'b': 4},
('x', 1): {'a': 1, 'b': 6},
('x', 2): {'a': 2, 'b': 8},
('y', 0): {'a': 1, 'b': 9},
('y', 1): {'a': 2, 'b': 8},
('y', 2): {'a': 3, 'b': 7}}
因此,如果您可以将数据作为两个列表,则可以使用pd.MultiIndex.from_product
技巧。
否则
import pandas as pd
data = {'a': {'x': [0, 1, 2], 'y': [1 ,2 ,3]},
'b': {'x': [4, 6, 8], 'y': [9, 8, 7]}}
df = pd.DataFrame(data).T
# you obeserve that expand list to columns
df["x"].apply(pd.Series)
# then using this for every column
# and again pd.MultiIndex.from_product
# gives you the desired output
cols = df.columns
df = pd.concat([df[col].apply(pd.Series) for col in cols], axis=1)
df.columns = pd.MultiIndex.from_product([cols, [0,1,2]])
答案 1 :(得分:1)
我针对此问题创建了两种替代方法,并对结果计时。还包括其他答案以及原始功能。
from copy import deepcopy
import pandas as pd
from collections import defaultdict
import numpy as np
x = {'a': {'x': [0, 1, 2], 'y': [1 ,2 ,3]},
'b': {'x': [4, 6, 8], 'y': [9, 8, 7]}}
test = deepcopy(x)
for i in range(1000):
test.update({f'a_{i}':test['a']})
test2 = {k:{key: val*300 for key, val in v.items()} for k, v in x.items()}
print(len(test2['a']['x']))
for i in range(1000):
test2.update({f'a_{i}':test2['a']})
def dict_of_dicts_of_collections_to_multiindex_df(dict_of_dicts_of_collections):
x = dict_of_dicts_of_collections
result = {}
for outer_key, intermediate_dict in x.items():
result[outer_key] = {}
for intermediate_key, collection in intermediate_dict.items():
try:
for i, e in enumerate(collection):
result[outer_key][(intermediate_key, i)] = e
except TypeError:
pass
return pd.DataFrame(result).T
def out_from_other_answer(data):
df = pd.DataFrame(data).T
cols = df.columns
df = pd.concat([df[col].apply(pd.Series) for col in cols], axis=1)
#tweaked to avoid hardcoding [0, 1, 2]
df.columns = pd.MultiIndex.from_product([cols, range(len(df.columns)//len(cols))])
return df
def out2(dict_of_dicts):
df = pd.DataFrame(list(dict_of_dicts.values()))
out_df = pd.concat([pd.DataFrame(df[col].values.tolist())
for col in df.columns
],
axis=1,
keys=df.columns,
)
out_df.index = dict_of_dicts.keys()
return out_df
def out3(data):
temp = defaultdict(list)
for d in list(data.values()):
for k, v in d.items():
temp[k].append(v)
out = pd.concat([pd.DataFrame(v) for v in temp.values()], axis=1, keys=temp.keys())
out.index = data.keys()
return out
让我们看看结果。 带有小数据x。
%timeit dict_of_dicts_of_collections_to_multiindex_df(x)
%timeit out_from_other_answer(x)
%timeit out2(x)
%timeit out3(x)
1.63 ms ± 102 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
4.49 ms ± 492 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.44 ms ± 102 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
1.49 ms ± 98.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
具有更多行但测试相同列。
%timeit dict_of_dicts_of_collections_to_multiindex_df(test)
%timeit out_from_other_answer(test)
%timeit out2(test)
%timeit out3(test)
70.3 ms ± 10.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
494 ms ± 40.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
4.81 ms ± 185 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
3.37 ms ± 115 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
具有更多行和更多列的test2
%timeit dict_of_dicts_of_collections_to_multiindex_df(test2)
%timeit out_from_other_answer(test2)
%timeit out2(test2)
%timeit out3(test2)
1.24 s ± 19.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.1 s ± 63.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
590 ms ± 39.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
598 ms ± 44 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
似乎每种解决方案在增加数据维度方面都受到不同的影响。总体看来,out3是最好的选择。本质上,最好在处理输入数据之前先更改其布局。