Here is how I create a table with multi-level columns:
import pandas as pd
whatFields = ['mean', 'mom_2', 'n']
groupbyFields = ['foo', 'bar']
topFields = ['desc']*len(groupbyFields)
topFields += ['price']*len(whatFields)
topFields += ['units']*len(whatFields)
bottomFields = groupbyFields + whatFields + whatFields
resultsDf = pd.DataFrame(columns=pd.MultiIndex.from_arrays([topFields, bottomFields]))
indexFields = [('desc', field) for field in groupbyFields]
resultsDf.set_index(indexFields, inplace=True)
This is the empty result:
Empty DataFrame
Columns: [(price, mean), (price, mom_2), (price, n), (units, mean), (units, mom_2), (units, n)]
Index: []
>>> resultsDf.index
Out[2]:
MultiIndex(levels=[[], []],
           labels=[[], []],
           names=[('desc', 'foo'), ('desc', 'bar')])
However, after it is filled in, it looks like this:
                        price           units
                         mean mom_2   n  mean mom_2   n
(desc, foo) (desc, bar)
1500002071  4292          NaN   NaN NaN   NaN   NaN NaN
            4246          NaN   NaN NaN   NaN   NaN NaN
            342           NaN   NaN NaN   NaN   NaN NaN
            104           NaN   NaN NaN   NaN   NaN NaN
            4218         2.59     0   1   NaN   NaN NaN
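The problem is that the index fields get these odd tuple-form names, whereas the columns have "proper" names laid out as multi-level columns.
You might think this is because they are part of an index. No, the names look the same when the fields are ordinary columns: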
  (desc, foo) (desc, bar) price           units
                           mean mom_2   n  mean mom_2   n
0  1500002071        4292   NaN   NaN NaN   NaN   NaN NaN
1  1500002071        4246   NaN   NaN NaN   NaN   NaN NaN
2  1500002071         342   NaN   NaN NaN   NaN   NaN NaN
3  1500002071         104   NaN   NaN NaN   NaN   NaN NaN
4  1500002071        4218  2.59     0   1   NaN   NaN NaN
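Why doesn't the index follow the columns in its multi-level layout? Needless to say, I want to access the index by foo and bar (or by a true MultiIndex, at least not by these pseudo-tuples).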
How can I achieve this? Is there a better way to generate my empty df to begin with?
Answer 0 (score: 0)
Is this what you are looking for? I'm not sure how you want the main index set up.
Two ways:
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: import itertools as it
In [4]: whatFields = ['mean', 'mom_2', 'n']
...: groupbyFields = ['foo', 'bar']
...: topFields= ['price', 'units']
...: descriptions = [11, 22, 33, 44]
...:
...: top_index = list(it.product(topFields, whatFields))
...:
...: main_index = list(it.product(descriptions, groupbyFields))
...: main_index
Out[4]:
[(11, 'foo'),
(11, 'bar'),
(22, 'foo'),
(22, 'bar'),
(33, 'foo'),
(33, 'bar'),
(44, 'foo'),
(44, 'bar')]
In [5]: top_index
Out[5]:
[('price', 'mean'),
('price', 'mom_2'),
('price', 'n'),
('units', 'mean'),
('units', 'mom_2'),
('units', 'n')]
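As a side note, the same column labels can probably be built without itertools. The following is a minimal sketch, assuming pd.MultiIndex.from_product is acceptable in place of it.product plus from_tuples; the names product_columns and tuple_columns are illustrative and not part of the original session.

```python
import itertools as it

import pandas as pd

topFields = ['price', 'units']
whatFields = ['mean', 'mom_2', 'n']

# Cartesian product of the two label lists, built directly as a MultiIndex.
product_columns = pd.MultiIndex.from_product([topFields, whatFields])

# The itertools-based construction used in the session, for comparison.
tuple_columns = pd.MultiIndex.from_tuples(list(it.product(topFields, whatFields)))

same_labels = product_columns.equals(tuple_columns)
```

Either index can be passed as columns= when constructing the DataFrame.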
In [6]: resultsDf = pd.DataFrame(index=pd.MultiIndex.from_tuples(main_index)
...: .set_names(['desc', 'something']),
...: columns=pd.MultiIndex.from_tuples(top_index),
...: data=np.random.rand(len(main_index), len(top_index))
...: ).sort_index()
In [7]: resultsDf
Out[7]:
                   price                         units
                    mean     mom_2         n      mean     mom_2         n
desc something
11   bar        0.415331  0.153503  0.750690  0.505439  0.781057  0.102450
     foo        0.444163  0.921779  0.587966  0.988859  0.747277  0.645065
22   bar        0.205548  0.835086  0.630778  0.936277  0.587607  0.644636
     foo        0.907772  0.927121  0.457286  0.881467  0.091484  0.217839
33   bar        0.207454  0.670291  0.609697  0.024396  0.808362  0.738188
     foo        0.838015  0.058354  0.804375  0.704137  0.760060  0.638933
44   bar        0.577411  0.085774  0.394033  0.798052  0.107777  0.852888
     foo        0.528873  0.902225  0.098982  0.611146  0.122890  0.887364
Or:
In [10]: resultsDf = pd.DataFrame(columns=pd.MultiIndex.from_tuples(top_index),
...: data=np.random.rand(len(main_index), len(top_index)) )
...:
...: resultsDf['desc'], resultsDf['something'] = zip(*main_index)
...:
...:
...: resultsDf = resultsDf.set_index(['desc', 'something']).sort_index()
...:
In [11]: resultsDf
Out[11]:
                   price                         units
                    mean     mom_2         n      mean     mom_2         n
desc something
11   foo        0.205574  0.673159  0.772009  0.598809  0.070022  0.332420
     bar        0.844376  0.602825  0.433186  0.420408  0.299380  0.354098
22   foo        0.341226  0.489068  0.784226  0.721386  0.866248  0.113838
     bar        0.729578  0.209731  0.533399  0.993587  0.340383  0.895143
33   foo        0.629427  0.285344  0.634120  0.940294  0.378314  0.416081
     bar        0.251746  0.022984  0.415058  0.322093  0.719954  0.251906
44   foo        0.247829  0.085609  0.680114  0.760157  0.493465  0.659629
     bar        0.667425  0.749589  0.578318  0.190334  0.131337  0.090083
In [13]: resultsDf.loc[(22, "bar")]
Out[13]:
price  mean     0.729578
       mom_2    0.209731
       n        0.533399
units  mean     0.993587
       mom_2    0.340383
       n        0.895143
Name: (22, bar), dtype: float64
In [14]: resultsDf.loc[(22, "bar"), "units"]
Out[14]:
mean 0.993587
mom_2 0.340383
n 0.895143
Name: (22, bar), dtype: float64
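For the empty-frame case in the question, where the index should be addressed as foo and bar, one possible approach is sketched below. It assumes that plain level names on the row index are what is wanted. The variable names and the second foo value (1500002072) are illustrative only and do not come from the question's data.

```python
import numpy as np
import pandas as pd

what_fields = ['mean', 'mom_2', 'n']
groupby_fields = ['foo', 'bar']
columns = pd.MultiIndex.from_product([['price', 'units'], what_fields])

# Option 1: create the empty frame with a named two-level index from the start.
empty_index = pd.MultiIndex.from_arrays([[], []], names=groupby_fields)
empty_df = pd.DataFrame(index=empty_index, columns=columns)

# Option 2: keep the question's set_index call, then replace the tuple names.
top = ['desc'] * 2 + ['price'] * 3 + ['units'] * 3
bottom = groupby_fields + what_fields + what_fields
renamed_df = pd.DataFrame(columns=pd.MultiIndex.from_arrays([top, bottom]))
renamed_df = renamed_df.set_index([('desc', field) for field in groupby_fields])
renamed_df.index = renamed_df.index.set_names(groupby_fields)

# With named levels, rows can be selected by level name (made-up data below).
rows = pd.MultiIndex.from_tuples(
    [(1500002071, 4218), (1500002071, 4292), (1500002072, 4218)],
    names=groupby_fields,
)
filled_df = pd.DataFrame(
    np.arange(18, dtype=float).reshape(3, 6), index=rows, columns=columns
)
bar_4218 = filled_df.xs(4218, level='bar')            # rows where bar == 4218
foo_values = filled_df.index.get_level_values('foo')  # the foo labels as an Index
```

Option 1 avoids the tuple-shaped names entirely because the fields never pass through the two-level column index. Option 2 may be preferable when the frame is already built the way the question builds it.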