Question

Here is how I create a table with multi-level columns:

import pandas as pd

whatFields = ['mean', 'mom_2', 'n']
groupbyFields = ['foo', 'bar']
topFields = ['desc']*len(groupbyFields)
topFields += ['price']*len(whatFields)
topFields += ['units']*len(whatFields)
bottomFields = groupbyFields + whatFields + whatFields
resultsDf = pd.DataFrame(columns=pd.MultiIndex.from_arrays([topFields, bottomFields]))
indexFields = [('desc', field) for field in groupbyFields]
resultsDf.set_index(indexFields, inplace=True)

This is the empty result:

Empty DataFrame
Columns: [(price, mean), (price, mom_2), (price, n), (units, mean), (units, mom_2), (units, n)]
Index: []

>>> resultsDf.index
Out[2]: 
MultiIndex(levels=[[], []],
           labels=[[], []],
           names=[('desc', 'foo'), ('desc', 'bar')])

However, after filling it in, it looks like this:

                                     price            units           
                                      mean mom_2    n  mean mom_2    n
(desc, foo) (desc, bar)                                  
1500002071  4292                       NaN   NaN  NaN   NaN   NaN  NaN
            4246                       NaN   NaN  NaN   NaN   NaN  NaN
            342                        NaN   NaN  NaN   NaN   NaN  NaN
            104                        NaN   NaN  NaN   NaN   NaN  NaN
            4218                      2.59     0    1   NaN   NaN  NaN

The problem is that the index fields end up with these odd tuple-form names, while the columns have "proper" names laid out in multi-column form.

You might think this is because they are an index. No:

  (desc, foo) (desc, bar) price            units           
                                        mean mom_2    n  mean mom_2    n
0  1500002071                     4292   NaN   NaN  NaN   NaN   NaN  NaN
1  1500002071                     4246   NaN   NaN  NaN   NaN   NaN  NaN
2  1500002071                      342   NaN   NaN  NaN   NaN   NaN  NaN
3  1500002071                      104   NaN   NaN  NaN   NaN   NaN  NaN
4  1500002071                     4218  2.59     0    1   NaN   NaN  NaN

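Why doesn't the index follow the columns when it comes to the multi-level layout? Ideally, I'd like to access the index by foo and bar (or through a real MultiIndex; at the very least, not through these pseudo-tuples).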

How can I achieve this? Is there a better way to generate my empty df to begin with?

Answer 1

Is this what you're looking for? I'm not sure how you want to set up the main index.

Two ways:

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: import itertools as it

In [4]: whatFields = ['mean', 'mom_2', 'n']
   ...: groupbyFields = ['foo', 'bar']
   ...: topFields= ['price', 'units']
   ...: descriptions = [11, 22, 33, 44]
   ...:
   ...: top_index = list(it.product(topFields, whatFields))
   ...:
   ...: main_index = list(it.product(descriptions, groupbyFields))
   ...: main_index
Out[4]:
[(11, 'foo'),
 (11, 'bar'),
 (22, 'foo'),
 (22, 'bar'),
 (33, 'foo'),
 (33, 'bar'),
 (44, 'foo'),
 (44, 'bar')]

In [5]: top_index
Out[5]:
[('price', 'mean'),
 ('price', 'mom_2'),
 ('price', 'n'),
 ('units', 'mean'),
 ('units', 'mom_2'),
 ('units', 'n')]

In [6]: resultsDf = pd.DataFrame(index=pd.MultiIndex.from_tuples(main_index)
   ...:                                  .set_names(['desc', 'something']),
   ...:                          columns=pd.MultiIndex.from_tuples(top_index),
   ...:                         data=np.random.rand(len(main_index), len(top_index))
   ...:                         ).sort_index()

In [7]: resultsDf
Out[7]:
                   price                         units
                    mean     mom_2         n      mean     mom_2         n
desc something
11   bar        0.415331  0.153503  0.750690  0.505439  0.781057  0.102450
     foo        0.444163  0.921779  0.587966  0.988859  0.747277  0.645065
22   bar        0.205548  0.835086  0.630778  0.936277  0.587607  0.644636
     foo        0.907772  0.927121  0.457286  0.881467  0.091484  0.217839
33   bar        0.207454  0.670291  0.609697  0.024396  0.808362  0.738188
     foo        0.838015  0.058354  0.804375  0.704137  0.760060  0.638933
44   bar        0.577411  0.085774  0.394033  0.798052  0.107777  0.852888
     foo        0.528873  0.902225  0.098982  0.611146  0.122890  0.887364
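
As a side note, the itertools.product step above can probably be folded into pd.MultiIndex.from_product, which builds the same cartesian product and accepts the level names directly. The sketch below assumes the same field lists as the session above; the variable names `columns` and `index` are only illustrative.

```python
import numpy as np
import pandas as pd

whatFields = ['mean', 'mom_2', 'n']
groupbyFields = ['foo', 'bar']
topFields = ['price', 'units']
descriptions = [11, 22, 33, 44]

# Column MultiIndex: every (top field, statistic) pair.
columns = pd.MultiIndex.from_product([topFields, whatFields])

# Row MultiIndex: every (description, group-by field) pair, with named levels.
index = pd.MultiIndex.from_product([descriptions, groupbyFields],
                                   names=['desc', 'something'])

# Random placeholder data, as in the session above.
resultsDf = pd.DataFrame(np.random.rand(len(index), len(columns)),
                         index=index, columns=columns).sort_index()

print(resultsDf.index.names)
```

Because the names are attached when the index is built, no separate set_names call is needed.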

Or:

In [10]: resultsDf = pd.DataFrame(columns=pd.MultiIndex.from_tuples(top_index),
    ...:                         data=np.random.rand(len(main_index), len(top_index)) )
    ...:
    ...: resultsDf['desc'], resultsDf['something'] = zip(*main_index)
    ...:
    ...:
    ...: resultsDf = resultsDf.set_index(['desc', 'something']).sort_index()
    ...:

In [11]: resultsDf
Out[11]:
                   price                         units
                    mean     mom_2         n      mean     mom_2         n
desc something
11   foo        0.205574  0.673159  0.772009  0.598809  0.070022  0.332420
     bar        0.844376  0.602825  0.433186  0.420408  0.299380  0.354098
22   foo        0.341226  0.489068  0.784226  0.721386  0.866248  0.113838
     bar        0.729578  0.209731  0.533399  0.993587  0.340383  0.895143
33   foo        0.629427  0.285344  0.634120  0.940294  0.378314  0.416081
     bar        0.251746  0.022984  0.415058  0.322093  0.719954  0.251906
44   foo        0.247829  0.085609  0.680114  0.760157  0.493465  0.659629
     bar        0.667425  0.749589  0.578318  0.190334  0.131337  0.090083

In [13]: resultsDf.loc[(22, "bar")]
Out[13]:
price  mean     0.729578
       mom_2    0.209731
       n        0.533399
units  mean     0.993587
       mom_2    0.340383
       n        0.895143
Name: (22, bar), dtype: float64

In [14]: resultsDf.loc[(22, "bar"), "units"]
Out[14]:
mean     0.993587
mom_2    0.340383
n        0.895143
Name: (22, bar), dtype: float64
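
To tie this back to the construction in the question: set_index uses each column key as the name of the resulting index level, and with multi-level columns each key is a tuple such as ('desc', 'foo'). A minimal sketch of one possible fix, assuming it is acceptable to simply rename the levels afterwards (the `tupleNames` variable is only there for illustration):

```python
import pandas as pd

whatFields = ['mean', 'mom_2', 'n']
groupbyFields = ['foo', 'bar']
topFields = (['desc'] * len(groupbyFields)
             + ['price'] * len(whatFields)
             + ['units'] * len(whatFields))
bottomFields = groupbyFields + whatFields + whatFields

resultsDf = pd.DataFrame(
    columns=pd.MultiIndex.from_arrays([topFields, bottomFields]))

# Each key is a column tuple, so each index level is named by that tuple.
resultsDf = resultsDf.set_index([('desc', field) for field in groupbyFields])
tupleNames = list(resultsDf.index.names)
print(tupleNames)

# Replace the tuple names with plain strings.
resultsDf.index = resultsDf.index.set_names(groupbyFields)
print(list(resultsDf.index.names))

# The levels can now be addressed by name.
print(resultsDf.index.get_level_values('foo'))
```

After the rename, the index is an ordinary two-level MultiIndex named foo and bar, while the price and units columns keep their two-level layout.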
