Question

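I have a large pandas DataFrame containing time series data and a fairly large MultiIndex. The index holds various information about each time series, such as location, data type, and so on.

Now I want to add a new row (level) to the index containing an integer (or float, it doesn't matter) that holds the distance to a certain point. After that, I want to sort the DataFrame by this distance.

I'm not sure how to add a new index level or how to assign the new values manually. Also, can pandas even sort columns by arbitrary numbers in one of the index levels?

Example

(code taken from here)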

header=pd.MultiIndex.from_product([['location1','location2'],['S1','S2','S3']],names=['loc','S'])
df = pd.DataFrame(np.random.randn(5, 6), index=['a','b','c','d','e'], columns = header)

which looks like this:

loc  location1                      location2                    
S           S1        S2        S3         S1        S2        S3
a     1.530590  0.536364  1.295848   0.422256 -1.853786  1.334981
b     0.275857 -0.848685 -1.212584  -0.464235 -0.855600  0.680985
c    -1.209607  0.265359 -0.695233   0.643896  1.315216 -0.751027
d    -1.591613 -0.178605  0.878567   0.647389 -0.454313 -1.972509
e     1.098193 -0.766810  0.087173   0.714301 -0.886545 -0.826163

What I want, as a first step, is to attach a distance to each column, e.g. dist 200 for location1 S1, dist 760 for location1 S2, and so on, giving the following result:

loc  location1                      location2                    
S           S1        S2        S3         S1        S2        S3
dist       200       760        10       1000       340        70
a     1.530590  0.536364  1.295848   0.422256 -1.853786  1.334981
b     0.275857 -0.848685 -1.212584  -0.464235 -0.855600  0.680985
c    -1.209607  0.265359 -0.695233   0.643896  1.315216 -0.751027
d    -1.591613 -0.178605  0.878567   0.647389 -0.454313 -1.972509
e     1.098193 -0.766810  0.087173   0.714301 -0.886545 -0.826163

and then do something like df.sortlevel('dist'), producing

loc location1 location2 location1 location2 location1 location2
S          S3        S3        S1        S2        S2        S1
dist       10        70       200       340       760      1000
a    1.295848  1.334981  1.530590 -1.853786  0.536364  0.422256
b   -1.212584  0.680985  0.275857 -0.855600 -0.848685 -0.464235
…

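The point is to have everything ordered by distance for things like plt.matshow(df.corr()).

Can pandas even sort a DataFrame by an arbitrary index level that contains integers? I ask because I have another DataFrame that already has an integer in its MultiIndex, and some_otherdf.sortlevel('HZB') raises TypeError: can only sort by level with a hierarchical index

Edit

As of now there are two answers, and both work perfectly well for my test case. I think @Pedro M Duarte's answer is probably the more correct one, since it uses the MultiIndex. However, for my real data, with a 7-level-deep MultiIndex and 50 data series, it requires a lot of rework or a lot of typing, which is very error-prone. @Nader Hisham ignored my request to stay within the MultiIndex, but his approach only needs one quick, simple, easily checked row of numbers (which saves me a lot of time), and I can delete that row after sorting. For others with a similar problem, the trade-off may be different.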

Answer 1

In[1]:
import pandas as pd
import numpy as np

header=pd.MultiIndex.from_product(
    [['location1','location2'],['S1','S2','S3']],
    names=['loc','S'])

df = pd.DataFrame(np.random.randn(5, 6), 
                  index=['a','b','c','d','e'], columns = header)

print(df)

Out[1]:
    loc location1                     location2                    
    S          S1        S2        S3        S1        S2        S3
    a    0.503357 -0.461202 -1.412865  0.866237  1.290292  0.635869
    b   -0.904658 -1.190422 -0.198654 -0.916884 -1.070291 -1.918091
    c   -1.448068 -0.121475 -0.838693  0.047861 -0.131904  1.154370
    d    1.758752 -0.094962 -2.035204 -0.399195 -0.756726  1.609393
    e    0.421521  1.134518 -0.809148 -0.543523 -1.161328  1.261901



In[2]:
distances = {
    ('location1','S1'): 200,
    ('location1','S2'): 760,
    ('location1','S3'): 10,
    ('location2','S1'): 1000,
    ('location2','S2'): 340,
    ('location2','S3'): 70,
}

index = df.columns
# Append the distance as a third level. Iterating a MultiIndex yields tuples
# (Index.get_values() was removed in pandas 1.0).
df.columns = pd.MultiIndex.from_tuples(
    [(key[0], key[1], distances[key]) for key in index],
    names=[index.names[0], index.names[1], 'dist']
)
print(df)

Out[2]:
    loc  location1                     location2                    
    S           S1        S2        S3        S1        S2        S3
    dist      200       760       10        1000      340       70  
    a     0.503357 -0.461202 -1.412865  0.866237  1.290292  0.635869
    b    -0.904658 -1.190422 -0.198654 -0.916884 -1.070291 -1.918091
    c    -1.448068 -0.121475 -0.838693  0.047861 -0.131904  1.154370
    d     1.758752 -0.094962 -2.035204 -0.399195 -0.756726  1.609393
    e     0.421521  1.134518 -0.809148 -0.543523 -1.161328  1.261901



In[3]:
# DataFrame.sortlevel was removed in pandas 1.0; sort_index is its replacement
result = df.sort_index(axis=1, level=2)
print(result)

Out[3]:
    loc  location1 location2 location1 location2 location1 location2
    S           S3        S3        S1        S2        S2        S1
    dist      10        70        200       340       760       1000
    a    -1.412865  0.635869  0.503357  1.290292 -0.461202  0.866237
    b    -0.198654 -1.918091 -0.904658 -1.070291 -1.190422 -0.916884
    c    -0.838693  1.154370 -1.448068 -0.131904 -0.121475  0.047861
    d    -2.035204  1.609393  1.758752 -0.756726 -0.094962 -0.399195
    e    -0.809148  1.261901  0.421521 -1.161328  1.134518 -0.543523
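
For a deep MultiIndex (the asker mentions 7 levels), spelling out every level by hand as above gets tedious. The following is a minimal sketch of one possible alternative, not the answerer's code: it converts the column index to a frame with MultiIndex.to_frame, adds a dist column, and rebuilds the index with MultiIndex.from_frame. The variable names (levels, with_dist, by_dist, cleaned) are made up for illustration, and it assumes the distances are keyed by the loc and S levels as in the example.

```python
import numpy as np
import pandas as pd

header = pd.MultiIndex.from_product(
    [['location1', 'location2'], ['S1', 'S2', 'S3']],
    names=['loc', 'S'])
df = pd.DataFrame(np.random.randn(5, 6),
                  index=['a', 'b', 'c', 'd', 'e'], columns=header)

# Distances keyed by the levels that identify a series (here: loc and S).
distances = {
    ('location1', 'S1'): 200,
    ('location1', 'S2'): 760,
    ('location1', 'S3'): 10,
    ('location2', 'S1'): 1000,
    ('location2', 'S2'): 340,
    ('location2', 'S3'): 70,
}

# One column per existing level, however many levels there are,
# so none of them has to be retyped.
levels = df.columns.to_frame(index=False)
levels['dist'] = [distances[key] for key in zip(levels['loc'], levels['S'])]

with_dist = df.copy()
with_dist.columns = pd.MultiIndex.from_frame(levels)

# The MultiIndex lives on the columns, hence axis=1.
by_dist = with_dist.sort_index(axis=1, level='dist')

# Optionally drop the helper level again once the order is fixed.
cleaned = by_dist.droplevel('dist', axis=1)
```

Note the axis=1 argument: the TypeError quoted in the question is what older pandas versions raised when the axis being sorted (the rows, by default) was not a MultiIndex, so a missing axis argument is a likely cause there.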

Answer 2

In [35]:
df.loc['dist', :] = [200, 760, 10, 1000, 340, 70]
df
Out[35]:
loc    location1                           location2
S             S1          S2          S3          S1          S2          S3
a       0.348766   -0.326088   -0.891929   -0.704856   -1.514304    0.611692
b      -0.546026   -0.111232   -1.022104   -1.246002    0.328385    0.576465
c      -0.743512   -0.362791   -0.617021   -0.859157   -0.300368    0.292980
d       0.090178    1.369648    0.171753   -0.411466    0.478654    1.814878
e      -0.380414   -1.568492   -0.432858    1.034861   -0.633563    1.403627
dist  200.000000  760.000000   10.000000 1000.000000  340.000000   70.000000


In [36]:
order = np.argsort(df.loc['dist' , :]).values
order
Out[36]:
array([2, 5, 0, 4, 1, 3], dtype=int64)

In [37]:

df.iloc[:, order]
Out[37]:
loc    location1   location2   location1   location2   location1   location2
S             S3          S3          S1          S2          S2          S1
a      -0.891929    0.611692    0.348766   -1.514304   -0.326088   -0.704856
b      -1.022104    0.576465   -0.546026    0.328385   -0.111232   -1.246002
c      -0.617021    0.292980   -0.743512   -0.300368   -0.362791   -0.859157
d       0.171753    1.814878    0.090178    0.478654    1.369648   -0.411466
e      -0.432858    1.403627   -0.380414   -0.633563   -1.568492    1.034861
dist   10.000000   70.000000  200.000000  340.000000  760.000000 1000.000000
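
Because the distances now sit in the data as an ordinary row, they would take part in any statistics computed on the frame. The asker mentions deleting the row after sorting. Here is a minimal sketch of that last step, assuming the same example frame. The names sorted_df, data_only and corr are illustrative only.

```python
import numpy as np
import pandas as pd

header = pd.MultiIndex.from_product(
    [['location1', 'location2'], ['S1', 'S2', 'S3']],
    names=['loc', 'S'])
df = pd.DataFrame(np.random.randn(5, 6),
                  index=['a', 'b', 'c', 'd', 'e'], columns=header)

# Same approach as above: store the distances as an extra row.
df.loc['dist', :] = [200, 760, 10, 1000, 340, 70]

# Column positions in ascending order of distance.
order = np.argsort(df.loc['dist', :].to_numpy())
sorted_df = df.iloc[:, order]

# Remove the helper row so it does not distort the correlation matrix.
data_only = sorted_df.drop(index='dist')
corr = data_only.corr()
```

The resulting corr matrix keeps the distance ordering on both axes, which is the layout the question wants for plt.matshow.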

If you want the dist row to be the first row in the index, you can do the following
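
The snippet that originally followed is not reproduced here. As an assumption about what such a step could look like, and not the answerer's code, one option is to reindex the rows with 'dist' placed first. The small frame below is a stand-in with made-up column names x and y.

```python
import pandas as pd

# Stand-in frame whose last row holds the distances (hypothetical data).
df = pd.DataFrame(
    {'x': [1.0, 2.0, 10.0], 'y': [3.0, 4.0, 70.0]},
    index=['a', 'b', 'dist'])

# Put 'dist' first and keep the remaining rows in their current order.
new_order = ['dist'] + [label for label in df.index if label != 'dist']
reordered = df.reindex(new_order)
```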
