Pandas MultiIndex DataFrame: referencing index values in column calculations

Time: 2018-02-25 20:06:48

Tags: python pandas dataframe multi-index

I would like to efficiently use the values in a DataFrame's MultiIndex in some calculations. For example, starting from:
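import random
import numpy as np
import pandas as pd

np.random.seed(456)  # seeds NumPy only; random.sample below is not seeded by this call
j = [(a, b) for a in ['A','B','C'] for b in random.sample(pd.date_range('2017-01-01', periods=50, freq='W').tolist(), 5)]
i = pd.MultiIndex.from_tuples(j, names=['Name','Num'])
df = pd.DataFrame(np.random.randn(15), i, columns=['Vals'])
# Per-group minimum of the 'Num' index level, broadcast back to every row
df['SmallestNum'] = df.reset_index(level=1).groupby('Name')['Num'].transform('min').values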

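Suppose I want to compute a new column Diff = Num - SmallestNum. One approach that works, but that I consider kludgy, is to copy the index level I want to reference into a real column and then take the difference: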

df['NumCol'] = df.index.get_level_values(1)
df['Diff'] = df['NumCol'] - df['SmallestNum']

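But doing it this way makes me feel that I still don't understand the proper way to work with DataFrames. I would expect the "right" solution to look something like one of the following, neither of which creates and stores a full copy of the index values: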

df['Diff'] = df.transform(lambda x: x.index.get_level_values(1) - x['SmallestNum'])
df['Diff'] = df.reset_index(level=1).apply(lambda x: x['Num'] - x['SmallestNum'])

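...but not only does neither of these expressions work*, my understanding is that DataFrame operations such as .transform and .apply are necessarily significantly slower than operations that act explicitly on "vectorized" row references.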
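A minimal sketch of the behaviour described above, on a hypothetical toy frame with integer Num values (the data and the names flat, row_wise and vectorized are assumptions made for illustration). It suggests why the second expression fails as written: without axis=1, apply passes whole columns to the lambda, so x['Num'] is looked up as a row label. With axis=1 it runs, but the lambda is called once per row, whereas the plain column subtraction is a single array operation.

```python
import pandas as pd

# Hypothetical toy frame with integer 'Num' values in the second index level.
idx = pd.MultiIndex.from_tuples(
    [('A', 28), ('A', 44), ('B', 38), ('B', 80)], names=['Name', 'Num'])
df = pd.DataFrame({'SmallestNum': [28, 28, 38, 38]}, index=idx)

flat = df.reset_index(level=1)  # 'Num' becomes an ordinary column

# Default axis=0 hands each *column* to the lambda, so x['Num'] is a
# label lookup in an index of names ('A', 'B') and raises KeyError.
try:
    flat.apply(lambda x: x['Num'] - x['SmallestNum'])
    column_wise_failed = False
except KeyError:
    column_wise_failed = True

# axis=1 hands each *row* to the lambda: it works, but runs Python code per row.
row_wise = flat.apply(lambda x: x['Num'] - x['SmallestNum'], axis=1)

# Vectorized equivalent: one array operation over whole columns.
vectorized = flat['Num'] - flat['SmallestNum']
```

Both row_wise and vectorized are indexed by Name only, so writing either back to the MultiIndexed df would most safely go through .values.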

So what is the "correct and efficient" way to write the calculation for the new Diff column in this example?

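* Update: This question was complicated by the fact (possibly a bug) that non-unique values in index level 1 cause the formula to fail, although it works when the index values are unique. Fortunately, the workaround included in jezrael's answer appears to be just as efficient as the explicit vectorized calculation.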

1 Answer:

Answer 0 (score: 1)

I think you simply need to subtract:

df['Diff'] = df.index.get_level_values(1) - df['SmallestNum']
print (df)
              Vals  SmallestNum  Diff
Name Num                             
A    28   1.180140           28     0
     44   0.984257           28    16
     90   1.835646           28    62
     43  -1.886823           28    15
     29   0.424763           28     1
B    80  -0.433105           38    42
     61  -0.166838           38    23
     46   0.754634           38     8
     38   1.966975           38     0
     93   0.200671           38    55
C    40   0.742752           12    28
     82  -1.264271           12    70
     12  -0.112787           12     0
     78   0.667358           12    66
     70   0.357900           12    58

EDIT: If the second level is a non-unique DatetimeIndex, subtract the NumPy arrays created by values instead.

Another solution:

np.random.seed(456)
a = pd.date_range('2015-01-01', periods=6).values
j = [['A'] * 5 + ['B'] * 5 + ['C'] * 5, pd.to_datetime(np.random.choice(a, size=15))]
i = pd.MultiIndex.from_arrays(j, names=['Name','Num'])
df = pd.DataFrame(np.random.randn(15), i, columns=['Vals'])
df['SmallestNum'] = df.reset_index(level=1).groupby('Name')['Num'].transform('min').values
df['Diff'] = df.index.get_level_values(1).values - df['SmallestNum'].values
print (df)
                     Vals SmallestNum   Diff
Name Num                                    
A    2015-01-04 -1.842419  2015-01-02 2 days
     2015-01-06 -0.786788  2015-01-02 4 days
     2015-01-04  1.180140  2015-01-02 2 days
     2015-01-02  0.984257  2015-01-02 0 days
     2015-01-03  1.835646  2015-01-02 1 days
B    2015-01-05 -1.886823  2015-01-03 2 days
     2015-01-03  0.424763  2015-01-03 0 days
     2015-01-05 -0.433105  2015-01-03 2 days
     2015-01-06 -0.166838  2015-01-03 3 days
     2015-01-05  0.754634  2015-01-03 2 days
C    2015-01-06  1.966975  2015-01-02 4 days
     2015-01-06  0.200671  2015-01-02 4 days
     2015-01-05  0.742752  2015-01-02 3 days
     2015-01-02 -1.264271  2015-01-02 0 days
     2015-01-04 -0.112787  2015-01-02 2 days
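
The positional subtraction used above can be checked on a small hand-built frame. This is only a sketch under stated assumptions: the dates and the Vals numbers are hypothetical toy data, chosen so that the second index level repeats. Because both operands are raw NumPy arrays, no index alignment takes place, so duplicate index values should not affect the result.

```python
import pandas as pd

# Hypothetical toy data: the second level ('Num') contains repeated dates.
idx = pd.MultiIndex.from_arrays(
    [['A', 'A', 'A', 'B', 'B'],
     pd.to_datetime(['2015-01-04', '2015-01-02', '2015-01-04',
                     '2015-01-03', '2015-01-05'])],
    names=['Name', 'Num'])
df = pd.DataFrame({'Vals': [1.0, 2.0, 3.0, 4.0, 5.0]}, index=idx)

# Per-group minimum of the 'Num' level, broadcast back to every row.
df['SmallestNum'] = (df.reset_index(level=1)
                       .groupby('Name')['Num']
                       .transform('min')
                       .values)

# Subtract array from array: purely positional, no label alignment.
df['Diff'] = df.index.get_level_values(1).values - df['SmallestNum'].values

# Whole days elapsed since each group's earliest date.
diff_days = df['Diff'].dt.days.tolist()
```

The design choice here is to drop to .values on both sides so the result depends only on row order, which matches the order of df itself.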