I have the following DataFrame:
import numpy as np
import pandas as pd

arrays = [np.array(['1', '1', '1', '2', '2', '2', '3', '3', '3', '4', '4', '4']),
          np.array(['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'])]
df = pd.DataFrame(np.random.randn(12, 3), index=arrays, columns=['Column1', 'Column2', 'Column3'])
df.index.names = ['Index1', 'Index2']
It looks like this:
                Column1   Column2   Column3
Index1 Index2
1      A      -0.218251  1.744845 -0.241300
       B       1.107614 -0.059469  0.952544
       C       0.203066  0.412727  0.057129
2      A       0.432153  0.568879 -1.014900
       B      -0.713515 -0.790029  1.530333
       C       0.547787 -0.161020  0.078548
3      A       0.425833 -0.316999 -0.516260
       B       0.980780  0.844847  1.097464
       C      -1.724548  0.199910  0.961234
4      A       0.130533 -1.249353 -0.848859
       B      -0.674836  1.404397  1.258285
       C       0.741651  1.578671 -1.411311
What I want to do is split/apply/combine and return a DataFrame that looks like this:
                Column1   Column2   Column3
Index1 Index2
1      B       1.107614 -0.059469  0.952544
       C       0.203066  0.412727  0.057129
2      B      -0.713515 -0.790029  1.530333
       C       0.547787 -0.161020  0.078548
3      A       0.425833 -0.316999 -0.516260
       B       0.980780  0.844847  1.097464
4      A       0.130533 -1.249353 -0.848859
       B      -0.674836  1.404397  1.258285
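What happens here is that at time 1 (Index1 = 1) the two largest of A/B/C are picked based on Column1 (in this case B and C), and only those two are kept for times 1 and 2.
Then at time 3 the two largest of A/B/C are picked again based on Column1 (this time A and B), and those two are kept for times 3 and 4.
Is there a way to do this using groupby, nlargest, and any other functions? Is a custom function required?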
Answer 0 (score: 2)
I would use loc:
def f(gt):
    n, d = gt
    midx = d.index.remove_unused_levels()
    xidx = d.loc[midx.levels[0][0], 'Column1'].nlargest(2).index
    return [(lv, mx) for lv in midx.levels[0] for mx in xidx]

g = pd.factorize(df.index.get_level_values(0))[0] // 2
grp = df.groupby(g)
df.loc[sum(map(f, grp), [])]
                Column1   Column2   Column3
Index1 Index2
1      B       1.107614 -0.059469  0.952544
       C       0.203066  0.412727  0.057129
2      B      -0.713515 -0.790029  1.530333
       C       0.547787 -0.161020  0.078548
3      B       0.980780  0.844847  1.097464
       A       0.425833 -0.316999 -0.516260
4      B      -0.674836  1.404397  1.258285
       A       0.130533 -1.249353 -0.848859
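The grouping key in the code above is built with pd.factorize followed by integer division by 2. As a small illustration of why that pairs consecutive Index1 labels, here is a sketch on the first-level labels alone; the variable names level0, codes, uniques and pair_key are introduced only for this example.

```python
import numpy as np
import pandas as pd

# Same first-level labels as in the question: each label appears three times.
level0 = np.array(['1', '1', '1', '2', '2', '2', '3', '3', '3', '4', '4', '4'])

# factorize numbers the labels 0, 1, 2, ... in order of first appearance.
codes, uniques = pd.factorize(level0)

# Integer division by 2 puts every two consecutive labels into one bucket.
pair_key = codes // 2

print(codes.tolist())     # [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]
print(pair_key.tolist())  # [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
```

Labels '1' and '2' land in bucket 0 and labels '3' and '4' in bucket 1, which is the pairing the question asks for. This relies on the labels already appearing in the desired order.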
def f(gt):
    # When iterating through the group by object
    # we will get tuples like (name_of_group, dataframe_slice)
    n, d = gt
    # A MultiIndex after slicing will have level values that
    # will get in the way of the things I'm doing. So I remove them
    midx = d.index.remove_unused_levels()
    # I `loc` on the first value of the first level. This removes
    # the first level for the resulting slice.
    # When I use nlargest, the resulting index will only be a ref
    # to the index values without the first level.
    xidx = d.loc[midx.levels[0][0], 'Column1'].nlargest(2).index
    # Then I return a list of tuples to stitch all values from the
    # first level to those values from the largest ones from the
    # first group.
    return [(lv, mx) for lv in midx.levels[0] for mx in xidx]

# Using factorize here to group the entire data frame into pairs
# by that first level
g = pd.factorize(df.index.get_level_values(0))[0] // 2
grp = df.groupby(g)
# The summation concatenates all the lists of tuples into one list
df.loc[sum(map(f, grp), [])]
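For comparison, here is a minimal self-contained sketch of the same idea written with a boolean filter instead of a list of index tuples. It is not the answer's method, only a variant under the same assumptions (labels already ordered, pairs of consecutive Index1 values). The helper name keep_top_of_first_period and its parameter n are hypothetical, and the data are the fixed numbers printed in the question rather than fresh random values, so the result can be checked.

```python
import pandas as pd

# Fixed values copied from the table in the question.
values = [
    [-0.218251, 1.744845, -0.241300], [1.107614, -0.059469, 0.952544],
    [0.203066, 0.412727, 0.057129], [0.432153, 0.568879, -1.014900],
    [-0.713515, -0.790029, 1.530333], [0.547787, -0.161020, 0.078548],
    [0.425833, -0.316999, -0.516260], [0.980780, 0.844847, 1.097464],
    [-1.724548, 0.199910, 0.961234], [0.130533, -1.249353, -0.848859],
    [-0.674836, 1.404397, 1.258285], [0.741651, 1.578671, -1.411311],
]
index = pd.MultiIndex.from_product(
    [['1', '2', '3', '4'], ['A', 'B', 'C']], names=['Index1', 'Index2'])
df = pd.DataFrame(values, index=index, columns=['Column1', 'Column2', 'Column3'])


def keep_top_of_first_period(block, n=2):
    """Keep, in every period of the block, the Index2 labels that have
    the n largest Column1 values in the block's first period."""
    first_label = block.index.get_level_values('Index1')[0]
    top = block.xs(first_label, level='Index1')['Column1'].nlargest(n).index
    return block[block.index.get_level_values('Index2').isin(top)]


# Pair consecutive Index1 labels: '1' with '2', '3' with '4'.
pair_key = pd.factorize(df.index.get_level_values('Index1'))[0] // 2

result = pd.concat(
    [keep_top_of_first_period(block) for _, block in df.groupby(pair_key)])
print(result)
```

Because the rows are selected with a boolean mask, they stay in their original order (A before B for periods 3 and 4), whereas selecting with a list of tuples returns them in nlargest order.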