Question

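Summary

Suppose you have a df.groupby(...) object g and you g.apply a function that returns a Series/DataFrame for each group in g. How do you combine these results into a single dataframe, with the group names as columns?

Details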

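I have a dataframe event_df that looks like this:

index   event   note   time
0       on      C      0.5
1       on      D      0.75
2       off     C      1.0
...

I want to sample the event of every note at the times given by t_df:

index   t
0       0
1       0.5
2       1.0
...

The result should look something like this: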

t        C         D        
0        off       off
0.5      on        off
1.0      off       on
...

What I've done so far:

import numpy as np
import pandas as pd

def get_t_note_series(notedata_row, t_arr):
    """Return the time index in the sampling that corresponds to the event."""
    # First sample index whose time is >= the event time.
    t_idx = np.argwhere(t_arr >= notedata_row['time']).flatten()[0]
    return t_idx

def get_t_for_gb(group, t_arr):
    # Map every event row of this note to its sample index.
    t_idxs = group.apply(get_t_note_series, args=(t_arr,), axis=1)
    t_idxs = t_idxs.rename('t_arr_idx')
    group_with_t = pd.concat([group, t_idxs], axis=1).set_index('t_arr_idx')
    print(group_with_t)
    return group_with_t


t_arr = np.arange(0, 10, 0.5)
t_df = pd.DataFrame({'t': t_arr}).rename_axis('t_arr_idx')
gb = event_df.groupby('note')
gb.apply(get_t_for_gb, t_arr=t_arr)

So what I get is one dataframe per note, all of the same size (the same as t_df):

t     event
0     on
0.5   off
...

t     event
0     off
0.5   on
...

How do I go from here to the desired dataframe, where each group corresponds to a column in the new dataframe and the index is t?

Answer 1

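EDIT: Sorry, in what follows I did not take into account that you rescale your time column, and I can't give a complete solution right now because I have to leave. I think you could do the rescaling with pandas.merge_asof on your two dataframes to get the nearest "rescaled" time, and then try the code below on the merged dataframe. I hope this is what you wanted.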

import io
import pandas as pd

sio = io.StringIO("""index   event   note   time
0       on      C      0.5
1       on      D      0.75
2       off     C      1.0""")
# Raw string so that \s is not treated as a string escape.
df = pd.read_csv(sio, sep=r'\s+', index_col=0)

df.groupby(['time', 'note']).agg({'event': 'first'}).unstack(-1).fillna('off')

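This takes the first row of each time/note group with agg({'event': 'first'}), then unstack moves the note level of the index into the columns, so the values of note become columns. Finally, fillna puts 'off' into every cell for which no data point was found.

This outputs: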

Out[28]: 
     event     
note     C    D
time           
0.50    on  off
0.75   off   on
1.00   off  off
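
The table above is indexed by the raw event times. The merge_asof idea from the edit note at the top of this answer could be sketched roughly as follows. This is only a sketch under assumptions: the sample grid is shortened to np.arange(0, 2, 0.5), and direction='forward' is chosen to mirror the question's "first sample time >= event time" rule (direction='nearest' would pick the nearest sample time instead).

```python
import numpy as np
import pandas as pd

# Sample rows from the question.
event_df = pd.DataFrame({
    "event": ["on", "on", "off"],
    "note": ["C", "D", "C"],
    "time": [0.5, 0.75, 1.0],
})
# Assumed, shortened sampling grid.
t_df = pd.DataFrame({"t": np.arange(0, 2, 0.5)})

# Both frames must be sorted on their keys for merge_asof.
merged = pd.merge_asof(
    event_df.sort_values("time"),
    t_df,
    left_on="time",
    right_on="t",
    direction="forward",  # first t that is >= time
)

# Same pivot as in the answer, but keyed on the sample time t.
wide = (
    merged.groupby(["t", "note"])["event"]
    .first()
    .unstack("note")
    .fillna("off")
)
print(wide)
```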

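If on/off is not unambiguous for a time/note combination (that is, there are several rows for the same time/note, some with on and some with off), you might also want to try min or max, in case you prefer one of the values (for example, if there is a single on, you want on no matter how many offs there are). If you want something like a majority vote, I would suggest adding a majority-vote column to the aggregated dataframe (before unstack()).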
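
A small sketch of both options, on made-up rows with a deliberately ambiguous group (the data is an assumption for illustration only). It relies on 'on' sorting after 'off' as a string, so max means "on wins"; the majority vote here uses Series.mode, which is one possible way to do it, and ties fall back to the alphabetically first value.

```python
import pandas as pd

# Hypothetical data: note C has conflicting events at time 0.5.
df = pd.DataFrame({
    "time": [0.5, 0.5, 0.5, 0.5, 1.0],
    "note": ["C", "C", "C", "D", "C"],
    "event": ["on", "off", "off", "on", "off"],
})

grouped = df.groupby(["time", "note"])["event"]

# 'on' > 'off' in string order, so max yields 'on' if any row is 'on'.
any_on = grouped.max().unstack("note").fillna("off")

# Majority vote: most frequent value per group (ties -> first in sorted order).
majority = (
    grouped.agg(lambda s: s.mode().iloc[0])
    .unstack("note")
    .fillna("off")
)

print(any_on)
print(majority)
```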

Answer 2

Oh, I figured it out! All I had to do was unstack the groupby result. Going back to how the groupby result is generated:

import numpy as np
import pandas as pd

def get_t_note_series(notedata_row, t_arr):
    """Return the time index in the sampling that corresponds to the event."""
    # First sample index whose time is >= the event time.
    t_idx = np.argwhere(t_arr >= notedata_row['time']).flatten()[0]
    return t_idx

def get_t_for_gb(group, t_arr):
    # Map every event row of this note to its sample index.
    t_idxs = group.apply(get_t_note_series, args=(t_arr,), axis=1)
    t_idxs = t_idxs.rename('t_arr_idx')
    group_with_t = pd.concat([group, t_idxs], axis=1).set_index('t_arr_idx')
    ## print(group_with_t) ## unnecessary!
    return group_with_t


t_arr = np.arange(0, 10, 0.5)
t_df = pd.DataFrame({'t': t_arr}).rename_axis('t_arr_idx')
gb = event_df.groupby('note')
result = gb.apply(get_t_for_gb, t_arr=t_arr)

At this point, result is a dataframe with note as the outer level of its index:

>> print(result)

          event
note  t
C     0    off
      0.5  on
      1.0  off
....
D     0    off
      0.5  off
      1.0  on
....

Calling result = result.unstack('note') does the trick:

>> result = result.unstack('note')
>> print(result)

         event
note     C      D
t
0        off    off
0.5      on     off
1.0      off    on
....
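
For reference, here is a self-contained sketch of the same unstack idea on the three sample rows from the question. The helper name index_by_sample_time, the shortened time grid, and the final reindex/ffill step (which assumes a note keeps its last state until the next event) are assumptions for illustration, not part of the original code. It also assumes every event time falls within the grid.

```python
import numpy as np
import pandas as pd

event_df = pd.DataFrame({
    "event": ["on", "on", "off"],
    "note": ["C", "D", "C"],
    "time": [0.5, 0.75, 1.0],
})
t_arr = np.arange(0, 2, 0.5)  # assumed, shortened sampling grid

def index_by_sample_time(group, t_arr):
    """Re-index one note's events by the first sample time >= event time."""
    idx = np.searchsorted(t_arr, group["time"].to_numpy(), side="left")
    return pd.DataFrame(
        {"event": group["event"].to_numpy()},
        index=pd.Index(t_arr[idx], name="t"),
    )

# One frame per note, stacked under a (note, t) MultiIndex.
result = event_df.groupby("note")[["event", "time"]].apply(
    index_by_sample_time, t_arr=t_arr
)

# Move the note level from the index into the columns.
wide = result["event"].unstack("note")

# Assumption: fill the whole grid, carrying the last state forward.
sampled = wide.reindex(pd.Index(t_arr, name="t")).ffill().fillna("off")
print(sampled)
```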

Combine pandas groupby apply results as multiple columns in a single dataframe

2 Answers: