I have a pandas DataFrame like this:
import pandas as pd
df = pd.DataFrame([[0,10,0,'A','A',6,7],[11,21,1,'A','A',8,9],[0,13,1,'B','B',11,13],[0,12,1,'C','C',14,15],[13,14,0,'C','C',16,18]],columns=['Start Sample','End Sample','Value','Start Name','End Name','Start Time','End Time'])
df
Out[18]:
   Start Sample  End Sample  Value Start Name End Name  Start Time  End Time
0             0          10      0          A        A           6         7
1            11          21      1          A        A           8         9
2             0          13      1          B        B          11        13
3             0          12      1          C        C          14        15
4            13          14      0          C        C          16        18
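I want to merge consecutive rows that have the same Value whenever the difference between the Start Time of row i+1 and the End Time of row i is less than 3.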
For example, rows 1, 2 and 3 are consecutive rows with the same value, and
df['Start Time'].iloc[2] - df['End Time'].iloc[1] is 2
df['Start Time'].iloc[3] - df['End Time'].iloc[2] is 1
so they should all be merged. I would like the rows to become:
df2
Out[25]:
   Start Sample  End Sample  Value Start Name End Name  Start Time  End Time
0             0          10      0          A        A           6         7
1            11          12      1          A        C           8        15
2            13          14      0          C        C          16        18
Note that the newly merged row should have:
1) Start Sample = the Start Sample of the first row merged
2) End Sample = the End Sample of the last row merged
3) Value = the common value
4) Start Name = the Start Name of the first row merged
5) End Name = the End Name of the last row merged
6) Start Time = the Start Time of the first row merged
7) End Time = the End Time of the last row merged
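Read as a rule of thumb, the list says that "Start ..." columns come from the first merged row and every other column from the last. A sketch of that mapping follows, assuming pandas' built-in "first" and "last" groupby aggregations and a hand-picked block of rows (rows 1 to 3 above). The names rules, block, key and merged are illustrative only.

```python
import pandas as pd

columns = ['Start Sample', 'End Sample', 'Value', 'Start Name',
           'End Name', 'Start Time', 'End Time']

# "Start ..." columns keep the first row's value, all others the last row's.
rules = {col: ('first' if col.startswith('Start') else 'last') for col in columns}

# Rows 1 to 3 of the example frame, i.e. one block that should be merged.
block = pd.DataFrame([[11, 21, 1, 'A', 'A', 8, 9],
                      [0, 13, 1, 'B', 'B', 11, 13],
                      [0, 12, 1, 'C', 'C', 14, 15]], columns=columns)

# A constant key puts every row of the block into a single group.
key = pd.Series(0, index=block.index)
merged = block.groupby(key).agg(rules)
print(merged)
```

How to find the blocks in the first place is the open part of the question.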
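Answer 0 (score: 2)
Some code first, then some explanation. The approach is to split the frame into subsets based on runs of your "Value" column and then work on each of those sub-dataframes.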
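def agg(series):
    # "Start ..." columns take the value from the first row of the subset;
    # every other column takes the value from the last row.
    if series.name.startswith('Start'):
        return series.iloc[0]
    return series.iloc[-1]

subsets = [subset.apply(agg) for _, subset in
           df.groupby((df['Value'] != df['Value'].shift(1)).cumsum())]
pd.concat(subsets, axis=1).T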
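The "tricky" part is (df['Value'] != df['Value'].shift(1)).cumsum(). The comparison flags every row where "Value" differs from the previous row, and cumsum() turns those flags into a label that is unique to each run of equal values. That label is what we group on.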
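To see what that grouping key looks like, here is a minimal sketch on a bare Series holding the same values as the "Value" column in the question. The names changed and group_id are only illustrative.

```python
import pandas as pd

values = pd.Series([0, 1, 1, 1, 0], name='Value')

# True wherever the value differs from the previous row. The first row is
# always True because shift(1) puts NaN there, and NaN is unequal to anything.
changed = values != values.shift(1)

# A running total of the flags gives every run of equal values its own label.
group_id = changed.cumsum()

print(changed.tolist())   # [True, True, False, False, True]
print(group_id.tolist())  # [1, 2, 2, 2, 3]
```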
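After the groupby, you iterate over the subsets of the data frame you are interested in. From there you can do many things, which is what makes this approach flexible.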
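For each subset, apply calls the function on every series (column). In your case you want one of two values depending on the column name, so a single function (here agg) can be applied to every series.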
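As a small illustration of that step, the sketch below applies agg to one hand-built subset (rows 1 to 3 of the question's frame). The names subset and merged exist only for this example.

```python
import pandas as pd

def agg(series):
    # First value for "Start ..." columns, last value for everything else.
    if series.name.startswith('Start'):
        return series.iloc[0]
    return series.iloc[-1]

subset = pd.DataFrame(
    [[11, 21, 1, 'A', 'A', 8, 9],
     [0, 13, 1, 'B', 'B', 11, 13],
     [0, 12, 1, 'C', 'C', 14, 15]],
    columns=['Start Sample', 'End Sample', 'Value', 'Start Name',
             'End Name', 'Start Time', 'End Time'])

# DataFrame.apply passes each column to agg as a Series named after the column,
# so the result is one Series holding the merged row.
merged = subset.apply(agg)
print(merged)
```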
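Edit: The change test above covers only one of the two criteria the OP specified. Including both is easy, but it extends the logic, so it should be broken up a bit; I was already pushing the limits of a reasonable one-liner. The groupby condition should be: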
val_chg = df['Value'] != df['Value'].shift(1)
time_chg = df['Start Time']-df['End Time'].shift(1) >=3
df.groupby((val_chg | time_chg).cumsum())
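Putting the pieces together, a sketch of the full pipeline with both criteria might look like this. It reuses the question's data and the agg function from this answer; group_id and result are names made up here. Note that the concat/transpose step leaves every column with object dtype.

```python
import pandas as pd

df = pd.DataFrame(
    [[0, 10, 0, 'A', 'A', 6, 7], [11, 21, 1, 'A', 'A', 8, 9],
     [0, 13, 1, 'B', 'B', 11, 13], [0, 12, 1, 'C', 'C', 14, 15],
     [13, 14, 0, 'C', 'C', 16, 18]],
    columns=['Start Sample', 'End Sample', 'Value', 'Start Name',
             'End Name', 'Start Time', 'End Time'])

def agg(series):
    # First value for "Start ..." columns, last value for everything else.
    if series.name.startswith('Start'):
        return series.iloc[0]
    return series.iloc[-1]

# A new group starts when the value changes OR the time gap is 3 or more.
val_chg = df['Value'] != df['Value'].shift(1)
time_chg = (df['Start Time'] - df['End Time'].shift(1)) >= 3
group_id = (val_chg | time_chg).cumsum()

# One merged row (a Series) per group, stacked back into a frame.
result = pd.concat([g.apply(agg) for _, g in df.groupby(group_id)], axis=1).T
print(result)
```

On the example data every time gap is below 3, so only the value change splits the groups.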
Answer 1 (score: 0)
There may be a better way, but here is an iterrows() approach:
import pandas as pd
df = pd.DataFrame([[0,10,0,'A','A',6,7],[11,21,1,'A','A',8,9],[0,13,1,'B','B',11,13],[0,12,1,'C','C',14,15],[13,14,0,'C','C',16,18]],columns=['Start Sample','End Sample','Value','Start Name','End Name','Start Time','End Time'])
df['keep'] = ''
active_row = None  # index of the row that later rows are merged into
for i, row in df.iterrows():
    if active_row is None:
        # The first row always starts a new block.
        active_row = i
        df.loc[i,'keep'] = 1
        continue
    if row['Value'] != df.loc[active_row,'Value']:
        # The value changed: start a new block.
        active_row = i
        df.loc[i,'keep'] = 1
        continue
    elif row['Start Time'] - df.loc[active_row,'End Time'] >= 3:
        # The time gap is too large: start a new block.
        active_row = i
        df.loc[i,'keep'] = 1
        continue
    # Otherwise fold this row into the active row and mark it for removal.
    df.loc[active_row,'End Time'] = row['End Time']
    df.loc[active_row,'End Sample'] = row['End Sample']
    df.loc[active_row,'End Name'] = row['End Name']
    df.loc[i,'keep'] = 0
final_df = df[df.keep == 1].drop('keep',axis=1)
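It iterates over the rows, remembers the last meaningful row and updates it during the loop. Each iteration marks a row as keep (1) or don't keep (0), and that flag is then used to filter the rows manually.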
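The same "remember the active row" idea can also be written without mutating df or adding a helper column. The following is a sketch, not the answerer's code: merge_rows and max_gap are hypothetical names, and the default threshold of 3 is taken from the loop above.

```python
import pandas as pd

def merge_rows(df, max_gap=3):
    """Merge consecutive rows with equal Value and a time gap below max_gap."""
    merged = []  # one dict per output row; the last entry is the active row
    for row in df.to_dict('records'):
        last = merged[-1] if merged else None
        if (last is not None
                and row['Value'] == last['Value']
                and row['Start Time'] - last['End Time'] < max_gap):
            # Fold this row into the active row: only "End ..." columns change.
            for col in ('End Sample', 'End Name', 'End Time'):
                last[col] = row[col]
        else:
            # Value changed or gap too large: start a new active row.
            merged.append(dict(row))
    return pd.DataFrame(merged, columns=df.columns)

df = pd.DataFrame(
    [[0, 10, 0, 'A', 'A', 6, 7], [11, 21, 1, 'A', 'A', 8, 9],
     [0, 13, 1, 'B', 'B', 11, 13], [0, 12, 1, 'C', 'C', 14, 15],
     [13, 14, 0, 'C', 'C', 16, 18]],
    columns=['Start Sample', 'End Sample', 'Value', 'Start Name',
             'End Name', 'Start Time', 'End Time'])

final_df = merge_rows(df)
print(final_df)
```

Building a list of plain dicts avoids repeated df.loc writes inside the loop and leaves the input frame unchanged.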