Merging rows in a pandas DataFrame

Time: 2017-07-13 11:52:48

Tags: python pandas dataframe

I have a pandas DataFrame like this:

import pandas as pd

df = pd.DataFrame([[0,10,0,'A','A',6,7],[11,21,1,'A','A',8,9],[0,13,1,'B','B',11,13],[0,12,1,'C','C',14,15],[13,14,0,'C','C',16,18]],columns=['Start Sample','End Sample','Value','Start Name','End Name','Start Time','End Time'])

df
Out[18]: 
   Start Sample  End Sample  Value Start Name End Name  Start Time  End Time
0             0          10      0          A        A           6         7
1            11          21      1          A        A           8         9
2             0          13      1          B        B          11        13
3             0          12      1          C        C          14        15
4            13          14      0          C        C          16        18

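I want to group consecutive rows that have the same Value whenever the difference between the Start Time of row i+1 and the End Time of row i is < 3.

For example, rows 1, 2 and 3 are consecutive rows with the same value.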

df['Start Time'].iloc[2] - df['End Time'].iloc[1]  # = 2
df['Start Time'].iloc[3] - df['End Time'].iloc[2]  # = 1

So they should all be merged. I want the rows to become:

df2
Out[25]: 
   Start Sample  End Sample  Value Start Name End Name  Start Time  End Time
0             0          10      0          A        A           6         7
1            11          12      1          A        C           8        15
2            13          14      0          C        C          16        18

Note that the newly merged row should have:

1) Start Sample = to the Start Sample of the first row merged
2) End Sample = to the End Sample of the last row merged
3) Value = to the common value
4) Start Name = to the Start Name of the first row merged
5) End Name = to the End Name of the last row merged
6) Start Time = to the Start Time of the first row merged
7) End Time = to the End Time of the last row merged

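2 Answers:

Answer 0: (score: 2)

First some code, then some explanation. The approach here is to split the data into subsets based on your "Value" column and work on those sub-DataFrames.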

def agg(series):
    if series.name.startswith('Start'):
        return series.iloc[0]
    return series.iloc[-1]

subsets = [subset.apply(agg) for _, subset in 
             df.groupby((df['Value']!=df['Value'].shift(1)).cumsum())]

pd.concat(subsets, axis=1).T

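The "tricky" part is (df['Value']!=df['Value'].shift(1)).cumsum(). The comparison finds where "Value" changes from one row to the next. We want to group on those changes, so cumsum() is applied first to turn the change flags into a unique label for each run of consecutive equal values.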
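To make the shift/cumsum step concrete, here is a small sketch on a standalone Series holding the same "Value" column as the question. The names values, changed and run_id are illustrative only.

```python
import pandas as pd

# The "Value" column from the question, on its own.
values = pd.Series([0, 1, 1, 1, 0], name='Value')

# True wherever a row differs from the previous one. The first row is
# compared against NaN, so it is always flagged as a change.
changed = values != values.shift(1)

# The running total of the change flags labels each consecutive run.
run_id = changed.cumsum()

print(changed.tolist())  # [True, True, False, False, True]
print(run_id.tolist())   # [1, 2, 2, 2, 3]
```

Rows 1, 2 and 3 share the label 2, so a groupby on run_id puts them in the same group.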

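After the groupby, you are iterating over the subsets of the DataFrame that you are interested in. From here you can do many different things with each subset, which is what makes this approach flexible.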

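For each subset, apply calls the function once per Series (column). In your case you want one of two values depending on the column name, so a single function (here agg) can be applied to every Series.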
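As a quick illustration of how apply hands each column to agg, here is a sketch on a hypothetical two-column subset (the values are the time columns of rows 1 to 3 in the question). Each column arrives as a Series whose name attribute is the column label.

```python
import pandas as pd

# Hypothetical subset: the time columns of the three rows to be merged.
subset = pd.DataFrame({'Start Time': [8, 11, 14], 'End Time': [9, 13, 15]})

def agg(series):
    # "Start ..." columns keep the first row's value,
    # every other column keeps the last row's value.
    if series.name.startswith('Start'):
        return series.iloc[0]
    return series.iloc[-1]

# With the default axis=0, apply calls agg once per column.
result = subset.apply(agg)

print(result.index.tolist())  # ['Start Time', 'End Time']
print(result.tolist())        # [8, 15]
```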

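Edit: The change test above covers only one of the two criteria the OP specified. Including both is easy, but it expands the logic, so it should be broken up a bit. I was already pushing the limits of an unreasonable one-liner. So the groupby condition should be: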

val_chg = df['Value'] != df['Value'].shift(1)
time_chg = df['Start Time']-df['End Time'].shift(1) >=3

df.groupby((val_chg | time_chg).cumsum())
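Putting both conditions together, a minimal end-to-end sketch might look like this. It assumes the df from the question, and it replaces the per-subset apply with groupby(...).agg and a first/last mapping, which is a substitution rather than the code in this answer. The names group_id, how and merged are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    [[0, 10, 0, 'A', 'A', 6, 7],
     [11, 21, 1, 'A', 'A', 8, 9],
     [0, 13, 1, 'B', 'B', 11, 13],
     [0, 12, 1, 'C', 'C', 14, 15],
     [13, 14, 0, 'C', 'C', 16, 18]],
    columns=['Start Sample', 'End Sample', 'Value', 'Start Name',
             'End Name', 'Start Time', 'End Time'])

# A new group starts when the value changes or the gap to the
# previous row's End Time is 3 or more.
val_chg = df['Value'] != df['Value'].shift(1)
time_chg = df['Start Time'] - df['End Time'].shift(1) >= 3
group_id = (val_chg | time_chg).cumsum()

# "Start ..." columns take the first row of each group, all others the last.
how = {col: 'first' if col.startswith('Start') else 'last'
       for col in df.columns}

merged = df.groupby(group_id).agg(how).reset_index(drop=True)

print(merged)
# Rows of merged:
#   0, 10, 0, A, A,  6,  7
#  11, 12, 1, A, C,  8, 15
#  13, 14, 0, C, C, 16, 18
```

Unlike pd.concat(subsets, axis=1).T, this keeps the original column dtypes, because each column is aggregated separately.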

Answer 1: (score: 0)

There may be a better way, but here is an iterrows() approach:

import pandas as pd

df = pd.DataFrame([[0,10,0,'A','A',6,7],[11,21,1,'A','A',8,9],[0,13,1,'B','B',11,13],[0,12,1,'C','C',14,15],[13,14,0,'C','C',16,18]],columns=['Start Sample','End Sample','Value','Start Name','End Name','Start Time','End Time'])
df['keep'] = ''

active_row = None

for i, row in df.iterrows():
    if active_row is None:
        active_row = i
        df.loc[i,'keep'] = 1
        continue

    if row['Value'] != df.loc[active_row,'Value']:
        active_row = i
        df.loc[i,'keep'] = 1
        continue
    elif row['Start Time'] - df.loc[active_row,'End Time'] >= 3:
        active_row = i
        df.loc[i,'keep'] = 1
        continue

    df.loc[active_row,'End Time'] = row['End Time']
    df.loc[active_row,'End Sample'] = row['End Sample']
    df.loc[active_row,'End Name'] = row['End Name']
    df.loc[i,'keep'] = 0

final_df=df[df.keep == 1].drop('keep',axis=1)

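It loops over the rows, remembering the last row that started a group (active_row) and updating that row's End columns as later rows are merged into it. Each iteration marks the current row as keep (1) or discard (0), and that flag is then used to filter the rows manually.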