Calculate difference for subset rows in python

Time: 2018-09-22 23:01:53

Tags: python subset

I have the following pandas DataFrame:

Variable_1  Variable_2  Variable_3  Target
G           M           I           230
G           M           I           231
G           M           I           233
G           M           I           231
G           M           I           230
G           M           I           214
G           M           L           211
G           M           L           212
G           M           L           123
G           M           L           345
G           N           J           32
G           N           J           123
G           N           J           234
G           N           O           2345
G           N           O           432
G           N           O           455
G           N           O           543
G           N           O           333

Let's consider only Variable_3. For each category of Variable_3, I want to compare the last value of Target in that category against the first value of Target. For example:

  • When Variable_3 equals "I", I compare 214 (the last value) against 230 (the first value). If the last value is greater than the first, I create a new field called "Output" equal to 1; otherwise "Output" equals -1.

From the example above, I would like my resulting dataset to look like this:

Variable_1  Variable_2  Variable_3  Target  Output
G           M           I           230     -1
G           M           I           231     -1
G           M           I           233     -1
G           M           I           231     -1
G           M           I           230     -1
G           M           I           214     -1
G           M           L           211     1
G           M           L           212     1
G           M           L           123     1
G           M           L           345     1
G           N           J           32      1
G           N           J           123     1
G           N           J           234     1
G           N           O           2345    -1
G           N           O           432     -1
G           N           O           455     -1
G           N           O           543     -1
G           N           O           333     -1

2 answers:

Answer 0 (score: 1)

Group the data by Variable_3 and find the first and last Target in each group. Compare them:

groups = df.groupby('Variable_3')['Target']
output = groups.first() > groups.last()

Using Variable_3 as the index, join the output back onto the original data frame:

df = df.set_index('Variable_3').join(output, rsuffix='_r').reset_index()

Convert the booleans to 1 and -1 (True, meaning first > last, becomes -1):

import numpy as np
df['Target_r'] = np.where(df['Target_r'], -1, 1)

Finally, rename the new column:

df.rename(columns={'Target_r' : 'Output'}, inplace=True)
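The steps above can be tried end to end on the sample data from the question. This is only a sketch: the `groups_spec` and `first_gt_last` names are made up for illustration, and it swaps the join/rename steps for `Series.map`, which looks up each row's group result directly and leaves the original column order untouched.

```python
import numpy as np
import pandas as pd

# Sample data from the question: (Variable_2, Variable_3, Target values)
groups_spec = [
    ("M", "I", [230, 231, 233, 231, 230, 214]),
    ("M", "L", [211, 212, 123, 345]),
    ("N", "J", [32, 123, 234]),
    ("N", "O", [2345, 432, 455, 543, 333]),
]
rows = [("G", v2, v3, t) for v2, v3, targets in groups_spec for t in targets]
df = pd.DataFrame(rows, columns=["Variable_1", "Variable_2", "Variable_3", "Target"])

groups = df.groupby("Variable_3")["Target"]
# Boolean Series indexed by Variable_3: True where the first Target exceeds the last
first_gt_last = groups.first() > groups.last()

# Look up each row's group result, then turn True/False into -1/1
df["Output"] = np.where(df["Variable_3"].map(first_gt_last), -1, 1)

print(df)
```

Note that with this comparison a group whose first and last values are equal gets 1, whereas the question's wording ("otherwise") would give it -1; the sample data contains no such tie.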

Answer 1 (score: 1)

Try:

# 1 if the group's last Target is greater than its first, otherwise -1
df['Output'] = df.groupby('Variable_3')['Target']\
                 .transform(lambda x: 1 if x.iloc[-1] > x.iloc[0] else -1)