Question

I have a DataFrame with the variables x and y, indexed by ID, date, and time. I want to create a new variable by applying a predefined function.

For example, the function could be:

def some_function(x1, x2 , y1, y2):
    z = x1*x2 + y1*y2
    return z

The real function is more complex.

Note: the function should be applied to each ID separately.

Data illustration:

ID  date        time    x   y
1   08/27/2019  18:00   1   2
                19:00   3   4
                20:00   ..  ..
                21:00   ..  ..
2   08/28/2019  18:00   ..  ..
                19:00   ..  ..
                19:31   ..  ..
                19:32   ..  ..
                19:34   ..  ..

For example, the first row of the new variable should be 0 because there is no previous row, and the second row should be 3 * 1 + 4 * 2 = 11.

Answer 1

You can do this with shift:

df_shifted = df[['x', 'y']].shift(1).fillna(0)
df['new_col'] = df['x']*df_shifted['x'] + df['y']*df_shifted['y']

On some sample data, the output looks like this:

import pandas as pd

df = pd.DataFrame(dict(
        ID=[1, 1, 2, 3, 3],
        time=['02:37', '05:28', '09:01', '10:05', '10:52'],
        x=[1, 3, 4, 7, 1],
        y=[2, 4, 3, 2, 6]
    )
)

# shift only the numeric columns; the first row has no predecessor and becomes 0
df_shifted = df[['x', 'y']].shift(1).fillna(0)
df['new_col'] = df['x']*df_shifted['x'] + df['y']*df_shifted['y']
df

Out[474]: 
   ID   time  x  y  new_col
0   1  02:37  1  2      0.0
1   1  05:28  3  4     11.0
2   2  09:01  4  3     24.0
3   3  10:05  7  2     34.0
4   3  10:52  1  6     19.0

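So this mixes rows from different IDs: the last row of ID 1 is used to compute the value for the first row of ID 2. If you don't want that, you need to use groupby like this: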

# make sure the dataframe is sorted
df.sort_values(['ID', 'time'], inplace=True)

# define a function that receives the sub-dataframe
# belonging to a single ID
def calculate(sub_df):
    sub_df = sub_df.copy()  # do not modify the group in place
    df_shifted = sub_df[['x', 'y']].shift(1).fillna(0)
    sub_df['new_col'] = sub_df['x']*df_shifted['x'] + sub_df['y']*df_shifted['y']
    return sub_df

# group_keys=False keeps the original flat row index in the result
df.groupby('ID', group_keys=False).apply(calculate)

On the same data as above, the output looks like this:

Out[472]: 
   ID   time  x  y  new_col
0   1  02:37  1  2      0.0
1   1  05:28  3  4     11.0
2   2  09:01  4  3      0.0
3   3  10:05  7  2      0.0
4   3  10:52  1  6     19.0

As you can see, the first entry of each group is now 0.0. Rows from different IDs are no longer mixed.

Answer 2

You can try:

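def myfunc(d):
    # multiply each row by the previous row of the same group;
    # the first row of a group has no predecessor, so it becomes 0
    return (d['x'].mul(d['x'].shift()) + d['y'].mul(d['y'].shift())).fillna(0)

# group_keys=False keeps the original row index, so the result aligns with df
df['new_col'] = df.groupby('ID', group_keys=False).apply(myfunc)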

Answer 3

Assuming the index is numeric,

(df.join(df.groupby('ID')[['x','y']].shift(),lsuffix='1',rsuffix='2')
   .apply(lambda x:some_function(x.x1,x.x2,x.y1,x.y2),axis=1))
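For the layout described in the question, where ID, date and time are index levels rather than columns, something along these lines might work. This is only a sketch: the x and y values for ID 2 are invented (the question elides them), the column name `z` is a placeholder, and it assumes the real function accepts whole Series instead of scalars. If it does not, the row-wise `apply(..., axis=1)` above is the fallback.

```python
import pandas as pd

def some_function(x1, x2, y1, y2):
    # the toy function from the question; works on scalars and on Series
    return x1 * x2 + y1 * y2

# Index levels as in the question; values for ID 2 are made up for illustration
index = pd.MultiIndex.from_tuples(
    [
        (1, "08/27/2019", "18:00"),
        (1, "08/27/2019", "19:00"),
        (2, "08/28/2019", "18:00"),
        (2, "08/28/2019", "19:00"),
    ],
    names=["ID", "date", "time"],
)
df = pd.DataFrame({"x": [1, 3, 5, 2], "y": [2, 4, 1, 3]}, index=index)

# previous row within each ID; the first row of an ID has none, so use 0
prev = df.groupby(level="ID")[["x", "y"]].shift().fillna(0)

df["z"] = some_function(df["x"], prev["x"], df["y"], prev["y"])
print(df)
```

Grouping with `level="ID"` avoids resetting the index, and the shifted frame keeps the same MultiIndex, so the columns line up without a join.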

Apply a function to two columns, referring to the previous row (pandas)
