Question

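I have a time-series dataset of individuals and dates. I want to create a dummy variable, "newpers", that takes the value 1 the first time, chronologically, that an ID appears in the dataset. For example, if the simplified dataset looks like this: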

personid     yearmo
       1 2004-05-01
       1 2004-06-01
       2 2004-05-01
       2 2004-06-01

What I want to produce is:

personid     yearmo newpers
       1 2004-05-01       1
       1 2004-06-01       0
       2 2004-05-01       1
       2 2004-06-01       0

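Sorry if this is easy, but I've been going around in circles and I'm stumped. I've been trying to group/sort to pull out the first chronological date for each person. The dummy variable could then be newpers = (yearmo == firstmo), but I can't seem to get the groupby/sort to run without throwing errors.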

Answer 1

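This should work (assuming the data is sorted by personid, then yearmo):

# True where the ID differs from the previous row; cast to int for a 1/0 dummy
df['newpers'] = (df.personid != df.personid.shift(1)).astype(int)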
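
As a minimal sketch of the above on the question's sample data: the frame construction, the deliberately shuffled row order, and the `sort_values` step are assumptions added for illustration, so that the "sorted by personid, then yearmo" precondition actually holds before the comparison is made.

```python
import pandas as pd

# Sample data from the question, deliberately out of order (an assumption
# made here to show why the sort step matters).
df = pd.DataFrame({
    "personid": [1, 2, 1, 2],
    "yearmo": pd.to_datetime(
        ["2004-06-01", "2004-05-01", "2004-05-01", "2004-06-01"]
    ),
})

# Sort by person, then chronologically, so each person's first row is
# their earliest date.
df = df.sort_values(["personid", "yearmo"]).reset_index(drop=True)

# A row starts a new person when its ID differs from the previous row's ID.
# The first row compares against NaN, which counts as "different".
df["newpers"] = (df["personid"] != df["personid"].shift(1)).astype(int)

print(df)
```

If the original row order has to be kept, the `reset_index` call can be dropped and the result restored with `df.sort_index()` afterwards.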

Answer 2

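I would use the shift method to look back one row in the DataFrame:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, 1, 1, 2, 2, 3, 3, 3, 3, 3],
    # random integers in [0, 10]; randint's upper bound is exclusive
    # (np.random.random_integers has been removed from recent NumPy versions)
    'B': np.random.randint(0, 11, size=10)
})
df['A_'] = df['A'].shift()  # each row will contain the previous value of A
# 1 where A differs from the previous row (including the first row, where A_ is NaN)
df['new_A'] = (df['A'] != df['A_']).astype(int)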

   A   B  A_  new_A
0  1  10 NaN      1
1  1   3   1      0
2  1   8   1      0
3  2   6   1      1
4  2   4   2      0
5  3   2   2      1
6  3   4   3      0
7  3   1   3      0
8  3   0   3      0
9  3   1   3      0
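
Both shift-based approaches depend on the rows being grouped by ID and in date order. An alternative that does not depend on row order follows the idea from the question itself, newpers = (yearmo == firstmo). The sketch below assumes `yearmo` is a datetime column; the sample frame and the name `firstmo` are illustrative only.

```python
import pandas as pd

# Sample data from the question, in arbitrary row order.
df = pd.DataFrame({
    "personid": [2, 1, 2, 1],
    "yearmo": pd.to_datetime(
        ["2004-06-01", "2004-05-01", "2004-05-01", "2004-06-01"]
    ),
})

# transform("min") returns a Series aligned with df's rows, holding each
# person's earliest date on every one of that person's rows.
firstmo = df.groupby("personid")["yearmo"].transform("min")

# The dummy is 1 exactly where the row's date equals that earliest date.
df["newpers"] = (df["yearmo"] == firstmo).astype(int)

print(df)
```

Note that if a person has several rows sharing the same earliest date, all of them are flagged with 1 under this approach.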

Create a dummy variable by personid
