Question

I have a DataFrame with 50-ish columns and duplicate IDs. The part I'm interested in looks like this:

   ID  Value  year
0   3    200  1995
1   3    100  2001
2   4    300  1995
3   4    250  2000

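The first entry for every ID is 1995, but the second entry comes from the ValuedFrom column (it is each object's retirement year, so in most cases it is the object's last value). I want to combine these three columns so that I end up with two value columns, like this: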

   ID  Value1995  ValueRetired
0   3        200           100
1   4        300           250

Any ideas on how to do this?

Answer 1

A general solution:

print (df)
   ID  year  Value
1   3  2003     95
2   3  1995    200
2   3  2001    100
3   4  1995    300
4   4  2000    250
5   4  2004    150
6   5  2000    201
7   5  1995    202 <- drop this 1995 row: it is the last row of group 5, so the next row belongs to another group
8   6  2000    203
9   6  2000    204
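
For anyone who wants to run the steps that follow, here is one way the sample frame might be built. This is only a sketch: it assumes a default 0..9 integer index (the labels in the printout above are not strictly consecutive), because the selection step relies on `idx + 1` pointing at the physically next row.

```python
import pandas as pd

# Sample data copied from the printout above.
# Assumption: a default RangeIndex (0..9) instead of the printed labels,
# so that "label + 1" always refers to the following row.
df = pd.DataFrame({
    "ID":    [3, 3, 3, 4, 4, 4, 5, 5, 6, 6],
    "year":  [2003, 1995, 2001, 1995, 2000, 2004, 2000, 1995, 2000, 2000],
    "Value": [95, 200, 100, 300, 250, 150, 201, 202, 203, 204],
})

print(df)
```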

First, select the index labels of the 1995 rows and of the row that follows each of them:

idx = df.index[(df['year'] == 1995) & (df.groupby('ID').cumcount(ascending=False) != 0)]
idx2 = df.index.intersection(idx + 1).union(idx)
df = df.loc[idx2]
print (df)
   ID  year  Value
2   3  1995    200
2   3  2001    100
3   4  1995    300
4   4  2000    250
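
The `idx + 1` trick only works when the index labels are consecutive integers. If that cannot be guaranteed, a label-independent variant is possible. The following is a sketch, not the answer's own code: it shifts a boolean mask within each ID group, and the names `start`, `follows` and `selected` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "ID":    [3, 3, 3, 4, 4, 4, 5, 5, 6, 6],
    "year":  [2003, 1995, 2001, 1995, 2000, 2004, 2000, 1995, 2000, 2000],
    "Value": [95, 200, 100, 300, 250, 150, 201, 202, 203, 204],
})

# 1995 rows that are not the last row of their ID group.
is_1995 = df["year"].eq(1995)
has_next = df.groupby("ID").cumcount(ascending=False).ne(0)
start = is_1995 & has_next

# Shift the mask down by one row inside each group to mark the row
# that directly follows a selected 1995 row. Integers are used so the
# shifted result has no missing values.
follows = start.astype(int).groupby(df["ID"]).shift(fill_value=0).astype(bool)

selected = df[start | follows]
print(selected)
```

Because the shift happens per group, a 1995 row at the end of a group can never pull in a row from the next ID.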

Details:

print (df.groupby('ID').cumcount(ascending=False))
1    2
2    1
2    0
3    2
4    1
5    0
6    1
7    0
8    1
9    0
dtype: int64
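
The countdown above can be reproduced in isolation. In this small sketch (the names `remaining` and `is_last` are illustrative only), `cumcount(ascending=False)` numbers the rows of each group from the end, so a value of 0 marks the last row of a group, which is exactly the row the filter excludes.

```python
import pandas as pd

ids = pd.Series([3, 3, 3, 4, 4, 4, 5, 5, 6, 6], name="ID")

# Number of rows still to come within the same ID group.
remaining = ids.groupby(ids).cumcount(ascending=False)

# The last row of each group has nothing after it.
is_last = remaining.eq(0)

print(pd.DataFrame({"ID": ids, "remaining": remaining, "is_last": is_last}))
```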

Then replace the values in the year column with labels so the data can be reshaped with unstack:

import numpy as np

df['year'] = np.where(df['year'] == 1995, 'Value1995', 'ValueRetired')

df = df.set_index(['ID', 'year'])['Value'].unstack().reset_index().rename_axis(None, axis=1)
print (df)
   ID  Value1995  ValueRetired
0   3        200           100
1   4        300           250

Answer 2

You can create a Series that maps Cause to labels, and then use pd.DataFrame.pivot:

year
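
The code for this answer is not preserved here, so the following is only a guess at what a mapping-plus-pivot approach could look like. It assumes the mapping is keyed on `year` (the sample data has no Cause column) and that the frame has already been reduced to one 1995 row and one later row per ID; the names `labels` and `result` are hypothetical.

```python
import pandas as pd

# Assumed input: already filtered to two rows per ID.
df = pd.DataFrame({
    "ID":    [3, 3, 4, 4],
    "year":  [1995, 2001, 1995, 2000],
    "Value": [200, 100, 300, 250],
})

# Series mapping each distinct year to an output column label.
labels = pd.Series({
    year: "Value1995" if year == 1995 else "ValueRetired"
    for year in df["year"].unique()
})

result = (
    df.assign(label=df["year"].map(labels))
      .pivot(index="ID", columns="label", values="Value")
      .reset_index()
      .rename_axis(None, axis=1)
)
print(result)
```

Note that `pivot` raises an error if an ID has two rows with the same label, so the filtering step has to come first.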
