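For the following DataFrame, I want to shift data. Within each group defined by "ID" and "Series", the values in columns Q, R and T should be moved down to the row where "Status" is "End".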
import pandas as pd
data = pd.DataFrame({
'ID': ['A','A','A','B','B','B','B','C','C','C','C','C','D','D'],
'Series': [1,1,1,1,1,2,2,1,1,2,2,2,1,1],
'Status': ['Begin','Begin','End','Begin','End','Begin','End','Begin','End','Begin','Begin','End','Begin','End'],
'Q':[9,'','',30,'',14,'',3,'',17,'','',1,''],
'R': ['',8,'','','','','','','','',7,'','',''],
'T': ['','',12,'',38,'',21,'',6,'','',35,'',5]
})
The result should look like this:
result = pd.DataFrame({
'ID': ['A','A','A','B','B','B','B','C','C','C','C','C','D','D'],
'Series': [1,1,1,1,1,2,2,1,1,2,2,2,1,1],
'Status': ['Begin','Begin','End','Begin','End','Begin','End','Begin','End','Begin','Begin','End','Begin','End'],
'Q':['','',9,'',30,'',14,'',3,'','',17,'',1],
'R': ['','',8,'','','','','','','','',7,'',''],
'T': ['','',12,'',38,'',21,'',6,'','',35,'',5]
})
Answer 0 (score: 0)
First group by ID and g, where g assigns one label to each run of Begin rows together with the End row that closes it.
g = (data.Status == 'End').cumsum().shift(1).fillna(0)
l = lambda x: ''.join(x.astype(str))
df2 = data.groupby(['ID', g],as_index=False)[['Series','Status','Q', 'R', 'T']].\
agg({'Series':'first', 'Status': 'last', 'Q': l, 'R': l, 'T':l})
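To make the grouping key concrete, here is a minimal sketch that rebuilds g from the Status column of the sample data alone. The names status and is_end are introduced here for illustration only. Because of shift(1), each End row keeps the label of the Begin rows before it, and the label only increases on the row that follows.

```python
import pandas as pd

# Status column copied from the question's sample data
status = pd.Series(['Begin', 'Begin', 'End', 'Begin', 'End', 'Begin', 'End',
                    'Begin', 'End', 'Begin', 'Begin', 'End', 'Begin', 'End'])

# True on every row that closes a block
is_end = status == 'End'

# Running count of End rows, shifted down one row so that the End row
# itself still belongs to the block it closes; the first row has no
# predecessor, so its NaN is filled with 0
g = is_end.cumsum().shift(1).fillna(0).astype(int)

# g is 0,0,0,1,1,2,2,3,3,4,4,4,5,5
print(pd.DataFrame({'Status': status, 'g': g}))
```

Every block ends up with its own label, so grouping by ID and g aggregates each Begin...End run separately.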
Then set the same index on both frames so they align, and assign with loc:
df2 = df2.set_index(['ID', 'Status', 'Series'])
data = data.set_index(['ID', 'Status', 'Series'])
data.loc[:, ['Q', 'R', 'T']] = df2[['Q', 'R', 'T']]
data = data.fillna("").reset_index()
   ID Status  Series   Q  R   T
0   A  Begin       1
1   A  Begin       1
2   A    End       1   9  8  12
3   B  Begin       1
4   B    End       1  30     38
5   B  Begin       2
6   B    End       2  14     21
7   C  Begin       1
8   C    End       1   3      6
9   C  Begin       2
10  C  Begin       2
11  C    End       2  17  7  35
12  D  Begin       1
13  D    End       1   1      5
Answer 1 (score: 0)
Use GroupBy.transform with GroupBy.first to find the first non-NaN value in each group, then blank out the repeated values with mask and duplicated:
import numpy as np
cols = ['Q', 'R', 'T']
# replace empty strings with NaN
data[cols] = data[cols].astype(str).replace('', np.nan)
print (data)
ID Series Status Q R T
0 A 1 Begin 9 NaN NaN
1 A 1 Begin NaN 8 NaN
2 A 1 End NaN NaN 12
3 B 1 Begin 30 NaN NaN
4 B 1 End NaN NaN 38
5 B 2 Begin 14 NaN NaN
6 B 2 End NaN NaN 21
7 C 1 Begin 3 NaN NaN
8 C 1 End NaN NaN 6
9 C 2 Begin 17 NaN NaN
10 C 2 Begin NaN 7 NaN
11 C 2 End NaN NaN 35
12 D 1 Begin 1 NaN NaN
13 D 1 End NaN NaN 5
g = data.groupby(['ID', 'Series'])
for c in cols:
data[c] = g[c].transform('first')
print (data)
ID Series Status Q R T
0 A 1 Begin 9 8 12
1 A 1 Begin 9 8 12
2 A 1 End 9 8 12
3 B 1 Begin 30 NaN 38
4 B 1 End 30 NaN 38
5 B 2 Begin 14 NaN 21
6 B 2 End 14 NaN 21
7 C 1 Begin 3 NaN 6
8 C 1 End 3 NaN 6
9 C 2 Begin 17 7 35
10 C 2 Begin 17 7 35
11 C 2 End 17 7 35
12 D 1 Begin 1 NaN 5
13 D 1 End 1 NaN 5
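The masking step depends on how duplicated behaves with keep='last': it flags every row except the last one for each key. A minimal sketch on a small hypothetical frame (the frame df and the name dup are made up for illustration) shows the effect:

```python
import pandas as pd

# Small hypothetical frame where each group's value is already repeated
# on every row, as it is after transform('first')
df = pd.DataFrame({'ID': ['A', 'A', 'A', 'B', 'B'],
                   'Series': [1, 1, 1, 1, 1],
                   'Q': ['9', '9', '9', '30', '30']})

# True for every row that has a later row with the same (ID, Series) key
dup = df.duplicated(['ID', 'Series'], keep='last')

# mask replaces values where the condition is True
df['Q'] = df['Q'].mask(dup, '')
print(df)
```

This only puts the value on the End row if End is the last row of its group, which holds for the sample data in the question.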
data[cols] = data[cols].mask(data.duplicated(['ID','Series'], keep='last'), '').fillna('')
print (data)
   ID  Series Status   Q  R   T
0   A       1  Begin
1   A       1  Begin
2   A       1    End   9  8  12
3   B       1  Begin
4   B       1    End  30     38
5   B       2  Begin
6   B       2    End  14     21
7   C       1  Begin
8   C       1    End   3      6
9   C       2  Begin
10  C       2  Begin
11  C       2    End  17  7  35
12  D       1  Begin
13  D       1    End   1      5
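For reuse, the steps of this answer can be collected into one function. The following is only a sketch: the helper name move_to_last_row and its parameters are hypothetical, it assumes the End row is the last row of every (ID, Series) group, and, like the code above, it returns the moved values as strings rather than integers.

```python
import numpy as np
import pandas as pd

def move_to_last_row(df, keys, cols):
    """Hypothetical helper: take the first non-empty value of each column
    per group and keep it only on the last row of that group."""
    out = df.copy()
    # Treat empty strings as missing so that 'first' skips them
    out[cols] = out[cols].astype(str).replace('', np.nan)
    # Broadcast each group's first non-missing value to all of its rows
    out[cols] = out.groupby(keys)[cols].transform('first')
    # Blank every row except the last one per group, then blank leftover NaN
    not_last = out.duplicated(keys, keep='last')
    out[cols] = out[cols].mask(not_last, '').fillna('')
    return out

# Sample data from the question
data = pd.DataFrame({
    'ID': ['A','A','A','B','B','B','B','C','C','C','C','C','D','D'],
    'Series': [1,1,1,1,1,2,2,1,1,2,2,2,1,1],
    'Status': ['Begin','Begin','End','Begin','End','Begin','End',
               'Begin','End','Begin','Begin','End','Begin','End'],
    'Q': [9,'','',30,'',14,'',3,'',17,'','',1,''],
    'R': ['',8,'','','','','','','','',7,'','',''],
    'T': ['','',12,'',38,'',21,'',6,'','',35,'',5],
})

out = move_to_last_row(data, ['ID', 'Series'], ['Q', 'R', 'T'])
print(out)
```

The function works on a copy, so the original data frame is left unchanged.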