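I have a question about a large DataFrame. Here is a small snippet of it. I want to fill the last column, E, with the maximum value wherever a block contains some values, and leave it empty otherwise. This is the data: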
import numpy as np
import pandas as pd

d = {'A': [4000074, 4000074, 4000074, 4000074, 4000074, 4000074, 4000074, 4000074, 4000074,
           4000074, 4000074, 4000074, 4000074, 4000074, 4000074, 4000074, 4000074, 4000074],
     'B': ['SP000796746', 'SP000796746', 'SP000796746', 'SP000796746', 'SP000796746', 'SP000796746',
           'SP000796746', 'SP000796746', 'SP000796746', 'SP000796746', 'SP000796746', 'SP000796746',
           'SP000796746', 'SP000796746', 'SP000796746', 'SP000796746', 'SP000796746', 'SP000796746'],
     'C': [201926, 201926, 201926, 201926, 201926, 201926, 201909, 201909, 201909, 201909, 201909,
           201909, 201933, 201933, 201933, 201933, 201933, 201933],
     'D': [-1, 0, 1, 2, 3, 4, -1, 0, 1, 2, 3, 4, -1, 0, 1, 2, 3, 4],
     # 18 values, matching the other columns and the printed table (rows 12-14 are NaN)
     'E': [np.nan, 1000, 1000, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan,
           np.nan, np.nan, np.nan, 3000, 3000, np.nan]}
It looks like this:
A B C D E
0 4000074 SP000796746 201926 -1 NaN
1 4000074 SP000796746 201926 0 1000.0
2 4000074 SP000796746 201926 1 1000.0
3 4000074 SP000796746 201926 2 NaN
4 4000074 SP000796746 201926 3 NaN
5 4000074 SP000796746 201926 4 NaN
6 4000074 SP000796746 201909 -1 NaN
7 4000074 SP000796746 201909 0 NaN
8 4000074 SP000796746 201909 1 NaN
9 4000074 SP000796746 201909 2 NaN
10 4000074 SP000796746 201909 3 NaN
11 4000074 SP000796746 201909 4 NaN
12 4000074 SP000796746 201933 -1 NaN
13 4000074 SP000796746 201933 0 NaN
14 4000074 SP000796746 201933 1 NaN
15 4000074 SP000796746 201933 2 3000.0
16 4000074 SP000796746 201933 3 3000.0
17 4000074 SP000796746 201933 4 NaN
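My goal is to fill column E with the highest value in every row of a block (D running from -1 to 4) whenever that block contains any value. If it contains none, E should stay empty. So it should look like this: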
A B C D E
0 4000074 SP000796746 201926 -1 1000.0
1 4000074 SP000796746 201926 0 1000.0
2 4000074 SP000796746 201926 1 1000.0
3 4000074 SP000796746 201926 2 1000.0
4 4000074 SP000796746 201926 3 1000.0
5 4000074 SP000796746 201926 4 1000.0
6 4000074 SP000796746 201909 -1 NaN
7 4000074 SP000796746 201909 0 NaN
8 4000074 SP000796746 201909 1 NaN
9 4000074 SP000796746 201909 2 NaN
10 4000074 SP000796746 201909 3 NaN
11 4000074 SP000796746 201909 4 NaN
12 4000074 SP000796746 201933 -1 3000.0
13 4000074 SP000796746 201933 0 3000.0
14 4000074 SP000796746 201933 1 3000.0
15 4000074 SP000796746 201933 2 3000.0
16 4000074 SP000796746 201933 3 3000.0
17 4000074 SP000796746 201933 4 3000.0
My code is as follows:
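df = pd.DataFrame(d)  # d is a plain dict, so build the DataFrame first
# Row labels where a new block starts (D == -1)
indx = df[df['D'] == -1].index.values
for i, j in zip(indx[:-1], indx[1:]):
    # .loc slices are label-based and inclusive, hence j-1
    df.loc[i:j-1, 'E'] = df.loc[i:j-1, 'E'].max()
    if j == indx[-1]:
        # The last block runs to the end of the frame
        df.loc[j:, 'E'] = df.loc[j:, 'E'].max()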
It does not work for very large DataFrames... Maybe someone has an idea for different code, or a correction to mine.
Thanks!
A B C D E
0 4000074 SP000796746 201926 -1 0
1 4000074 SP000796746 201926 0 1000.0
2 4000074 SP000796746 201926 1 1000.0
3 4000074 SP000796746 201926 2 0
4 4000074 SP000796746 201926 3 0
5 4000074 SP000796746 201926 4 0
6 4000074 SP000796746 201909 -1 NaN
7 4000074 SP000796746 201909 0 NaN
8 4000074 SP000796746 201909 1 NaN
9 4000074 SP000796746 201909 2 NaN
10 4000074 SP000796746 201909 3 NaN
11 4000074 SP000796746 201909 4 NaN
12 4000074 SP000796746 201933 -1 0
13 4000074 SP000796746 201933 0 0
14 4000074 SP000796746 201933 1 0
15 4000074 SP000796746 201933 2 3000.0
16 4000074 SP000796746 201933 3 3000.0
17 4000074 SP000796746 201933 4 0
Answer 0 (score: 2)
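You can do this with groupby.transform and max, using groups that start at each new -1 in column D, which you can build with cumsum. Then fillna the original column with the result.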
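To see what each piece contributes, here is a step-by-step sketch on just the D and E columns of the question's sample. The E values are taken from the printed table, and the intermediate names group_id, group_max and filled are illustrative, not from the question.

```python
import numpy as np
import pandas as pd

nan = np.nan
df = pd.DataFrame({
    'D': [-1, 0, 1, 2, 3, 4] * 3,
    'E': [nan, 1000, 1000, nan, nan, nan,
          nan, nan, nan, nan, nan, nan,
          nan, nan, nan, 3000, 3000, nan],
})

# Each -1 in D starts a new block; the running count of -1s
# labels the rows of the three blocks 1, 2 and 3.
group_id = df['D'].eq(-1).cumsum()

# Broadcast each block's maximum of E back onto every row of that block.
# A block that is entirely NaN gets NaN as its maximum.
group_max = df['E'].groupby(group_id).transform('max')

# Fill only the missing entries of E with their block's maximum.
filled = df['E'].fillna(group_max)
print(filled.tolist())
```

All three steps are vectorized, so there is no Python-level loop over .loc slices, which is usually what makes the looped version slow on large frames.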
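As a one-liner:
df['E'] = df['E'].fillna(df['E'].groupby(df['D'].eq(-1).cumsum()).transform('max'))
Edit: to fill with zeros instead, you can replace the non-missing group maxima with 0 before passing them to fillna.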
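For the zero-fill variant mentioned in the edit, one possible sketch is below. This is my assumption of how it could be done, not necessarily the code the answer originally showed: every non-missing group maximum is turned into 0 with Series.where before calling fillna, so blocks with no values at all keep NaN. The names group_max and zero_or_nan are illustrative.

```python
import numpy as np
import pandas as pd

nan = np.nan
df = pd.DataFrame({
    'D': [-1, 0, 1, 2, 3, 4] * 3,
    'E': [nan, 1000, 1000, nan, nan, nan,
          nan, nan, nan, nan, nan, nan,
          nan, nan, nan, 3000, 3000, nan],
})

# Per-block maximum of E, broadcast to every row (NaN for all-NaN blocks).
group_max = df['E'].groupby(df['D'].eq(-1).cumsum()).transform('max')

# Keep NaN where the block has no value at all; otherwise use 0 as the filler.
zero_or_nan = group_max.where(group_max.isna(), 0)

# Existing values in E are untouched; only the gaps receive 0 (or stay NaN).
df['E'] = df['E'].fillna(zero_or_nan)
print(df)
```

On the sample this leaves the existing 1000 and 3000 entries in place, puts 0 in the other rows of those two blocks, and keeps the all-empty 201909 block as NaN.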