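I have a dataframe in which, for some columns, the value in a row depends on the value in the previous row. This dependency only holds within a group, identified by e.g. 'gid'.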
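To make the dependency concrete, here is the rule as I read it off the loop in the attached code. The subscripts g (for gid) and i (for id) are only shorthand I am introducing here:

```latex
\begin{aligned}
\text{output}_{g,0} &= x_{g,0},\\
\text{output}_{g,i} &= \max\bigl(x_{g,i},\ \text{output}_{g,i-1} + y_{g,i}\bigr), \quad i \ge 1.
\end{aligned}
```

For example, in group 1 the second value is max(-0.611756, 1.624345 + 0.894607) = 2.518952.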
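What I have done is basically create another dataframe and transpose the columns used in the calculation. The steps I use in the attached code are as follows.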
1. The original dataframe:

    gid  id         x         y
0     1   0  1.624345  0.876389
1     1   1 -0.611756  0.894607
2     1   2 -0.528172  0.085044
3     1   3 -1.072969  0.039055
4     1   4  0.865408  0.169830
5     2   0 -2.301539  0.878143
6     2   1  1.744812  0.098347
7     2   2 -0.761207  0.421108
8     2   3  0.319039  0.957890
9     2   4 -0.249370  0.533165
10    3   0  1.462108  0.691877
11    3   1 -2.060141  0.315516
12    3   2 -0.322417  0.686501
13    3   3 -0.384054  0.834626
14    3   4  1.133769  0.018288
2. x and y unstacked, one row per gid:

           x0        x1        x2        x3        x4        y0        y1  \
gid
1    1.624345 -0.611756 -0.528172 -1.072969  0.865408  0.876389  0.894607
2   -2.301539  1.744812 -0.761207  0.319039 -0.249370  0.878143  0.098347
3    1.462108 -2.060141 -0.322417 -0.384054  1.133769  0.691877  0.315516

           y2        y3        y4
gid
1    0.085044  0.039055  0.169830
2    0.421108  0.957890  0.533165
3    0.686501  0.834626  0.018288
3. The wide frame joined back with id, so every original row carries its group's values:

    gid        x0        x1        x2        x3        x4        y0        y1  \
0     1  1.624345 -0.611756 -0.528172 -1.072969  0.865408  0.876389  0.894607
1     1  1.624345 -0.611756 -0.528172 -1.072969  0.865408  0.876389  0.894607
2     1  1.624345 -0.611756 -0.528172 -1.072969  0.865408  0.876389  0.894607
3     1  1.624345 -0.611756 -0.528172 -1.072969  0.865408  0.876389  0.894607
4     1  1.624345 -0.611756 -0.528172 -1.072969  0.865408  0.876389  0.894607
5     2 -2.301539  1.744812 -0.761207  0.319039 -0.249370  0.878143  0.098347
6     2 -2.301539  1.744812 -0.761207  0.319039 -0.249370  0.878143  0.098347
7     2 -2.301539  1.744812 -0.761207  0.319039 -0.249370  0.878143  0.098347
8     2 -2.301539  1.744812 -0.761207  0.319039 -0.249370  0.878143  0.098347
9     2 -2.301539  1.744812 -0.761207  0.319039 -0.249370  0.878143  0.098347
10    3  1.462108 -2.060141 -0.322417 -0.384054  1.133769  0.691877  0.315516
11    3  1.462108 -2.060141 -0.322417 -0.384054  1.133769  0.691877  0.315516
12    3  1.462108 -2.060141 -0.322417 -0.384054  1.133769  0.691877  0.315516
13    3  1.462108 -2.060141 -0.322417 -0.384054  1.133769  0.691877  0.315516
14    3  1.462108 -2.060141 -0.322417 -0.384054  1.133769  0.691877  0.315516

          y2        y3        y4  id
0   0.085044  0.039055  0.169830   0
1   0.085044  0.039055  0.169830   1
2   0.085044  0.039055  0.169830   2
3   0.085044  0.039055  0.169830   3
4   0.085044  0.039055  0.169830   4
5   0.421108  0.957890  0.533165   0
6   0.421108  0.957890  0.533165   1
7   0.421108  0.957890  0.533165   2
8   0.421108  0.957890  0.533165   3
9   0.421108  0.957890  0.533165   4
10  0.686501  0.834626  0.018288   0
11  0.686501  0.834626  0.018288   1
12  0.686501  0.834626  0.018288   2
13  0.686501  0.834626  0.018288   3
14  0.686501  0.834626  0.018288   4
4. The computed values, stacked back into long form:

    gid    output  id
0     1  1.624345   0
1     1  2.518952   1
2     1  2.603996   2
3     1  2.643051   3
4     1  2.812881   4
5     2 -2.301539   0
6     2  1.744812   1
7     2  2.165919   2
8     2  3.123809   3
9     2  3.656974   4
10    3  1.462108   0
11    3  1.777624   1
12    3  2.464124   2
13    3  3.298750   3
14    3  3.317038   4
5. The result joined onto the original dataframe:

    gid  id         x         y    output
0     1   0  1.624345  0.876389  1.624345
1     1   1 -0.611756  0.894607  2.518952
2     1   2 -0.528172  0.085044  2.603996
3     1   3 -1.072969  0.039055  2.643051
4     1   4  0.865408  0.169830  2.812881
5     2   0 -2.301539  0.878143 -2.301539
6     2   1  1.744812  0.098347  1.744812
7     2   2 -0.761207  0.421108  2.165919
8     2   3  0.319039  0.957890  3.123809
9     2   4 -0.249370  0.533165  3.656974
10    3   0  1.462108  0.691877  1.462108
11    3   1 -2.060141  0.315516  1.777624
12    3   2 -0.322417  0.686501  2.464124
13    3   3 -0.384054  0.834626  3.298750
14    3   4  1.133769  0.018288  3.317038
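But I think there must be a better way to achieve the same result. I was thinking of using groupby and then apply with the id column, but I haven't figured out how to do it. Any help is appreciated.
The complete code is attached.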
import numpy as np
import pandas as pd

# 1. Build the example dataframe: three groups (gid) of five rows each.
df = pd.DataFrame({'gid': np.repeat([1, 2, 3], 5),
                   'id': [0, 1, 2, 3, 4] * 3,
                   'x': np.random.randn(15),
                   'y': np.random.random(15)})

# 2. Pivot to wide form: one row per gid, columns x0..x4 and y0..y4.
columns = ['gid', 'id', 'x', 'y']
_df = df[columns].set_index(['gid', 'id']).unstack()
_df.columns = _df.columns.map(lambda x: '{}{}'.format(x[0], x[1]))

# 3. Join id back so that every original row carries its group's wide row.
_df = _df.join(df.set_index('gid')['id'],
               how='left').reset_index().set_index(df.index)

# 4. Update x1..x4 in order (each uses the updated previous column),
#    then go back to long form.
for i in range(1, 5):
    _df['x' + str(i)] = np.fmax(_df['x' + str(i)],
                                _df['x' + str(i - 1)] + _df['y' + str(i)])
columns = pd.Index([column for column in _df.columns
                    if column.find('x') >= 0], name='x')
_df = _df.reindex(columns=columns).groupby(_df['gid']).last()
_df = _df.stack().reset_index().rename(columns={0: 'output'}).drop('x', axis=1)
_df['id'] = _df.groupby('gid').cumcount()

# 5. Attach the result to the original dataframe.
df = df.join(_df[['output']])
Answer 0 (score: 0)
I think the following code will do the job, and it is quite simple.
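# df_1 is assumed to be a copy of the question's df, sorted by gid.
def myfunc(df, id=0, column='x'):
    # Work on .values so the operands are combined by position; in recent
    # pandas, Series with different index labels are aligned and yield NaN.
    cur = df.loc[df['id'] == id, column].values
    prev = df.loc[df['id'] == id - 1, column].values
    y = df.loc[df['id'] == id, 'y'].values
    return np.fmax(cur, prev + y)[0]

for id in range(1, 5):
    df_1.loc[df_1['id'] == id, 'x'] = \
        df_1.groupby('gid').apply(myfunc, id=id).values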
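As an alternative, here is a minimal sketch that walks each group once instead of calling groupby once per id. It is only one possible approach, not the method from the question. The helper name running_output is hypothetical, and the data is the first two groups from the question's printed output (rounded as shown there), so the last digit can differ slightly from the question's table.

```python
import numpy as np
import pandas as pd


def running_output(x, y):
    """Hypothetical helper: out[0] = x[0], out[i] = fmax(x[i], out[i-1] + y[i])."""
    out = np.empty(len(x), dtype=float)
    out[0] = x[0]
    for i in range(1, len(x)):
        out[i] = np.fmax(x[i], out[i - 1] + y[i])
    return out


# First two groups of the question's example data (assumption: rounded values).
df = pd.DataFrame({
    'gid': np.repeat([1, 2], 5),
    'id': [0, 1, 2, 3, 4] * 2,
    'x': [1.624345, -0.611756, -0.528172, -1.072969, 0.865408,
          -2.301539, 1.744812, -0.761207, 0.319039, -0.249370],
    'y': [0.876389, 0.894607, 0.085044, 0.039055, 0.169830,
          0.878143, 0.098347, 0.421108, 0.957890, 0.533165],
})

# The recurrence needs rows in id order within each group.
df = df.sort_values(['gid', 'id'])

# One pass per group; each piece keeps the original row labels so that
# the assignment below aligns on the index.
pieces = [
    pd.Series(running_output(g['x'].to_numpy(), g['y'].to_numpy()), index=g.index)
    for _, g in df.groupby('gid')
]
df['output'] = pd.concat(pieces)
print(df)
```

The Python loop inside each group is deliberate: every value depends on the previously computed one, so the update cannot be expressed as a single shift or cumsum.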