Question

Consider the following:

DataFrame:

id    endId   startId   ownerId   value
1     50          50          10        105 
2     51          50          10        240
3     52          50          10        420
4     53          53          10        470
5     40          40          11        320
6     41          40          11        18
7     55          55          12        50
8     57          55          12        412
9     59          55          12        398
10    60          57          12        320

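For each row, I want to sum the "value" column over all rows of the same ownerId whose endId lies between the current row's startId and endId (inclusive).

The output should be: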

id    endId   startId   ownerId   value    output
1     50          50          10        105      105     # Nothing between 50 and 50
2     51          50          10        240      345     # Found 1 record (endId with id 1)
3     52          50          10        420      765     # Found 2 records (endId with id 1 and 2)
4     53          53          10        470      470     # Nothing else between 53 and 53
5     40          40          11        320      320     # Reset because Owner is different
6     41          40          11        18       338     # Found 1 record (endId with id 5)
7     55          55          12        50       50      # ...
8     57          55          12        412      462
9     59          55          12        398      860
10    60          57          12        320      1130    # Found 3 records between 57 and 60 (endId with id 8, 9 and 10)

I tried using diff, groupby.cumsum, etc., but could not get the result I need...
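For anyone who wants to reproduce the problem, the sample frame can be rebuilt as in the sketch below. The name `expected` is not part of the question; it is simply the requested output column typed in by hand so later results can be compared against it.

```python
import pandas as pd

# Sample data copied from the table in the question
df = pd.DataFrame({
    "id":      [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "endId":   [50, 51, 52, 53, 40, 41, 55, 57, 59, 60],
    "startId": [50, 50, 50, 53, 40, 40, 55, 55, 55, 57],
    "ownerId": [10, 10, 10, 10, 11, 11, 12, 12, 12, 12],
    "value":   [105, 240, 420, 470, 320, 18, 50, 412, 398, 320],
})

# Requested output column (hypothetical helper name, values from the question)
expected = [105, 345, 765, 470, 320, 338, 50, 462, 860, 1130]
```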

Answer 1

I would use numpy broadcasting to identify the rows you are looking for:

import numpy as np
import pandas as pd

# Create new df with ownerId as index
df2=df.set_index('ownerId')
df2['output']=0

# Loop over each distinct ownerId (the index repeats once per row)
for k in df2.index.unique():
    refend=df2.loc[k,'endId'].values
    refstart=df2.loc[k,'startId'].values

    # Identify values matching the condition
    i,j=np.where((refend[:,None]<=refend)&(refend[:,None]>=refstart))
    # Groupby and sum
    dfres=pd.concat([df2.loc[k].iloc[j].endId.reset_index(drop=True),
                     df2.loc[k].iloc[i].value.reset_index(drop=True)],
                    axis=1).groupby('endId').sum()
    df2.loc[k,'output']=dfres.value.values

# reset index
df2.reset_index(inplace=True)

The output is:

   ownerId  id  endId  startId  value  output
0       10   1     50       50    105     105
1       10   2     51       50    240     345
2       10   3     52       50    420     765
3       10   4     53       53    470     470
4       11   5     40       40    320     320
5       11   6     41       40     18     338
6       12   7     55       55     50      50
7       12   8     57       55    412     462
8       12   9     59       55    398     860
9       12  10     60       57    320    1130
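To see what the broadcast comparison does, here is a minimal sketch restricted to the four rows of ownerId 10. The arrays are copied from the question; the name `mask` and the final summation are illustrative and replace the concat/groupby step used above.

```python
import numpy as np

# endId, startId and value for ownerId 10 only (copied from the question)
refend = np.array([50, 51, 52, 53])
refstart = np.array([50, 50, 50, 53])
value = np.array([105, 240, 420, 470])

# mask[i, j] is True when the endId of row i lies in
# [startId, endId] of row j (both bounds inclusive)
mask = (refend[:, None] <= refend) & (refend[:, None] >= refstart)

# For each column j, add up value[i] over the rows i selected by the mask
output = np.where(mask, value[:, None], 0).sum(axis=0)
print(output)  # [105 345 765 470]
```

Column j of the mask selects exactly the rows that contribute to row j, which is why `np.where` in the loop returns the index pairs `(i, j)` that are then grouped and summed.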

Edit

The for loop can be avoided as follows:

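refend=df.loc[:,'endId'].values
refstart=df.loc[:,'startId'].values
owner=df.loc[:,'ownerId'].values

# Also require the same ownerId, otherwise overlapping endId ranges
# of different owners would be matched against each other
i,j=np.where((refend[:,None]<=refend)&(refend[:,None]>=refstart)
             &(owner[:,None]==owner))

dfres=pd.concat([df.iloc[j].endId.reset_index(drop=True),
                 df.loc[:,['ownerId','value']].iloc[i].reset_index(drop=True)],
                axis=1).groupby(['ownerId','endId']).sum()

# groupby sorts by (ownerId, endId), so df must already be in that order
df['output']=dfres.value.values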
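A variant that skips the concat/groupby step is sketched below. It is not the original answer's code: it assumes the frame is small enough for an n-by-n boolean matrix, and it compares ownerId explicitly so that rows of different owners never match. Because the sum is taken per column of the mask, the result lines up with the rows of `df` regardless of how they are sorted.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "id":      [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "endId":   [50, 51, 52, 53, 40, 41, 55, 57, 59, 60],
    "startId": [50, 50, 50, 53, 40, 40, 55, 55, 55, 57],
    "ownerId": [10, 10, 10, 10, 11, 11, 12, 12, 12, 12],
    "value":   [105, 240, 420, 470, 320, 18, 50, 412, 398, 320],
})

end = df["endId"].to_numpy()
start = df["startId"].to_numpy()
owner = df["ownerId"].to_numpy()
value = df["value"].to_numpy()

# mask[i, j]: row i has the same owner as row j and its endId
# falls inside [startId, endId] of row j
mask = (
    (end[:, None] <= end)
    & (end[:, None] >= start)
    & (owner[:, None] == owner)
)

# Sum the contributing values for every row j (memory grows as n * n)
df["output"] = np.where(mask, value[:, None], 0).sum(axis=0)
print(df)
```

On the sample data this reproduces the output column requested in the question, including 1130 for id 10.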

Answer 2

I copied df into df2 to preserve the original data. I suggest doing the task in two steps:

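import numpy as np

# work on a copy so that df keeps the original data
df2 = df.copy()

# step 1: cumulative sum of value within each ownerId
df2['output'] = df.groupby('ownerId')['value'].cumsum()

# step 2: where endId <= startId, restart from the row's own value
df2['output'] = np.where((df2['endId'] <= df2['startId']),
                         df2['value'],     # condition true: use value
                         df2['output'])    # otherwise: keep the cumulative sum

print(df2)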
   id  endId  startId  ownerId  value  output
0   1     50       50       10    105     105
1   2     51       50       10    240     345
2   3     52       50       10    420     765
3   4     53       53       10    470     470
4   5     40       40       11    320     320
5   6     41       40       11     18     338
6   7     55       55       12     50      50
7   8     57       55       12    412     462
8   9     59       55       12    398     860
9  10     60       57       12    320    1180

Logic:

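Sorry, folks, but I still don't get it. For ownerId 10 and 11, the records whose endId and startId share the same value restart the cumulative sum, and that seems fine. But for some reason you say the same rule does not apply to ownerId 12. I understood that the ids between 7 and 10 should be considered. The pattern seems to be that values are not accumulated when endId and startId both match the highest value, which is what happens for id 4.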
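Comparing the two-step result with the output requested in the question shows where the two diverge. The sketch below rebuilds the sample frame, applies the two steps, and lists the rows that differ; the `expected` column is typed in by hand from the question and is not part of the original answer.

```python
import numpy as np
import pandas as pd

# Sample data and requested output, both copied from the question
df = pd.DataFrame({
    "id":      [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "endId":   [50, 51, 52, 53, 40, 41, 55, 57, 59, 60],
    "startId": [50, 50, 50, 53, 40, 40, 55, 55, 55, 57],
    "ownerId": [10, 10, 10, 10, 11, 11, 12, 12, 12, 12],
    "value":   [105, 240, 420, 470, 320, 18, 50, 412, 398, 320],
})

df2 = df.copy()
df2["expected"] = [105, 345, 765, 470, 320, 338, 50, 462, 860, 1130]

# Two-step approach: per-owner cumulative sum, then reset where endId <= startId
df2["output"] = df2.groupby("ownerId")["value"].cumsum()
df2["output"] = np.where(df2["endId"] <= df2["startId"],
                         df2["value"], df2["output"])

# Rows where the two-step result differs from the requested output
mismatch = df2.loc[df2["output"] != df2["expected"], ["id", "output", "expected"]]
print(mismatch)  # only id 10 differs: 1180 vs 1130
```

The difference comes from id 10 having startId 57: the requested rule excludes id 7 (endId 55) from its window, whereas a plain cumulative sum keeps it.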

CumSum column only when the previous ID is between 2 values
