I have a pandas DataFrame of intervals, each defined by a start and a stop:
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        'start': [1,1,1,2,2,2,2,3,3,3,3,3,3,3],
        'stop': [9,9,10,10,10,11,11,11,11,12,11,12,11,11],
        'percent': [0.51,0.29,0.92,0.60,0.10,0.12,0.60,0.30,0.10,0.42,0.51,0.51,0.51,0.10],
        'order': [3,80,3,3,4,8,89,2,3,4,5,64,82,68]
    }
)
It looks like this:
start stop percent order
1 9 0.51 3
1 9 0.29 80
1 10 0.92 3
2 10 0.60 3
2 10 0.10 4
2 11 0.12 8
2 11 0.60 89
3 11 0.30 2
3 11 0.10 3
3 12 0.42 4
3 11 0.51 5
3 12 0.51 64
3 11 0.51 82
3 11 0.10 68
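For each position (obtained by splitting the intervals), I want to compute the total count, the sum of percent, and the sum of order.
Note: my original DataFrame is not sorted by coordinate the way this example is.
I would like to get a DataFrame like this: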
pos count sum_percent sum_order
1 3 1.72 86
2 7 3.14 190
3 14 5.59 418
4 14 5.59 418
5 14 5.59 418
6 14 5.59 418
7 14 5.59 418
8 14 5.59 418
9 14 5.59 418
10 12 4.79 335
11 9 3.17 325
12 2 0.93 68
I managed to get the result I want for the count column:
max_pos=df[['start', 'stop']].values.max()
pos_range=np.arange(1, max_pos+1)
counts = ((df[['start']].values <= pos_range) & (pos_range <= df[['stop']].values)).sum(axis=0)
o = pd.DataFrame({'pos': pos_range, "counts": counts})
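But I have not managed to do the same for the column sums. Any help? Thanks in advance.
Answer 0 (score: 0)
Use the boolean array you already built for the counts as an index: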
import numpy as np
import pandas as pd
names=["start","stop","percent","order"]
vals=np.array([
[1,9,0.51, 3],
[1,9,0.29,80],
[1,10,0.92, 3],
[2,10,0.60, 3],
[2,10,0.10, 4],
[2,11,0.12, 8],
[2,11,0.60,89],
[3,11,0.30, 2],
[3,11,0.10, 3],
[3,12,0.42, 4],
[3,11,0.51, 5],
[3,12,0.51,64],
[3,11,0.51,82],
[3,11,0.10,68]
])
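df = pd.DataFrame(vals, columns=names)
# np.array stores every value as float, so restore the integer columns
df = df.astype({"start": int, "stop": int, "order": int})
max_pos = df[['start', 'stop']].values.max()
pos_range = np.arange(1, max_pos + 1)
# _ix[i, j] is True when interval i covers position pos_range[j]
_ix = ((df[['start']].values <= pos_range) & (pos_range <= df[['stop']].values))
counts = _ix.sum(axis=0)
# Each row of _ix.T is the mask of the intervals covering one position
sum_percent = []
for i in _ix.T:
    sum_percent.append(df["percent"].values[i].sum())
sum_order = []
for i in _ix.T:
    sum_order.append(df["order"].values[i].sum())
o = pd.DataFrame({'pos': pos_range, "counts": counts, "sum_percent": sum_percent, "sum_order": sum_order})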
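As a possible variation, the two loops can be replaced by a single matrix product: multiplying the transposed mask by the value columns sums, for each position, exactly the rows whose interval covers it. The sketch below is not part of the original answer. It rebuilds the question's data so that it runs on its own, and the names `mask`, `sums`, and `result` are only illustrative. Rounding `sum_percent` to two decimals is an assumption made to hide floating-point noise.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "start":   [1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3],
    "stop":    [9, 9, 10, 10, 10, 11, 11, 11, 11, 12, 11, 12, 11, 11],
    "percent": [0.51, 0.29, 0.92, 0.60, 0.10, 0.12, 0.60,
                0.30, 0.10, 0.42, 0.51, 0.51, 0.51, 0.10],
    "order":   [3, 80, 3, 3, 4, 8, 89, 2, 3, 4, 5, 64, 82, 68],
})

pos_range = np.arange(1, df[["start", "stop"]].values.max() + 1)

# mask[i, j] is True when interval i covers position pos_range[j]
mask = (df[["start"]].values <= pos_range) & (pos_range <= df[["stop"]].values)

# (positions x intervals) @ (intervals x 2) gives one row of sums per position
sums = mask.T.astype(float) @ df[["percent", "order"]].to_numpy(dtype=float)

result = pd.DataFrame({
    "pos": pos_range,
    "count": mask.sum(axis=0),
    # rounding is an assumption, used only to hide floating-point noise
    "sum_percent": sums[:, 0].round(2),
    # the order sums are exact small integers, so the cast is safe here
    "sum_order": sums[:, 1].astype(int),
})
print(result)
```

Like the loop version, this builds a full intervals-by-positions mask, so memory grows with the product of the two. For very many intervals or a very wide coordinate range, a different approach may be needed.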