Pandas: sum multiple columns based on similar consecutive values in another column

Time: 2018-08-21 22:02:49

Tags: python pandas

Given the following table:

+----+--------+--------+--------------+
| Nr | Price  | Volume | Transactions |
+----+--------+--------+--------------+
|  1 |  194.6 |    100 |            1 |
|  2 |    195 |     10 |            1 |
|  3 | 194.92 |    100 |            1 |
|  4 | 194.92 |     52 |            1 |
|  5 |  194.9 |     99 |            1 |
|  6 | 194.86 |     74 |            1 |
|  7 | 194.85 |    900 |            1 |
|  8 | 194.85 |     25 |            1 |
|  9 | 194.85 |    224 |            1 |
| 10 |  194.6 |    101 |            1 |
| 11 | 194.85 |     19 |            1 |
| 12 |  194.6 |     10 |            1 |
| 13 |  194.6 |     25 |            1 |
| 14 | 194.53 |     12 |            1 |
| 15 | 194.85 |     14 |            1 |
| 16 |  194.6 |     11 |            1 |
| 17 | 194.85 |     93 |            1 |
| 18 |    195 |     90 |            1 |
| 19 |    195 |    100 |            1 |
| 20 |    195 |     50 |            1 |
| 21 |    195 |     50 |            1 |
| 22 |    195 |     25 |            1 |
| 23 |    195 |      5 |            1 |
| 24 |    195 |    500 |            1 |
| 25 |    195 |    100 |            1 |
| 26 | 195.09 |    100 |            1 |
| 27 |    195 |    120 |            1 |
| 28 |    195 |     60 |            1 |
| 29 |    195 |     40 |            1 |
| 30 |    195 |     10 |            1 |
| 31 |  194.6 |      1 |            1 |
| 32 | 194.99 |      1 |            1 |
| 33 | 194.81 |     20 |            1 |
| 34 | 194.81 |     50 |            1 |
| 35 | 194.97 |     17 |            1 |
| 36 | 194.99 |     25 |            1 |
| 37 |    195 |     75 |            1 |
+----+--------+--------+--------------+

For quicker testing, here is the same table as a pandas DataFrame (assuming pandas is imported as pd):

pd_data_before = pd.DataFrame([[1,194.6,100,1],[2,195,10,1],[3,194.92,100,1],[4,194.92,52,1],[5,194.9,99,1],[6,194.86,74,1],[7,194.85,900,1],[8,194.85,25,1],[9,194.85,224,1],[10,194.6,101,1],[11,194.85,19,1],[12,194.6,10,1],[13,194.6,25,1],[14,194.53,12,1],[15,194.85,14,1],[16,194.6,11,1],[17,194.85,93,1],[18,195,90,1],[19,195,100,1],[20,195,50,1],[21,195,50,1],[22,195,25,1],[23,195,5,1],[24,195,500,1],[25,195,100,1],[26,195.09,100,1],[27,195,120,1],[28,195,60,1],[29,195,40,1],[30,195,10,1],[31,194.6,1,1],[32,194.99,1,1],[33,194.81,20,1],[34,194.81,50,1],[35,194.97,17,1],[36,194.99,25,1],[37,195,75,1]],columns=['Nr','Price','Volume','Transactions'])

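The question is: how can we sum Volume and Transactions over runs of consecutive rows that share the same Price? Each run should collapse into a single row that keeps the Nr of its last row. The final result would look like this: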

+----+--------+--------+--------------+
| Nr | Price  | Volume | Transactions |
+----+--------+--------+--------------+
|  1 |  194.6 |    100 |            1 |
|  2 |    195 |     10 |            1 |
|  4 | 194.92 |    152 |            2 |
|  5 |  194.9 |     99 |            1 |
|  6 | 194.86 |     74 |            1 |
|  9 | 194.85 |   1149 |            3 |
| 10 |  194.6 |    101 |            1 |
| 11 | 194.85 |     19 |            1 |
| 13 |  194.6 |     35 |            2 |
| 14 | 194.53 |     12 |            1 |
| 15 | 194.85 |     14 |            1 |
| 16 |  194.6 |     11 |            1 |
| 17 | 194.85 |     93 |            1 |
| 25 |    195 |    920 |            8 |
| 26 | 195.09 |    100 |            1 |
| 30 |    195 |    230 |            4 |
| 31 |  194.6 |      1 |            1 |
| 32 | 194.99 |      1 |            1 |
| 34 | 194.81 |     70 |            2 |
| 35 | 194.97 |     17 |            1 |
| 36 | 194.99 |     25 |            1 |
| 37 |    195 |     75 |            1 |
+----+--------+--------+--------------+

The expected result is also available as a pandas DataFrame:

pd_data_after = pd.DataFrame([[1,194.6,100,1],[2,195,10,1],[4,194.92,152,2],[5,194.9,99,1],[6,194.86,74,1],[9,194.85,1149,3],[10,194.6,101,1],[11,194.85,19,1],[13,194.6,35,2],[14,194.53,12,1],[15,194.85,14,1],[16,194.6,11,1],[17,194.85,93,1],[25,195,920,8],[26,195.09,100,1],[30,195,230,4],[31,194.6,1,1],[32,194.99,1,1],[34,194.81,70,2],[35,194.97,17,1],[36,194.99,25,1],[37,195,75,1]],columns=['Nr','Price','Volume','Transactions'])

I managed to do this with a for loop, but iterating over every row is very slow. My dataset is huge, about 50 million rows. Is there a way to do this without looping?

1 Answer:

Answer 0: (score: 2)

A common trick for grouping consecutive equal values is the following:

df.col.ne(df.col.shift()).cumsum()
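
To see why the expression above yields a group key, here is a minimal sketch on a hypothetical toy Series (the name price and its values are made up for illustration, not taken from the question's full data). ne(shift()) flags the first row of every run, and cumsum() turns those flags into a run id:

```python
import pandas as pd

# Hypothetical toy prices, for illustration only
price = pd.Series([194.6, 195.0, 194.92, 194.92, 194.85, 194.85, 194.85, 194.6])

# True wherever the value differs from the previous row. The first row is
# always True, because shift() puts NaN there and NaN never compares equal.
starts = price.ne(price.shift())

# The running total of the start flags gives every run its own id
run_id = starts.cumsum()

print(run_id.tolist())  # [1, 2, 3, 3, 4, 4, 4, 5]
```

Rows with the same run id are consecutive and share the same value, so the id can be passed straight to groupby. The last 194.6 gets a new id (5) rather than rejoining the first one, which is exactly the "consecutive only" behaviour the question asks for.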

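We can use it here as the groupby key, then use agg to keep a single value per run for the columns we aren't summing (the last Nr and the first Price) and to sum the columns we do want to sum (Volume and Transactions). Here df is the question's pd_data_before.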

(df.groupby(df.Price.ne(df.Price.shift()).cumsum())
    .agg({'Nr': 'last', 'Price': 'first', 'Volume':'sum', 'Transactions': 'sum'})
).reset_index(drop=True)

    Nr   Price  Volume  Transactions 
0    1  194.60     100             1 
1    2  195.00      10             1 
2    4  194.92     152             2 
3    5  194.90      99             1 
4    6  194.86      74             1 
5    9  194.85    1149             3 
6   10  194.60     101             1 
7   11  194.85      19             1 
8   13  194.60      35             2 
9   14  194.53      12             1 
10  15  194.85      14             1 
11  16  194.60      11             1 
12  17  194.85      93             1 
13  25  195.00     920             8 
14  26  195.09     100             1 
15  30  195.00     230             4 
16  31  194.60       1             1 
17  32  194.99       1             1 
18  34  194.81      70             2 
19  35  194.97      17             1 
20  36  194.99      25             1 
21  37  195.00      75             1
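
As a self-contained check, the sketch below wraps the same groupby/agg call in a helper and compares it with the expected rows. The function name collapse_runs and the variable names are assumptions introduced here; the data is only the first 13 rows of the question's table, with the matching first 9 rows of the expected result.

```python
import pandas as pd

def collapse_runs(df):
    """Collapse consecutive rows with the same Price into one row."""
    # Run id: increases by 1 every time Price differs from the previous row
    run_id = df["Price"].ne(df["Price"].shift()).cumsum()
    return (
        df.groupby(run_id)
          .agg({"Nr": "last", "Price": "first",
                "Volume": "sum", "Transactions": "sum"})
          .reset_index(drop=True)
    )

cols = ["Nr", "Price", "Volume", "Transactions"]

# First 13 rows of the question's input table
before = pd.DataFrame(
    [[1, 194.6, 100, 1], [2, 195, 10, 1], [3, 194.92, 100, 1],
     [4, 194.92, 52, 1], [5, 194.9, 99, 1], [6, 194.86, 74, 1],
     [7, 194.85, 900, 1], [8, 194.85, 25, 1], [9, 194.85, 224, 1],
     [10, 194.6, 101, 1], [11, 194.85, 19, 1], [12, 194.6, 10, 1],
     [13, 194.6, 25, 1]],
    columns=cols,
)

# Matching first 9 rows of the question's expected result
expected = pd.DataFrame(
    [[1, 194.6, 100, 1], [2, 195, 10, 1], [4, 194.92, 152, 2],
     [5, 194.9, 99, 1], [6, 194.86, 74, 1], [9, 194.85, 1149, 3],
     [10, 194.6, 101, 1], [11, 194.85, 19, 1], [13, 194.6, 35, 2]],
    columns=cols,
)

result = collapse_runs(before)

# Raises an AssertionError if the frames differ in values, dtypes or index
pd.testing.assert_frame_equal(result, expected)
print(result)
```

Everything here is a column-wise pandas operation, so there is no Python-level loop over rows. The comparison uses exact equality on Price, so prices that differ only by floating-point noise would be treated as separate runs.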