Question

I have a very large DataFrame (hundreds of millions of rows). There are two group IDs, group_id_1 and group_id_2. The DataFrame looks like this:

group_id_1    group_id_2    value1    time
1             2             45        1
1             2             49        2
1             4             95        1
1             4             55        2
2             2             44        1
2             4             88        1
2             4             90        2

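For each group_id_1 x group_id_2 combination, I need to duplicate the row with the latest time and increment its time by 1. In other words, my table should look like this: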

group_id_1    group_id_2    value1    time
1             2             45        1
1             2             49        2
1             2             49        3
1             4             95        1
1             4             55        2
1             4             55        3
2             2             44        1
2             2             44        2
2             4             88        1
2             4             90        2
2             4             90        3

Right now, I am doing:

for name, group in df.groupby(['group_id_1', 'group_id_2']):
    last, = group.sort_values(by='time').tail(1)['time'].values
    temp = group[group['time']==last]
    temp.loc[:, 'time'] = last + 1
    group = group.append(temp)

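This is very inefficient. If I put the code above into a function and use the .apply() method on the groupby object, it also takes a very long time.

How can I speed this up?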

Answer 1

You can use groupby with the last aggregation, add 1 to the time column, and then concat the result back onto the original DataFrame.
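
The code that accompanied this answer is not shown here. As a minimal sketch of the groupby / last / add / concat idea, assuming the sample data from the question (the variable names `keys`, `last` and `result` are illustrative, and this is not necessarily the answerer's exact code), it might look like this:

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'group_id_1': [1, 1, 1, 1, 2, 2, 2],
    'group_id_2': [2, 2, 4, 4, 2, 4, 4],
    'value1':     [45, 49, 95, 55, 44, 88, 90],
    'time':       [1, 2, 1, 2, 1, 1, 2],
})

keys = ['group_id_1', 'group_id_2']

# Sort by time so that "last" means "latest"; as_index=False keeps
# the group keys as ordinary columns instead of moving them to the index.
last = df.sort_values('time').groupby(keys, as_index=False).last()

# Shift the copied rows one time step forward.
last['time'] = last['time'].add(1)

# Append the new rows in one go, then restore a per-group, per-time order.
result = (pd.concat([df, last], ignore_index=True)
            .sort_values(keys + ['time'])
            .reset_index(drop=True))

print(result)
```

Note that `last()` takes the last non-null value of each column within a group, so this sketch assumes the latest row of every group has no missing values.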

Answer 2

First, sort the DataFrame by time (this should be more efficient than sorting each group by time):

df = df.sort_values('time')

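Second, take the last row of each group (without sorting the groups, for performance; as_index=False keeps the group keys as columns so that the new rows line up with df when concatenated):

last = df.groupby(['group_id_1', 'group_id_2'], sort=False, as_index=False).last()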

Third, increment the time:

last['time'] = last['time'] + 1

Fourth, concatenate:

df = pd.concat([df, last])

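Fifth, sort back into the original order (including time, so each new row lands after the row it was copied from):

df = df.sort_values(['group_id_1', 'group_id_2', 'time'])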

Explanation: concatenating once and then sorting is much faster than inserting rows one at a time.
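
To check that claim on your own machine, a rough comparison along these lines may help. It is only a sketch: the synthetic data, the sizes, and the function names `make_frame`, `per_group_loop` and `concat_then_sort` are assumptions made for illustration, and the timings will vary by hardware and pandas version.

```python
import time

import numpy as np
import pandas as pd

KEYS = ['group_id_1', 'group_id_2']


def make_frame(n_groups=1000, rows_per_group=5, seed=0):
    """Build synthetic data: every group has times 1..rows_per_group."""
    rng = np.random.default_rng(seed)
    n = n_groups * rows_per_group
    group = np.arange(n_groups)
    return pd.DataFrame({
        'group_id_1': np.repeat(group // 50, rows_per_group),
        'group_id_2': np.repeat(group % 50, rows_per_group),
        'value1': rng.integers(0, 100, size=n),
        'time': np.tile(np.arange(1, rows_per_group + 1), n_groups),
    })


def per_group_loop(df):
    """Slow variant: handle each group separately in a Python loop."""
    pieces = []
    for _, group in df.groupby(KEYS):
        newest = group[group['time'] == group['time'].max()].copy()
        newest['time'] += 1
        pieces.append(pd.concat([group, newest]))
    return pd.concat(pieces, ignore_index=True)


def concat_then_sort(df):
    """Vectorized variant: one sort, one groupby, one concat, one final sort."""
    last = (df.sort_values('time')
              .groupby(KEYS, sort=False, as_index=False)
              .last())
    last['time'] += 1
    out = pd.concat([df, last], ignore_index=True)
    return out.sort_values(KEYS + ['time'], ignore_index=True)


df = make_frame()

start = time.perf_counter()
slow = per_group_loop(df)
loop_seconds = time.perf_counter() - start

start = time.perf_counter()
fast = concat_then_sort(df)
vectorized_seconds = time.perf_counter() - start

print(f'per-group loop:   {loop_seconds:.3f} s')
print(f'concat then sort: {vectorized_seconds:.3f} s')
```

Both functions should return the same rows in the same order, so the comparison measures only how the work is organized, not what is computed.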

Speeding up row duplication in a Pandas groupby?