Question

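I am analyzing a dataset that has an origin ID (column A), a destination ID (column B), and the number of trips made between them (column Count). I now want to add the A-B trips to the B-A trips. This sum is the total number of trips between A and B.

Here is what my data looks like (not necessarily sorted in the same way):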

    In [1]: group_station = pd.DataFrame([[1, 2, 100], [2, 1, 200], [4, 6, 5], [6, 4, 10], [1, 4, 70]], columns=['A', 'B', 'Count'])
    In [2]: group_station
    Out[2]:
       A  B  Count
    0  1  2    100
    1  2  1    200
    2  4  6      5
    3  6  4     10
    4  1  4     70

I want the following output:

       A  B  Count
    0  1  2    300
    1  4  6     15
    4  1  4     70

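I have tried groupby and setting the index to both variables, without success. Right now I am using a very inefficient double loop, which is too slow for the size of my dataset.

In case it helps, here is the code for the double loop (I removed some efficiency modifications to make it clearer):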

    import numpy as np

    # group_station is the dataframe with columns A, B, Count
    collapsed_group_station = np.zeros((len(group_station), 3))
    for i, row in enumerate(group_station.itertuples(index=False)):
        start_id, end_id, count = row

        # Scan every row again, looking for the reverse trip (B -> A)
        for check_row in group_station.itertuples(index=False):
            check_start_id, check_end_id, check_count = check_row

            if start_id == check_end_id and end_id == check_start_id:
                collapsed_group_station[i][0] = start_id
                collapsed_group_station[i][1] = end_id
                collapsed_group_station[i][2] = count + check_count
                break
I have ideas for making this code more efficient, but I would like to know whether there is a way to do this without loops.

Answer 1

You can use np.sort together with groupby.sum():

    import numpy as np
    import pandas as pd

    group_station[['A', 'B']] = np.sort(group_station[['A', 'B']], axis=1)
    group_station.groupby(['A', 'B'], as_index=False).Count.sum()
    Out[175]: 
       A  B  Count
    0  1  2    300
    1  1  4     70
    2  4  6     15
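
Sorting each row's (A, B) pair puts the smaller ID first, so a trip from 2 to 1 gets the same key as a trip from 1 to 2, and the groupby then sums both directions together. If you would rather not overwrite the original columns, the same idea can be sketched on a copy. This is only a minimal sketch built on the sample data from the question; the names `pairs`, `undirected` and `totals` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Sample data from the question.
group_station = pd.DataFrame(
    [[1, 2, 100], [2, 1, 200], [4, 6, 5], [6, 4, 10], [1, 4, 70]],
    columns=['A', 'B', 'Count'],
)

# Sort each (A, B) pair row-wise so the smaller ID comes first:
# (2, 1) becomes (1, 2), (6, 4) becomes (4, 6).
pairs = np.sort(group_station[['A', 'B']].to_numpy(), axis=1)

# assign() returns a new frame, so group_station keeps its directed pairs.
undirected = group_station.assign(A=pairs[:, 0], B=pairs[:, 1])

# Sum the counts for every unordered pair.
totals = undirected.groupby(['A', 'B'], as_index=False)['Count'].sum()
print(totals)
```

Note that the result is ordered by the sorted (A, B) keys, not by the row order of the input, so the (1, 4) pair comes before (4, 6).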
