Cartesian product of all items in a given group using pandas

Time: 2019-06-06 19:57:46

Tags: python pandas pandas-groupby itertools

So I'm starting with a DataFrame that looks like this:

       id        tof
0    43.0  1999991.0
1    43.0  2095230.0
2    43.0  4123105.0
3    43.0  5560423.0
4    46.0  2098996.0
5    46.0  2114971.0
6    46.0  4130033.0
7    46.0  4355096.0
8    82.0  2055207.0
9    82.0  2093996.0
10   82.0  4193587.0
11   90.0  2059360.0
12   90.0  2083762.0
13   90.0  2648235.0
14   90.0  4212177.0
15  103.0  1993306.0
          .
          .
          .

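Ultimately, my goal is to create a very long two-dimensional array containing all combinations of items that share the same id, like this (for the rows with id 43):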

[(1993306.0, 2105441.0), (1993306.0, 3972679.0), (1993306.0, 3992558.0), (1993306.0, 4009044.0), (2105441.0, 3972679.0), (2105441.0, 3992558.0), (2105441.0, 4009044.0), (3972679.0, 3992558.0), (3972679.0, 4009044.0), (3992558.0, 4009044.0),...]

Except with all the tuples changed to arrays, so that I can transpose the array after iterating over all id numbers.

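Naturally, itertools came to mind. My first thought was to do something with df.groupby('id') so that itertools is applied within each group of rows sharing an id, but I suspect this would take forever on the million-row data files I have.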

Is there a vectorized way to do this?

3 Answers:

Answer 0: (score: 2)

IIUC:

from itertools import combinations

import pandas as pd

pd.DataFrame([
    [k, c0, c1] for k, tof in df.groupby('id').tof
           for c0, c1 in combinations(tof, 2)
], columns=['id', 'tof0', 'tof1'])

      id       tof0       tof1
0   43.0  1999991.0  2095230.0
1   43.0  1999991.0  4123105.0
2   43.0  1999991.0  5560423.0
3   43.0  2095230.0  4123105.0
4   43.0  2095230.0  5560423.0
5   43.0  4123105.0  5560423.0
6   46.0  2098996.0  2114971.0
7   46.0  2098996.0  4130033.0
8   46.0  2098996.0  4355096.0
9   46.0  2114971.0  4130033.0
10  46.0  2114971.0  4355096.0
11  46.0  4130033.0  4355096.0
12  82.0  2055207.0  2093996.0
13  82.0  2055207.0  4193587.0
14  82.0  2093996.0  4193587.0
15  90.0  2059360.0  2083762.0
16  90.0  2059360.0  2648235.0
17  90.0  2059360.0  4212177.0
18  90.0  2083762.0  2648235.0
19  90.0  2083762.0  4212177.0
20  90.0  2648235.0  4212177.0

Explanation

This is a list comprehension that builds a list of lists, which is then passed to the DataFrame constructor. Look up comprehensions to understand this better.

from itertools import combinations

pd.DataFrame([
    #            name   series of tof values
    #               ↓   ↓    
    [k, c0, c1] for k, tof in df.groupby('id').tof
    #    items from combinations
    #      first    second
    #          ↓    ↓
           for c0, c1 in combinations(tof, 2)
], columns=['id', 'tof0', 'tof1'])
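For readers less familiar with comprehensions, the same logic can be written as two explicit loops. This is only a sketch: the small `df` below is a sample built from the question's rows for ids 43 and 82, and the names `rows`, `pairs` and `pair_array` are illustrative, not from the answer. The last line shows one way to get the transposed array the question asks for.

```python
from itertools import combinations

import pandas as pd

# Small sample taken from the question's data (ids 43 and 82 only).
df = pd.DataFrame({
    "id": [43.0, 43.0, 43.0, 43.0, 82.0, 82.0, 82.0],
    "tof": [1999991.0, 2095230.0, 4123105.0, 5560423.0,
            2055207.0, 2093996.0, 4193587.0],
})

rows = []
# Outer loop: one (group key, Series of tof values) pair per id.
for k, tof in df.groupby("id").tof:
    # Inner loop: every unordered pair of values within that group.
    for c0, c1 in combinations(tof, 2):
        rows.append([k, c0, c1])

pairs = pd.DataFrame(rows, columns=["id", "tof0", "tof1"])

# Transposed 2 x N array of the pairs (row 0 = first item, row 1 = second item).
pair_array = pairs[["tof0", "tof1"]].to_numpy().T
```

Note that this still loops in Python, so it is not vectorized; it only makes the structure of the comprehension explicit.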

Answer 1: (score: 1)

from itertools import product
x = df[df.id == 13].tof.values.astype(float)
all_combinations = list(product(x,x))

If you don't want repeated elements (no item paired with itself, and each pair listed only once), you can use

from itertools import combinations
x = df[df.id == 13].tof.values.astype(float)
all_combinations = list(combinations(x,2))
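To make the difference between the two calls concrete, here is a small sketch on a hypothetical three-element list (the values are made up and stand in for one group's tof column):

```python
from itertools import combinations, product

x = [1.0, 2.0, 3.0]  # hypothetical values, not taken from the question's data

# product(x, x): every ordered pair, including self-pairs and both orders.
all_ordered = list(product(x, x))

# combinations(x, 2): each unordered pair exactly once, no self-pairs.
unordered = list(combinations(x, 2))

print(len(all_ordered))  # 9
print(unordered)         # [(1.0, 2.0), (1.0, 3.0), (2.0, 3.0)]
```

For a group of n items, `product` yields n * n pairs while `combinations` yields n * (n - 1) / 2, which matters for the million-row case in the question.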

Answer 2: (score: 1)

Groupby does work:

def get_product(x):
    return pd.MultiIndex.from_product((x.tof, x.tof)).values

for i, g in df.groupby('id'):
    print(i, get_product(g))

Output:

43.0 [(1999991.0, 1999991.0) (1999991.0, 2095230.0) (1999991.0, 4123105.0)
 (1999991.0, 5560423.0) (2095230.0, 1999991.0) (2095230.0, 2095230.0)
 (2095230.0, 4123105.0) (2095230.0, 5560423.0) (4123105.0, 1999991.0)
 (4123105.0, 2095230.0) (4123105.0, 4123105.0) (4123105.0, 5560423.0)
 (5560423.0, 1999991.0) (5560423.0, 2095230.0) (5560423.0, 4123105.0)
 (5560423.0, 5560423.0)]
46.0 [(2098996.0, 2098996.0) (2098996.0, 2114971.0) (2098996.0, 4130033.0)
 (2098996.0, 4355096.0) (2114971.0, 2098996.0) (2114971.0, 2114971.0)
 (2114971.0, 4130033.0) (2114971.0, 4355096.0) (4130033.0, 2098996.0)
 (4130033.0, 2114971.0) (4130033.0, 4130033.0) (4130033.0, 4355096.0)
 (4355096.0, 2098996.0) (4355096.0, 2114971.0) (4355096.0, 4130033.0)
 (4355096.0, 4355096.0)]
82.0 [(2055207.0, 2055207.0) (2055207.0, 2093996.0) (2055207.0, 4193587.0)
 (2093996.0, 2055207.0) (2093996.0, 2093996.0) (2093996.0, 4193587.0)
 (4193587.0, 2055207.0) (4193587.0, 2093996.0) (4193587.0, 4193587.0)]
90.0 [(2059360.0, 2059360.0) (2059360.0, 2083762.0) (2059360.0, 2648235.0)
 (2059360.0, 4212177.0) (2083762.0, 2059360.0) (2083762.0, 2083762.0)
 (2083762.0, 2648235.0) (2083762.0, 4212177.0) (2648235.0, 2059360.0)
 (2648235.0, 2083762.0) (2648235.0, 2648235.0) (2648235.0, 4212177.0)
 (4212177.0, 2059360.0) (4212177.0, 2083762.0) (4212177.0, 2648235.0)
 (4212177.0, 4212177.0)]
103.0 [(1993306.0, 1993306.0)]
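None of the answers above avoids a Python-level loop. One possible vectorized alternative, which is not from any of the answers, is a self-merge on id followed by a filter on within-group position. This is a sketch under assumptions: the sample `df` is built from a few of the question's rows, and the `pos` column and the names `merged`, `pairs` and `pair_array` are introduced here for illustration.

```python
import pandas as pd

# Sample built from the question's rows for ids 43, 82 and 103.
df = pd.DataFrame({
    "id": [43.0, 43.0, 43.0, 43.0, 82.0, 82.0, 82.0, 103.0],
    "tof": [1999991.0, 2095230.0, 4123105.0, 5560423.0,
            2055207.0, 2093996.0, 4193587.0, 1993306.0],
})

# Position of each row within its id group: 0, 1, 2, ...
df = df.assign(pos=df.groupby("id").cumcount())

# Self-join on id: every ordered pair within each group (the Cartesian product).
merged = df.merge(df, on="id", suffixes=("0", "1"))

# Keeping pos0 < pos1 drops self-pairs and reversed duplicates,
# leaving the same pairs that itertools.combinations would produce.
pairs = merged.loc[merged["pos0"] < merged["pos1"],
                   ["id", "tof0", "tof1"]].reset_index(drop=True)

# Transposed 2 x N array of the pairs.
pair_array = pairs[["tof0", "tof1"]].to_numpy().T
```

The trade-off is memory: the merge materializes the full n * n product for each group before filtering, so it is only a good fit when individual groups are small, as they appear to be in the question's data.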