So I'm starting with a DataFrame that looks like this:
id tof
0 43.0 1999991.0
1 43.0 2095230.0
2 43.0 4123105.0
3 43.0 5560423.0
4 46.0 2098996.0
5 46.0 2114971.0
6 46.0 4130033.0
7 46.0 4355096.0
8 82.0 2055207.0
9 82.0 2093996.0
10 82.0 4193587.0
11 90.0 2059360.0
12 90.0 2083762.0
13 90.0 2648235.0
14 90.0 4212177.0
15 103.0 1993306.0
.
.
.
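Ultimately, my goal is to create a very long two-dimensional array containing all combinations of items that share the same id, like this (for the rows with id 43):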
[(1993306.0, 2105441.0), (1993306.0, 3972679.0), (1993306.0, 3992558.0), (1993306.0, 4009044.0), (2105441.0, 3972679.0), (2105441.0, 3992558.0), (2105441.0, 4009044.0), (3972679.0, 3992558.0), (3972679.0, 4009044.0), (3992558.0, 4009044.0),...]
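Except with all of the tuples changed to arrays, so that I can transpose the array after iterating over all the id numbers.
Naturally, itertools came to mind. My first thought was to do something with df.groupby('id') so that itertools is applied to each group of rows sharing an id, but I imagine that would be very slow on the million-row data files I have.
Is there a vectorized way to do this?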
Answer 0 (score: 2)
IIUC:
from itertools import combinations
import pandas as pd

pd.DataFrame([
    [k, c0, c1] for k, tof in df.groupby('id').tof
    for c0, c1 in combinations(tof, 2)
], columns=['id', 'tof0', 'tof1'])
id tof0 tof1
0 43.0 1999991.0 2095230.0
1 43.0 1999991.0 4123105.0
2 43.0 1999991.0 5560423.0
3 43.0 2095230.0 4123105.0
4 43.0 2095230.0 5560423.0
5 43.0 4123105.0 5560423.0
6 46.0 2098996.0 2114971.0
7 46.0 2098996.0 4130033.0
8 46.0 2098996.0 4355096.0
9 46.0 2114971.0 4130033.0
10 46.0 2114971.0 4355096.0
11 46.0 4130033.0 4355096.0
12 82.0 2055207.0 2093996.0
13 82.0 2055207.0 4193587.0
14 82.0 2093996.0 4193587.0
15 90.0 2059360.0 2083762.0
16 90.0 2059360.0 2648235.0
17 90.0 2059360.0 4212177.0
18 90.0 2083762.0 2648235.0
19 90.0 2083762.0 4212177.0
20 90.0 2648235.0 4212177.0
This is a list comprehension that returns a list of lists, which is then wrapped by the DataFrame constructor. Look up comprehensions to understand it better.
from itertools import combinations

pd.DataFrame([
    #            name  series of tof values
    #               ↓  ↓
    [k, c0, c1] for k, tof in df.groupby('id').tof
    # items from combinations
    # first second
    #   ↓   ↓
    for c0, c1 in combinations(tof, 2)
], columns=['id', 'tof0', 'tof1'])
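Since the question asks for a 2D array that can be transposed rather than a DataFrame, the same comprehension can feed np.array directly. This is only a sketch: the small df below is an illustrative subset of the question's data (ids 43, 82 and 103), and the names pairs, first and second are made up for the example.

```python
from itertools import combinations

import numpy as np
import pandas as pd

# Illustrative subset of the question's data (ids 43, 82 and 103).
df = pd.DataFrame({
    'id':  [43.0, 43.0, 43.0, 43.0, 82.0, 82.0, 82.0, 103.0],
    'tof': [1999991.0, 2095230.0, 4123105.0, 5560423.0,
            2055207.0, 2093996.0, 4193587.0, 1993306.0],
})

# One row per within-group pair; a group with a single row yields no pairs.
pairs = np.array([
    c for _, tof in df.groupby('id').tof
    for c in combinations(tof, 2)
])

# Transposing gives one array of first items and one of second items.
first, second = pairs.T
```

Note that this still loops in Python over every pair, so it only changes the container, not the cost.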
Answer 1 (score: 1)
from itertools import product
x = df[df.id == 13].tof.values.astype(float)
all_combinations = list(product(x,x))
If you don't want repeated elements, you can use
from itertools import combinations
x = df[df.id == 13].tof.values.astype(float)
all_combinations = list(combinations(x,2))
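To make the difference between the two concrete, here is a tiny sketch on three made-up values (x, all_pairs and unique_pairs are names chosen for the example, not taken from the question). product pairs every value with every value, including itself and in both orders, while combinations keeps each unordered pair once.

```python
from itertools import combinations, product

x = [1.0, 2.0, 3.0]  # made-up values for illustration

# Every ordered pair, including (a, a) and both (a, b) and (b, a).
all_pairs = list(product(x, x))

# Each unordered pair of distinct positions exactly once.
unique_pairs = list(combinations(x, 2))

print(len(all_pairs), len(unique_pairs))
```

For a group of n rows that is n * n pairs versus n * (n - 1) / 2.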
Answer 2 (score: 1)
Groupby does work:
def get_product(x):
    return pd.MultiIndex.from_product((x.tof, x.tof)).values

for i, g in df.groupby('id'):
    print(i, get_product(g))

Output:
43.0 [(1999991.0, 1999991.0) (1999991.0, 2095230.0) (1999991.0, 4123105.0)
(1999991.0, 5560423.0) (2095230.0, 1999991.0) (2095230.0, 2095230.0)
(2095230.0, 4123105.0) (2095230.0, 5560423.0) (4123105.0, 1999991.0)
(4123105.0, 2095230.0) (4123105.0, 4123105.0) (4123105.0, 5560423.0)
(5560423.0, 1999991.0) (5560423.0, 2095230.0) (5560423.0, 4123105.0)
(5560423.0, 5560423.0)]
46.0 [(2098996.0, 2098996.0) (2098996.0, 2114971.0) (2098996.0, 4130033.0)
(2098996.0, 4355096.0) (2114971.0, 2098996.0) (2114971.0, 2114971.0)
(2114971.0, 4130033.0) (2114971.0, 4355096.0) (4130033.0, 2098996.0)
(4130033.0, 2114971.0) (4130033.0, 4130033.0) (4130033.0, 4355096.0)
(4355096.0, 2098996.0) (4355096.0, 2114971.0) (4355096.0, 4130033.0)
(4355096.0, 4355096.0)]
82.0 [(2055207.0, 2055207.0) (2055207.0, 2093996.0) (2055207.0, 4193587.0)
(2093996.0, 2055207.0) (2093996.0, 2093996.0) (2093996.0, 4193587.0)
(4193587.0, 2055207.0) (4193587.0, 2093996.0) (4193587.0, 4193587.0)]
90.0 [(2059360.0, 2059360.0) (2059360.0, 2083762.0) (2059360.0, 2648235.0)
(2059360.0, 4212177.0) (2083762.0, 2059360.0) (2083762.0, 2083762.0)
(2083762.0, 2648235.0) (2083762.0, 4212177.0) (2648235.0, 2059360.0)
(2648235.0, 2083762.0) (2648235.0, 2648235.0) (2648235.0, 4212177.0)
(4212177.0, 2059360.0) (4212177.0, 2083762.0) (4212177.0, 2648235.0)
(4212177.0, 4212177.0)]
103.0 [(1993306.0, 1993306.0)]
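The output above pairs each value with itself and lists every pair in both orders. If only the unique pairs are wanted, as in the question, one possible sketch is to index each group with np.triu_indices, which builds the pairs for a group in NumPy rather than in a Python loop over pairs. The groups themselves are still iterated in Python, so this is not a fully vectorized solution. The df below is an illustrative subset of the question's data, and get_pairs is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

# Illustrative subset of the question's data (ids 43, 82 and 103).
df = pd.DataFrame({
    'id':  [43.0, 43.0, 43.0, 43.0, 82.0, 82.0, 82.0, 103.0],
    'tof': [1999991.0, 2095230.0, 4123105.0, 5560423.0,
            2055207.0, 2093996.0, 4193587.0, 1993306.0],
})

def get_pairs(tof):
    """Return an (n_pairs, 2) array of the unique pairs in one group."""
    values = tof.to_numpy()
    # Indices strictly above the diagonal: i < j, so no self-pairs
    # and no mirrored duplicates.
    i, j = np.triu_indices(len(values), k=1)
    return np.column_stack((values[i], values[j]))

# Stack the per-group results; single-row groups contribute zero rows.
pairs = np.vstack([get_pairs(tof) for _, tof in df.groupby('id').tof])
print(pairs.shape)
```

The result can be transposed directly with pairs.T.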