I have two DataFrames, net and M.
net =
   i  j  d
0  5  3  3
1  2  0  2
2  3  2  1
3  4  5  2
4  0  1  3
5  0  3  4
M =
   0  1  2  3  4  5
0  0  3  2  4  1  5
1  3  0  2  0  3  3
2  2  2  0  1  1  4
3  4  0  1  0  3  3
4  1  3  1  3  0  2
5  5  3  4  3  2  0
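For each value in net['d'], I want to find the cells of M that hold the same value, pick one of those cells at random, and build a new DataFrame from the coordinates of the chosen cells. For example,
net['d'][0] = 3
so in M I find: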
M[0][1]
M[1][0]
M[1][4]
M[1][5]
...
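The full list of matching cells can be enumerated with np.argwhere. This is only a minimal sketch, assuming M is held as a plain NumPy array with the values shown above (M is symmetric, so row-first and column-first indexing give the same cells):

```python
import numpy as np

# Assumption: M as a plain NumPy array with the values from the table above.
M = np.array([[0, 3, 2, 4, 1, 5],
              [3, 0, 2, 0, 3, 3],
              [2, 2, 0, 1, 1, 4],
              [4, 0, 1, 0, 3, 3],
              [1, 3, 1, 3, 0, 2],
              [5, 3, 4, 3, 2, 0]])

# Each row of `matches` is one (row, column) pair whose cell equals 3.
matches = np.argwhere(M == 3)
print(len(matches))          # 10
print(matches[:4].tolist())  # [[0, 1], [1, 0], [1, 4], [1, 5]]
```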
In the end, net1 would look like this:
net1 =
   i1  j1  d1
0   1   5   3
1   5   4   2
2   2   3   1
3   1   2   2
4   1   5   3
5   3   0   4
What I am doing now:
import numpy as np
import pandas as pd
from numpy.random import randint

I1 = []
J1 = []
for i in net.index:
    tmp = net['d'][i]
    ds = np.where(M == tmp)  # row and column positions of the cells equal to tmp
    size = len(ds[0])
    ind = randint(size)      # pick one of the matching cells at random
    h = ds[0][ind]
    w = ds[1][ind]
    I1.append(h)
    J1.append(w)
net1 = pd.DataFrame()
net1['i1'] = I1
net1['j1'] = J1
net1['d1'] = net['d']
Is there a way to avoid this loop?
Answer 0 (score: 0)
You can stack the columns of M into a long table and then sample from it with replacement.
import numpy as np
import pandas as pd

net = pd.DataFrame({'i':[5,2,3,4,0,0],
                    'j':[3,0,2,5,1,3],
                    'd':[3,2,1,2,3,4]})
M = pd.DataFrame({0:[0,3,2,4,1,5],
                  1:[3,0,2,0,3,3],
                  2:[2,2,0,1,1,4],
                  3:[4,0,1,0,3,3],
                  4:[1,3,1,3,0,2],
                  5:[5,3,4,3,2,0]})
def random_net(net, M):
    # make a long table with one row per cell of M and rename the columns
    net1 = M.stack().reset_index()
    net1.columns = ['i1', 'j1', 'd1']
    # get size of each group for random mapping
    net1_id_length = net1.groupby('d1').size()
    # work on a copy so the caller's net is not modified
    net_copy = net.copy()
    # first map gets size of each group and second draws a random integer below that size
    net_copy['id'] = net_copy['d'].map(net1_id_length).map(np.random.randint)
    # number the rows within each group so that (d1, id) identifies a single cell of M
    net1['id'] = net1.groupby('d1').cumcount()
    # make for easy lookup
    net_copy = net_copy.set_index(['d', 'id'])
    net1 = net1.set_index(['d1', 'id'])
    # choose from net1 only those from original net
    return (net1.reindex(net_copy.index)
                .reset_index('d')
                .reset_index(drop=True)
                .rename(columns={'d': 'd1'}))

random_net(net, M)
Output
   d1  i1  j1
0   3   5   1
1   2   0   2
2   1   3   2
3   2   1   2
4   3   3   5
5   4   0   3
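To see why the lookup in random_net works, it can help to look at the intermediate long table on its own. The sketch below rebuilds only that step; the names long_table and sizes are illustrative and do not appear in the answer:

```python
import pandas as pd

M = pd.DataFrame({0: [0, 3, 2, 4, 1, 5],
                  1: [3, 0, 2, 0, 3, 3],
                  2: [2, 2, 0, 1, 1, 4],
                  3: [4, 0, 1, 0, 3, 3],
                  4: [1, 3, 1, 3, 0, 2],
                  5: [5, 3, 4, 3, 2, 0]})

# One row per cell of M: row label, column label, cell value.
long_table = M.stack().reset_index()
long_table.columns = ['i1', 'j1', 'd1']

# Running counter within each value, so (d1, id) points at exactly one cell.
long_table['id'] = long_table.groupby('d1').cumcount()

# Number of cells holding each value; a random id is drawn below this size.
sizes = long_table.groupby('d1').size()

print(len(long_table))  # 36 cells in total
print(sizes[3])         # the value 3 occurs in 10 cells
print(long_table[long_table['d1'] == 3].head())
```

Drawing a random integer below sizes[d] and looking up the pair (d, id) therefore selects one of the cells holding d uniformly at random, independently for every row of net.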
Timing on 6 million rows
n = 1000000
net = pd.DataFrame({'i':[5,2,3,4,0,0] * n,
                    'j':[3,0,2,5,1,3] * n,
                    'd':[3,2,1,2,3,4] * n})
M = pd.DataFrame({0:[0,3,2,4,1,5],
                  1:[3,0,2,0,3,3],
                  2:[2,2,0,1,1,4],
                  3:[4,0,1,0,3,3],
                  4:[1,3,1,3,0,2],
                  5:[5,3,4,3,2,0]})
%timeit random_net(net, M)
1 loop, best of 3: 13.7 s per loop
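For comparison, the same group-and-random-offset idea can be sketched in plain NumPy, without building the long table. This is an untimed alternative, not the method from the answer above: the function name random_net_numpy and its rng parameter are made up for illustration, and it assumes M uses the default 0..n-1 row and column labels, so that positions and labels coincide.

```python
import numpy as np
import pandas as pd


def random_net_numpy(net, M, rng=None):
    """For every value in net['d'], pick one random cell of M holding it."""
    rng = np.random.default_rng() if rng is None else rng
    values = np.asarray(M)
    flat = values.ravel()
    # Sort the flattened cells so that equal values form contiguous runs.
    order = np.argsort(flat, kind="stable")
    sorted_vals = flat[order]
    d = net["d"].to_numpy()
    # Start and end of the run of cells equal to each requested value.
    start = np.searchsorted(sorted_vals, d, side="left")
    end = np.searchsorted(sorted_vals, d, side="right")
    if np.any(start == end):
        raise ValueError("some values of net['d'] do not occur in M")
    # One uniform random position inside each run (sampling with replacement).
    pick = order[rng.integers(start, end)]
    rows, cols = np.unravel_index(pick, values.shape)
    return pd.DataFrame({"i1": rows, "j1": cols, "d1": d})
```

Calling random_net_numpy(net, M) returns the columns i1, j1, d1 in the order used in the question. Passing a seeded generator such as np.random.default_rng(0) makes the draw reproducible.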