Python: How can I look up values in a DataFrame without a loop?

Asked: 2016-12-08 19:57:59

Tags: python pandas merge group-by

I have two DataFrames, net and M:

net =
        i  j   d
    0   5  3   3 
    1   2  0   2
    2   3  2   1 
    3   4  5   2   
    4   0  1   3
    5   0  3   4


M =
    0    1    2    3    4    5
0   0    3    2    4    1    5 
1   3    0    2    0    3    3 
2   2    2    0    1    1    4 
3   4    0    1    0    3    3     
4   1    3    1    3    0    2
5   5    3    4    3    2    0

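For each value in net['d'], I want to find the cells of M that hold the same value, pick one of those cells at random, and build a new DataFrame containing the coordinates of the chosen cell. For example: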

net['d'][0] = 3  

So in M I find:

M[0][1]
M[1][0]
M[1][4]
M[1][5]
...
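The full list of matching cells for a single value can be produced in one call. The sketch below is only an illustration: it assumes M is held as a plain NumPy array, so coordinates are (row, column) pairs, whereas `M[0][1]` on a DataFrame selects column 0 and then row 1. Because this M is symmetric, both readings give the same value.

```python
import numpy as np

M = np.array([[0, 3, 2, 4, 1, 5],
              [3, 0, 2, 0, 3, 3],
              [2, 2, 0, 1, 1, 4],
              [4, 0, 1, 0, 3, 3],
              [1, 3, 1, 3, 0, 2],
              [5, 3, 4, 3, 2, 0]])

# Each row of `cells` is one (row, column) pair where M equals 3.
cells = np.argwhere(M == 3)
print(cells.tolist())
```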

In the end, net1 would look something like this:

   net1 =
       i1  j1   d1
    0   1   5    3 
    1   5   4    2
    2   2   3    1 
    3   1   2    2   
    4   1   5    3
    5   3   0    4

What I am doing at the moment:

import numpy as np
import pandas as pd

I1 = []
J1 = []
for i in net.index:
    tmp = net['d'][i]
    ds = np.where(M == tmp)        # row and column positions where M equals tmp
    size = len(ds[0])
    ind = np.random.randint(size)  # pick one of the matching cells at random
    h = ds[0][ind]
    w = ds[1][ind]
    I1.append(h)
    J1.append(w)
net1 = pd.DataFrame()
net1['i1'] = I1
net1['j1'] = J1
net1['d1'] = net['d']

I would like to know a good way to avoid this loop.

1 answer:

Answer 0 (score: 0):

You can stack the columns of M into a long table and then sample from it with replacement:

import numpy as np
import pandas as pd

net = pd.DataFrame({'i':[5,2,3,4,0,0], 
                    'j':[3,0,2,5,1,3], 
                    'd':[3,2,1,2,3,4]})

M = pd.DataFrame({0:[0,3,2,4,1,5], 
                  1:[3,0,2,0,3,3], 
                  2:[2,2,0,1,1,4],
                  3:[4,0,1,0,3,3],
                  4:[1,3,1,3,0,2],
                  5:[5,3,4,3,2,0]})

def random_net(net, M):
    # make a long table with one row per cell of M and rename the columns
    net1 = M.stack().reset_index()
    net1.columns =['i1', 'j1', 'd1']

    # get size of each group for random mapping
    net1_id_length = net1.groupby('d1').size()

    # work on a copy so that the id column is not added to the caller's net
    net_copy = net.copy()

    # first map gets size of each group and second gets random integer
    net_copy['id'] = net_copy['d'].map(net1_id_length).map(np.random.randint)
    net1['id'] = net1.groupby('d1').cumcount()

    # make for easy lookup
    net_copy = net_copy.set_index(['d', 'id'])
    net1 = net1.set_index(['d1', 'id'])

    # choose from net1 only those from original net
    return net1.reindex(net_copy.index).reset_index('d').reset_index(drop=True).rename(columns={'d':'d1'})

random_net(net, M)

Output (it varies from run to run because the cells are chosen at random):

   d1  i1  j1
0   3   5   1
1   2   0   2
2   1   3   2
3   2   1   2
4   3   3   5
5   4   0   3
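A quick way to confirm that a result is valid is to check that M really holds d1 at every listed (i1, j1). The sketch below does this for the sample output printed above, typed in by hand. The helper name `coordinates_match` is hypothetical, and the check assumes that the row and column labels of M are 0..5, so that labels and positions coincide.

```python
import pandas as pd

M = pd.DataFrame({0: [0, 3, 2, 4, 1, 5],
                  1: [3, 0, 2, 0, 3, 3],
                  2: [2, 2, 0, 1, 1, 4],
                  3: [4, 0, 1, 0, 3, 3],
                  4: [1, 3, 1, 3, 0, 2],
                  5: [5, 3, 4, 3, 2, 0]})

# The sample output shown above, copied by hand.
net1 = pd.DataFrame({'d1': [3, 2, 1, 2, 3, 4],
                     'i1': [5, 0, 3, 1, 3, 0],
                     'j1': [1, 2, 2, 2, 5, 3]})


def coordinates_match(net1, M):
    """Return True if M holds d1 at every (i1, j1) listed in net1."""
    # Fancy indexing reads all listed cells of M in one step.
    found = M.to_numpy()[net1['i1'].to_numpy(), net1['j1'].to_numpy()]
    return bool((found == net1['d1'].to_numpy()).all())


print(coordinates_match(net1, M))  # True
```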

Timing on 6 million rows:

n = 1000000
net = pd.DataFrame({'i':[5,2,3,4,0,0] * n, 
                    'j':[3,0,2,5,1,3] * n, 
                    'd':[3,2,1,2,3,4] * n})

M = pd.DataFrame({0:[0,3,2,4,1,5], 
                  1:[3,0,2,0,3,3], 
                  2:[2,2,0,1,1,4],
                  3:[4,0,1,0,3,3],
                  4:[1,3,1,3,0,2],
                  5:[5,3,4,3,2,0]})

%timeit random_net(net, M)

1 loop, best of 3: 13.7 s per loop
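
If the pandas index machinery turns out to be the bottleneck, the same idea can be expressed directly in NumPy. This is a different technique from the answer above, not a rewrite of it: the flattened matrix is sorted so that equal values form contiguous runs, `np.searchsorted` locates each run, and one random offset is drawn inside it. The function name `random_net_numpy` and its `rng` parameter are hypothetical, and the sketch assumes the row and column labels of M are 0..n-1. It has not been timed here.

```python
import numpy as np
import pandas as pd


def random_net_numpy(net, M, rng=None):
    """For every value in net['d'], pick one random cell of M holding it."""
    rng = np.random.default_rng() if rng is None else rng
    values = M.to_numpy()
    flat = values.ravel()

    # Sort the flattened matrix so that equal values sit next to each other.
    order = np.argsort(flat, kind='stable')
    sorted_vals = flat[order]

    d = net['d'].to_numpy()
    # Start and end of the run of cells equal to each requested value.
    start = np.searchsorted(sorted_vals, d, side='left')
    end = np.searchsorted(sorted_vals, d, side='right')
    count = end - start
    if (count == 0).any():
        raise ValueError("some values of net['d'] do not occur in M")

    # One uniform random offset inside each run, drawn for all rows at once.
    offset = rng.integers(0, count)
    picked = order[start + offset]

    # Convert flat positions back to (row, column) coordinates.
    i1, j1 = np.unravel_index(picked, values.shape)
    return pd.DataFrame({'i1': i1, 'j1': j1, 'd1': d})


net = pd.DataFrame({'i': [5, 2, 3, 4, 0, 0],
                    'j': [3, 0, 2, 5, 1, 3],
                    'd': [3, 2, 1, 2, 3, 4]})

M = pd.DataFrame({0: [0, 3, 2, 4, 1, 5],
                  1: [3, 0, 2, 0, 3, 3],
                  2: [2, 2, 0, 1, 1, 4],
                  3: [4, 0, 1, 0, 3, 3],
                  4: [1, 3, 1, 3, 0, 2],
                  5: [5, 3, 4, 3, 2, 0]})

result = random_net_numpy(net, M, rng=np.random.default_rng(0))
print(result)
```

Passing a seeded generator, as in the usage lines, makes a run reproducible; leaving `rng` as `None` draws fresh random cells each time.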