
Question

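I am looking for a function that builds a new array of values arranged in the order given by ordered_ids, and that still performs well when the arrays hold about 1 million elements.

Input: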

    >>> ids=array(["WYOMING01","TEXAS01","TEXAS02",...])
    >>> values=array([12,20,30,...])
    >>> ordered_ids=array(["TEXAS01","TEXAS02","ALABAMA01",...])

Output:

    ordered [  20 , 30 , nan , ...]

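Summary

@Dietrich's approach of using a dictionary inside a list comprehension is 10 times faster than searching by numpy index (numpy.where). I compare the timings of three approaches in my answer below.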

Answer 1

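You can try:

import numpy as np

def order_array(ids, values, master_order_ids):
    # np.searchsorted requires master_order_ids to be sorted.
    n = len(master_order_ids)
    idx = np.searchsorted(master_order_ids, ids)
    # Keep only in-range positions whose id really matches.
    found = idx < n
    found[found] = master_order_ids[idx[found]] == ids[found]
    ordered_values = np.zeros(n)
    ordered_values[idx[found]] = values[found]
    print("ordered", ordered_values)
    return ordered_values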

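searchsorted gives you the indices at which each id would have to be inserted into master_order_ids to keep that array sorted. You then drop the (idx, value) pairs that fall outside the range of master_order_ids.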
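To make the behaviour of np.searchsorted concrete, here is a small sketch on made-up data (the arrays are illustrative only, and master_order_ids is assumed to be sorted already). It also shows that an index inside the valid range does not by itself mean the id was found, so an equality check is still needed.

```python
import numpy as np

# Illustrative data; master_order_ids must be sorted for searchsorted.
master_order_ids = np.array(["ALABAMA01", "TEXAS01", "TEXAS02"])
ids = np.array(["WYOMING01", "TEXAS01", "NEVADA01"])

n = len(master_order_ids)
# Insertion point of each id that would keep master_order_ids sorted.
idx = np.searchsorted(master_order_ids, ids)

# Out-of-range insertion points can never be matches.
in_range = idx < n

# Clamp before indexing, then confirm the id at that position is equal.
safe_idx = np.minimum(idx, n - 1)
matches = in_range & (master_order_ids[safe_idx] == ids)

print(idx, in_range, matches)
```

Here "NEVADA01" gets an in-range insertion point even though it does not occur in master_order_ids, which is why the range check alone is not sufficient.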

Answer 2

You can try using a dict() to associate the strings with your numbers. It simplifies the code considerably:

import numpy as np

def order_bydict(ids,values,master_order_ids):
    """ Using a dict to order ``master_order_ids`` """

    dd = dict(zip(ids, values))  # map each id to its value
    ordered_values = [dd.get(m, 0) for m in master_order_ids]  # get() returns 0 if the key is not found

    return np.asarray(ordered_values)  # return a numpy array instead of a list

Without testing on longer arrays it is hard to predict the speed gain (on your example it was 25% faster according to %timeit).
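The question's expected output uses nan for ids that are missing, whereas the function above fills in 0. A minimal variant of the same dictionary idea, assuming a float result is acceptable, might look like this (the name order_bydict_nan is made up for illustration):

```python
import numpy as np

def order_bydict_nan(ids, values, master_order_ids):
    """Dict lookup that fills ids missing from ``ids`` with nan."""
    lookup = dict(zip(ids, values))
    # dtype=float is needed because nan cannot be stored in an integer array.
    return np.array([lookup.get(m, np.nan) for m in master_order_ids], dtype=float)

# Sample data taken from the start of the question's arrays.
ids = np.array(["WYOMING01", "TEXAS01", "TEXAS02"])
values = np.array([12, 20, 30])
ordered_ids = np.array(["TEXAS01", "TEXAS02", "ALABAMA01"])

ordered = order_bydict_nan(ids, values, ordered_ids)
print("ordered", ordered)
```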

Answer 3

import time

import numpy
from numpy import arange

# SETUP
N = 10**4
ids = arange(0, N).astype(str)
values = arange(0, N)
numpy.random.shuffle(ids)
numpy.random.shuffle(values)
ordered_ids = arange(0, N).astype(str)

ordered_values = numpy.empty((N, 1))
ordered_values[:] = numpy.nan

# time.clock() no longer exists in Python 3; time.perf_counter() replaces it.

# METHOD 1: scan ordered_ids once per id
start = time.perf_counter()
for i in range(len(values)):
    ordered_values[ordered_ids == ids[i]] = values[i]
print("not using dictionary:", time.perf_counter() - start)

# METHOD 2: same scan, driven by a dictionary
start = time.perf_counter()
d = dict(zip(ids, values))
for k, v in d.items():
    ordered_values[ordered_ids == k] = v
print("using dictionary:", time.perf_counter() - start)

# METHOD 3: @Dietrich's approach in the answer above
start = time.perf_counter()
dd = dict(zip(ids, values))
ordered_values = [dd.get(m, 0) for m in ordered_ids]
print("using dictionary with list comprehension:", time.perf_counter() - start)

Results

not using dictionary: 1.320237 # Method 1
using dictionary: 1.327119 # Method 2
using dictionary with list comprehension: 0.013287 # @Dietrich

Answer 4

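The following solution, which uses the numpy_indexed package (disclaimer: I am its author), is purely vectorized and likely to be more efficient than the solutions posted so far: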

import numpy_indexed as npi
idx = npi.indices(ids, ordered_ids, missing='mask')
new_values = values[idx]
new_values[idx.mask] = -1   # or cast to float and set to nan, but you get the idea...
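For readers who cannot install an extra package, a comparable vectorized lookup can be sketched in plain NumPy. This is only an illustration of the general sort-and-search idea, not numpy_indexed's actual implementation. The function name order_vectorized is hypothetical, and the sketch assumes ids is non-empty and contains no duplicates.

```python
import numpy as np

def order_vectorized(ids, values, ordered_ids):
    """Return values arranged like ordered_ids, with nan for unknown ids."""
    # Sort ids once so that a binary search can be used on them.
    sorter = np.argsort(ids)
    sorted_ids = ids[sorter]

    # Candidate position of every ordered id inside the sorted ids.
    pos = np.searchsorted(sorted_ids, ordered_ids)
    pos = np.clip(pos, 0, len(ids) - 1)  # keep indices valid

    # A candidate counts only if the id found there is really equal.
    found = sorted_ids[pos] == ordered_ids

    out = np.full(len(ordered_ids), np.nan)
    out[found] = values[sorter[pos[found]]]
    return out

# Sample data taken from the start of the question's arrays.
ids = np.array(["WYOMING01", "TEXAS01", "TEXAS02"])
values = np.array([12, 20, 30])
ordered_ids = np.array(["TEXAS01", "TEXAS02", "ALABAMA01"])

result = order_vectorized(ids, values, ordered_ids)
print("ordered", result)
```

Unlike a per-id scan, this performs one sort plus one binary search per ordered id, so it should scale far better to a million elements, though that has not been timed here.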

Reorder a numpy array according to the position of its associated id in a `master_order` array
