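I am looking for a function that builds a new array of values arranged in the order given by ordered_ids, for arrays with about 1 million elements.
Input: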
>>> ids=array(["WYOMING01","TEXAS01","TEXAS02",...])
>>> values=array([12,20,30,...])
>>> ordered_ids=array(["TEXAS01","TEXAS02","ALABAMA01",...])
Output:
ordered [ 20 , 30 , nan , ...]
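Closing summary:
@Dietrich's approach of using a dictionary in a list comprehension is 10 times faster than searching with numpy indexing (numpy.where). I compare the timings of the three approaches in my answer below.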
Answer 0 (score: 1)
You could try:
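import numpy as np

def order_array(ids, values, master_order_ids):
    # Assumes master_order_ids is sorted, as np.searchsorted requires.
    n = len(master_order_ids)
    idx = np.searchsorted(master_order_ids, ids)
    ordered_values = np.zeros(n)
    in_range = idx < n  # drop ids that would be inserted past the end
    ordered_values[idx[in_range]] = values[in_range]
    print("ordered", ordered_values)
    return ordered_values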
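searchsorted gives you the indices at which each id would have to be inserted into master_order_ids to keep that array sorted, so master_order_ids itself must already be sorted. You then drop only the (idx, value) pairs that fall outside the range of master_order_ids.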
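An insertion index alone does not say whether an id is actually present, and the question asks for nan where it is missing. The following is a minimal sketch of one way to handle that with searchsorted, not the answerer's exact method: it sorts ids instead of requiring a sorted master array, then checks each candidate for an exact match. The name reorder_values is hypothetical.

```python
import numpy as np

def reorder_values(ids, values, ordered_ids):
    """Return values rearranged to follow ordered_ids; missing ids give nan."""
    ids = np.asarray(ids)
    values = np.asarray(values, dtype=float)  # float so that nan can be stored
    ordered_ids = np.asarray(ordered_ids)
    if ids.size == 0:
        return np.full(ordered_ids.shape, np.nan)

    sorter = np.argsort(ids)                  # sort ids once
    # Insertion point of each ordered id within the sorted view of ids.
    pos = np.searchsorted(ids, ordered_ids, sorter=sorter)
    pos = np.clip(pos, 0, len(ids) - 1)       # points past the end would be out of range
    candidate = sorter[pos]                   # index into the original ids/values
    found = ids[candidate] == ordered_ids     # an insertion point is not always a match
    return np.where(found, values[candidate], np.nan)


ids = np.array(["WYOMING01", "TEXAS01", "TEXAS02"])
values = np.array([12, 20, 30])
ordered_ids = np.array(["TEXAS01", "TEXAS02", "ALABAMA01"])
print(reorder_values(ids, values, ordered_ids))
```

For the sample data in the question this yields 20 and 30 for the two TEXAS ids and nan for ALABAMA01. It assumes the entries in ids are unique.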
Answer 1 (score: 0)
You could try using a dict() to associate the strings with your numbers. It simplifies the code considerably:
import numpy as np

def order_bydict(ids, values, master_order_ids):
    """ Using a dict to order ``master_order_ids`` """
    dd = dict(zip(ids, values))  # create the dict
    ordered_values = [dd.get(m, 0) for m in master_order_ids]  # get() returns 0 if the key is not found
    return np.asarray(ordered_values)  # return a numpy array instead of a list
It is hard to predict the speed-up without testing on longer arrays (according to %timeit, it was 25% faster on your example).
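The question expects nan rather than 0 for ids that are missing. As a small sketch of the same dictionary idea with a configurable fill value (the name order_by_dict and the missing parameter are hypothetical, chosen here for illustration), using only the standard library:

```python
import math

def order_by_dict(ids, values, ordered_ids, missing=math.nan):
    """Look up each ordered id in a dict built from ids and values."""
    lookup = dict(zip(ids, values))  # one pass to build, O(1) average lookup
    return [lookup.get(key, missing) for key in ordered_ids]


ordered = order_by_dict(
    ["WYOMING01", "TEXAS01", "TEXAS02"],
    [12, 20, 30],
    ["TEXAS01", "TEXAS02", "ALABAMA01"],
)
print(ordered)  # [20, 30, nan]
```

The inputs can be numpy arrays as well, since zip and the comprehension only need to iterate over them; wrapping the result in np.asarray gives a float array because of the nan.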
Answer 2 (score: 0)
import numpy
from numpy import arange
import time

# SETUP
N = 10**4
ids = arange(0, N).astype(str)
values = arange(0, N)
numpy.random.shuffle(ids)
numpy.random.shuffle(values)
ordered_ids = arange(0, N).astype(str)
ordered_values = numpy.empty((N, 1))
ordered_values[:] = numpy.nan

# METHOD 1: boolean-mask search for every id
start = time.perf_counter()
for i in range(len(values)):
    ordered_values[ordered_ids == ids[i]] = values[i]
print("not using dictionary:", time.perf_counter() - start)

# METHOD 2: same search, but iterating over a dictionary
start = time.perf_counter()
d = dict(zip(ids, values))
for k, v in d.items():
    ordered_values[ordered_ids == k] = v
print("using dictionary:", time.perf_counter() - start)

# METHOD 3: @Dietrich's approach in the answer above
start = time.perf_counter()
dd = dict(zip(ids, values))
ordered_values = [dd.get(m, 0) for m in ordered_ids]
print("using dictionary with list comprehension:", time.perf_counter() - start)
not using dictionary: 1.320237 # Method 1
using dictionary: 1.327119 # Method 2
using dictionary with list comprehension: 0.013287 # @Dietrich
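Answer 3 (score: 0)
The following solution, which uses the numpy_indexed package (disclaimer: I am its author), is purely vectorized and is likely to be more efficient than the solutions posted so far: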
import numpy_indexed as npi
idx = npi.indices(ids, ordered_ids, missing='mask')
new_values = values[idx]
new_values[idx.mask] = -1 # or cast to float and set to nan, but you get the idea...