Fastest way to iterate a numpy array and update each element

Date: 2018-05-27 14:01:42

Tags: python numpy

This may seem odd to you, but I happen to have this unusual goal to achieve. The code is as follows:

# A is a numpy array, dtype=int32,
# and each element is actually an ID (int); the ID range may be wide,
# but far fewer distinct values actually occur than the range suggests,
A = np.array([[379621, 552965, 192509],
              [509849, 252786, 710979],
              [379621, 718598, 591201],
              [509849,  35700, 951719]])

# and I need to map these sparse IDs to dense ones;
# my idea is to keep a dict mapping actual_sparse_ID -> dense_ID
M = {}

# so I iterate over this numpy array and check whether each sparse ID already has a dense one
for i in np.nditer(A, op_flags=['readwrite']):
    k = int(i)          # 0-d arrays are unhashable, so key on the plain int
    if k not in M:
        M[k] = len(M)   # sparse ID got a dense one
    i[...] = M[k]       # replace the sparse ID with the dense one

My goal could be achieved with np.unique(A, return_inverse=True); the return_inverse result is exactly what I want.
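
For illustration, this is what it gives on the A above (a quick sketch; in some numpy versions the inverse comes back flattened, hence the reshape). Note the dense IDs follow sorted order rather than first-appearance order, which is fine for my purpose:

import numpy as np

A = np.array([[379621, 552965, 192509],
              [509849, 252786, 710979],
              [379621, 718598, 591201],
              [509849,  35700, 951719]])

uniques, inverse = np.unique(A, return_inverse=True)
dense = inverse.reshape(A.shape)  # dense IDs, one per element of A
print(dense)
# [[3 5 1]
#  [4 2 7]
#  [3 8 6]
#  [4 0 9]]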

However, the numpy array I have is too large to load into memory all at once, so I can't run np.unique over the whole data - that's why I came up with this dict-mapping idea...

Is this the right approach? Are there any possible improvements?

1 answer:

Answer 0 (score: 0):

I'll attempt to provide an alternative approach, using numpy.unique() on sub-arrays. This solution is not fully tested, and I did not do any side-by-side performance evaluation, since your solution did not fully work for me.

Let's say we have an array c which we split into two smaller arrays, a and b. First, let's create some test data:

>>> a = np.array([[1,1,2,3,4],[1,2,6,6,2],[8,0,1,1,4]])
>>> b = np.array([[11,2,-1,12,6],[12,2,6,11,2],[7,0,3,1,3]])
>>> c = np.vstack([a, b])
>>> print(c)
[[ 1  1  2  3  4]
 [ 1  2  6  6  2]
 [ 8  0  1  1  4]
 [11  2 -1 12  6]
 [12  2  6 11  2]
 [ 7  0  3  1  3]]

Here we assume that c is the large array and a and b are the sub-arrays. Of course, one could also build c first and then extract the sub-arrays from it.

The next step is to run numpy.unique() on the two sub-arrays:

>>> ua, ia = np.unique(a, return_inverse=True)
>>> ub, ib = np.unique(b, return_inverse=True)
>>> uc, ic = np.unique(c, return_inverse=True)  # this is for future reference

Now, here is an algorithm for combining the results from the sub-arrays:

def merge_unique(ua, ia, ub, ib):
    # make copies *if* changing inputs is undesirable:
    ua = ua.copy()
    ia = ia.copy()
    ub = ub.copy()
    ib = ib.copy()

    # find differences between unique values in the two arrays:
    diffab = np.setdiff1d(ua, ub, assume_unique=True)
    diffba = np.setdiff1d(ub, ua, assume_unique=True)

    # find indices in ua, ub where to insert "other" unique values:
    ssa = np.searchsorted(ua, diffba)
    ssb = np.searchsorted(ub, diffab)

    # throw away values that are too large:
    ssa = ssa[np.where(ssa < len(ua))]
    ssb = ssb[np.where(ssb < len(ub))]

    # increment indices past previously computed "insert" positions:
    for v in ssa[::-1]:
        ia[ia >= v] += 1
    for v in ssb[::-1]:
        ib[ib >= v] += 1

    # combine results:
    uc = np.union1d(ua, ub)  # or use ssa, ssb, diffba, diffab to update ua, ub
    ic = np.concatenate([ia, ib])

    return uc, ic
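
(Note that the insert positions are applied in descending order: going in ascending order could bump an index past a later insert position and increment it twice.)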

Now, let's run this function on the numpy.unique() results from the sub-arrays, and compare the merged indices and unique values against the reference results uc and ic:

>>> uc2, ic2 = merge_unique(ua, ia, ub, ib)
>>> np.array_equal(uc, uc2)
True
>>> np.array_equal(ic, ic2)
True

Splitting into more than two sub-arrays requires only a little extra work to handle - simply keep accumulating the "unique" values and indices, like this:

uacc, iacc = np.unique(first_subarray, return_inverse=True)
for subarr in other_subarrays:
    u, i = np.unique(subarr, return_inverse=True)
    uacc, iacc = merge_unique(uacc, iacc, u, i)
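
For instance, applied to the array A from the question (an untested sketch, reusing A and merge_unique from above; A[:2] and A[2:] stand in for chunks that would be loaded from disk one at a time):

chunks = [A[:2], A[2:]]  # stand-ins for pieces loaded one at a time

# ravel() keeps the inverse indices flat regardless of numpy version
uacc, iacc = np.unique(chunks[0].ravel(), return_inverse=True)
for chunk in chunks[1:]:
    u, i = np.unique(chunk.ravel(), return_inverse=True)
    uacc, iacc = merge_unique(uacc, iacc, u, i)

# iacc now holds the dense IDs for the flattened A
print(np.array_equal(iacc, np.unique(A.ravel(), return_inverse=True)[1]))  # True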