This may seem strange to you, but I happen to have this odd goal to achieve. The code is as follows:
import numpy as np

# A is a numpy array, dtype=int32;
# each element is actually an ID (int). The ID range may be wide,
# but the values that actually occur are far fewer than the dense range.
A = np.array([[379621, 552965, 192509],
              [509849, 252786, 710979],
              [379621, 718598, 591201],
              [509849,  35700, 951719]], dtype=np.int32)
# I need to map these sparse IDs to dense ones;
# my idea is to have a dict mapping actual_sparse_ID -> dense_ID
M = {}
# so I iterate over the numpy array and check whether each sparse ID
# already has a dense one
for i in np.nditer(A, op_flags=['readwrite']):
    key = int(i)  # 0-d arrays are unhashable, so use the plain int as the dict key
    if key not in M:
        M[key] = len(M)   # this sparse ID gets a new dense one
    i[...] = M[key]       # replace the sparse ID with its dense ID
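For illustration (this output is my addition, shown as printed on a build where int32 is not the default integer type), the loop assigns dense IDs in first-seen order:

>>> A
array([[0, 1, 2],
       [3, 4, 5],
       [0, 6, 7],
       [3, 8, 9]], dtype=int32)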
My goal could be achieved with np.unique(A, return_inverse=True); the return_inverse result is exactly what I want.
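For reference, a minimal sketch (my addition) of that single call on the example A above; note that, unlike the dict loop, the dense IDs here follow the sorted order of the sparse IDs rather than first-seen order:

uniq, dense = np.unique(A, return_inverse=True)
dense = dense.reshape(A.shape)  # older NumPy returns the inverse flattened
# uniq is the sorted array of distinct sparse IDs;
# uniq[dense] reconstructs the original A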
However, the numpy array I have is too large to load into memory in one piece, so I can't run np.unique on the whole data. That's why I came up with this dict-mapping idea...
Is this the right approach? Are there any possible improvements?
Answer 0 (score: 0)
I'll try to provide an alternative approach by using numpy.unique() on sub-arrays. This solution is not fully tested. I also haven't done any side-by-side performance evaluation, since your solution does not fully work for me.
Let's say we have an array c, which we split into two smaller arrays. Let's create some test data, for example:
>>> a = np.array([[1,1,2,3,4],[1,2,6,6,2],[8,0,1,1,4]])
>>> b = np.array([[11,2,-1,12,6],[12,2,6,11,2],[7,0,3,1,3]])
>>> c = np.vstack([a, b])
>>> print(c)
[[ 1  1  2  3  4]
 [ 1  2  6  6  2]
 [ 8  0  1  1  4]
 [11  2 -1 12  6]
 [12  2  6 11  2]
 [ 7  0  3  1  3]]

Here we assume that c is the big array and a and b are the sub-arrays. Of course, one could also build c first and then extract the sub-arrays.

The next step is to run numpy.unique() on both sub-arrays:
>>> ua, ia = np.unique(a, return_inverse=True)
>>> ub, ib = np.unique(b, return_inverse=True)
>>> uc, ic = np.unique(c, return_inverse=True)  # this is for future reference

Now, here is an algorithm for combining the results from the sub-arrays:
的结果运行此函数,然后将合并的索引和唯一值与参考结果def merge_unique(ua, ia, ub, ib):
# make copies *if* changing inputs is undesirable:
ua = ua.copy()
ia = ia.copy()
ub = ub.copy()
ib = ib.copy()
# find differences between unique values in the two arrays:
diffab = np.setdiff1d(ua, ub, assume_unique=True)
diffba = np.setdiff1d(ub, ua, assume_unique=True)
# find indices in ua, ub where to insert "other" unique values:
ssa = np.searchsorted(ua, diffba)
ssb = np.searchsorted(ub, diffab)
# throw away values that are too large:
ssa = ssa[np.where(ssa < len(ua))]
ssb = ssb[np.where(ssb < len(ub))]
# increment indices past previously computed "insert" positions:
for v in ssa[::-1]:
ia[ia >= v] += 1
for v in ssb[::-1]:
ib[ib >= v] += 1
# combine results:
uc = np.union1d(ua, ub) # or use ssa, ssb, diffba, diffab to update ua, ub
ic = np.concatenate([ia, ib])
return uc, ic
和numpy.unique()
进行比较:
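The original comparison code did not survive in this copy; a check along these lines (my reconstruction) confirms that the merged result matches the reference:

>>> uc2, ic2 = merge_unique(ua, ia, ub, ib)
>>> np.array_equal(uc, uc2)
True
>>> np.array_equal(ic.ravel(), ic2)  # ravel(): some NumPy versions shape the inverse like the input
True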
Splitting into more than two sub-arrays requires only a little extra work to handle - just keep accumulating the "unique" values and indices, like this:
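The closing example did not survive either; a minimal sketch of the accumulation loop described above (the loop structure is my assumption):

chunks = [a, b]  # in practice: sub-arrays loaded one at a time
uacc, iacc = np.unique(chunks[0], return_inverse=True)
for arr in chunks[1:]:
    u, i = np.unique(arr, return_inverse=True)
    # fold each new chunk's unique values and indices into the accumulated result
    uacc, iacc = merge_unique(uacc, iacc, u, i)
# uacc holds all unique IDs seen so far; iacc maps every element of the
# flattened, concatenated chunks to its dense ID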