Question

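I have a numpy dataset with x vectors and y vectors. The y vector takes only two values, +1 or -1 (or 0 or 1), since it is a binary-valued function. I know I could loop through the dataset and, one by one, map each +1 to 1 and each -1 to 0. However, I was hoping that, given the whole vector y = [N x 1], I could map it in one step to a vector y = [N x 2], since the dataset could be very large and I'd like to do this as quickly as possible (I also don't want to keep two copies of my dataset).

Is there a vectorized way to do this transformation quickly in Python?

For reference, here is the loop code: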

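import numpy as np

def transform_data_to_one_hot(X, Y):
    # X is unused; it is kept only to preserve the original signature.
    labels = np.ravel(Y)          # accept shape (N,) or (N, 1)
    N = labels.shape[0]
    Y_new = np.zeros((N, 2))      # shape must be passed as a tuple
    for i in range(N):
        if labels[i] == -1:       # -1 -> [1, 0]
            Y_new[i] = np.array([1, 0])
        else:                     # +1 -> [0, 1]
            Y_new[i] = np.array([0, 1])
    return Y_new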

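As an example, take the parity function with Rademacher variables (i.e. +1 and -1 instead of 0 and 1). In that case the parity function is just the product function: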

>>> X = np.array([[-1,-1],[-1,1],[1,-1],[1,1]])
>>> X
array([[-1, -1],
       [-1,  1],
       [ 1, -1],
       [ 1,  1]])

>>> Y = np.reshape(np.prod(X,axis=1),[4,1])
>>> Y
array([[ 1],
       [-1],
       [-1],
       [ 1]])

The one-hot version of the Y vector should be:

>>> Y
array([[0, 1],
       [1, 0],
       [1, 0],
       [0, 1]])

Answer 1

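A few simple observations help with efficiency:

Preallocate the result instead of using concatenate
empty is faster than zeros if you are going to overwrite those zeros anyway
Use the out parameter to avoid temporaries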

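def sign_to_one_hot(x, dtype=np.float64):
    # Preallocate; every element is overwritten below, so empty() is safe.
    out = np.empty(x.shape + (2,), dtype=dtype)
    # Note: column 0 flags +1 here, the reverse of the layout in the question.
    plus_one = out[..., 0]   # view into out: 1 where x == +1
    minus_one = out[..., 1]  # view into out: 1 where x != +1
    # Write straight into the views so no temporaries are created.
    np.equal(x, 1, out=plus_one)
    np.subtract(1, plus_one, out=minus_one)
    return out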

Choose your dtype carefully, because picking the wrong one and converting afterwards will cost you a copy.
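To make the dtype remark concrete, here is a minimal sketch. The sample array `y` and the variable names `direct`, `as_float` and `converted` are made up for illustration. It repeats the function above, builds the result directly in the desired dtype, and then shows that converting after the fact with astype allocates a second array.

```python
import numpy as np

def sign_to_one_hot(x, dtype=np.float64):
    out = np.empty(x.shape + (2,), dtype=dtype)
    plus_one = out[..., 0]
    minus_one = out[..., 1]
    np.equal(x, 1, out=plus_one)
    np.subtract(1, plus_one, out=minus_one)
    return out

y = np.array([1, -1, -1, 1])  # hypothetical sample labels

# Ask for the final dtype up front: one allocation, no later conversion.
direct = sign_to_one_hot(y, dtype=np.int8)

# Converting afterwards creates a second array, i.e. a copy.
as_float = sign_to_one_hot(y)
converted = as_float.astype(np.int8)
print(np.shares_memory(as_float, converted))  # False
```

Both routes give the same values; only the first avoids the extra allocation.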

Answer 2

Here is an initialization-based one -

def initialization_based(y):
    out = np.zeros((len(y),2),dtype=int)
    out[np.arange(out.shape[0]), (y==1).astype(int)] = 1
    return out

Sample run -

In [244]: y
Out[244]: array([ 1, -1,  1,  1, -1,  1, -1,  1])

In [245]: initialization_based(y)
Out[245]: 
array([[0, 1],
       [1, 0],
       [0, 1],
       [0, 1],
       [1, 0],
       [0, 1],
       [1, 0],
       [0, 1]])

Other ways of using the initialization approach -

def initialization_based_v2(y):
    out = np.zeros((len(y),2),dtype=int)
    out[np.arange(out.shape[0]), (y+1)//2] = 1
    return out

def initialization_based_v3(y):
    yc = y.copy()
    yc[yc==-1] = 0
    out = np.zeros((len(y),2),dtype=int)
    out[np.arange(out.shape[0]), yc] = 1
    return out

These two additions differ only in how the column indices are computed. Version 2 computes them as (y+1)//2, whereas version 3 computes them with: yc = y.copy(); yc[yc==-1] = 0.
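As a quick check of that claim, the following sketch computes the column indices in all three ways on a small made-up label array (`y` and the `idx_*` names are illustrative only) and confirms they agree.

```python
import numpy as np

y = np.array([1, -1, 1, 1, -1])  # hypothetical sample labels

idx_v1 = (y == 1).astype(int)    # boolean comparison cast to 0/1
idx_v2 = (y + 1) // 2            # arithmetic: -1 -> 0, +1 -> 1
idx_v3 = y.copy()
idx_v3[idx_v3 == -1] = 0         # overwrite -1 with 0 in a copy

same = np.array_equal(idx_v1, idx_v2) and np.array_equal(idx_v2, idx_v3)
print(idx_v2, same)  # [1 0 1 1 0] True
```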

Another one, very close to @Eric's, but using a boolean array -

def initialization_based_v4(y):
    out = np.empty((len(y),2),dtype=int)
    mask = y == 1
    # Note: column 0 flags +1 here, unlike the versions above (column 1).
    out[:,0] = mask
    out[:,1] = ~mask
    return out
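One detail worth checking when mixing these variants is which column marks +1. The sketch below, run on a made-up array, compares the first version with version 4. As written, they appear to use opposite column orders, so one is the column-reversed form of the other.

```python
import numpy as np

def initialization_based(y):
    out = np.zeros((len(y), 2), dtype=int)
    out[np.arange(out.shape[0]), (y == 1).astype(int)] = 1
    return out

def initialization_based_v4(y):
    out = np.empty((len(y), 2), dtype=int)
    mask = y == 1
    out[:, 0] = mask
    out[:, 1] = ~mask
    return out

y = np.array([1, -1, 1])          # hypothetical sample labels
a = initialization_based(y)       # +1 is flagged in column 1
b = initialization_based_v4(y)    # +1 is flagged in column 0

# Reversing the columns of one result reproduces the other.
flipped_match = np.array_equal(a, b[:, ::-1])
print(flipped_match)  # True
```

If the layout from the question is required (+1 as [0, 1]), swapping the two column assignments in version 4 would be one way to get it.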

Runtime test -

In [320]: y = 2*np.random.randint(0,2,(1000000))-1

In [321]: %timeit sign_to_one_hot(y, dtype=int)
     ...: %timeit initialization_based(y)
     ...: %timeit initialization_based_v2(y)
     ...: %timeit initialization_based_v3(y)
     ...: %timeit initialization_based_v4(y)
     ...: 
100 loops, best of 3: 3.16 ms per loop
100 loops, best of 3: 8.39 ms per loop
10 loops, best of 3: 27.2 ms per loop
100 loops, best of 3: 13.8 ms per loop
100 loops, best of 3: 3.11 ms per loop

In [322]: from sklearn.preprocessing import OneHotEncoder

In [323]: enc = OneHotEncoder(sparse=False)

In [324]: %timeit enc.fit_transform(np.where(y>=0, y, 0))
10 loops, best of 3: 77.3 ms per loop
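The timings above were taken with IPython's %timeit. For anyone who wants to repeat the comparison in a plain script, here is a rough sketch using the standard-library timeit module. The helper `best_ms`, the array size and the repeat counts are arbitrary choices for illustration, and the numbers it prints will depend on the machine and NumPy version.

```python
import timeit
import numpy as np

def initialization_based(y):
    out = np.zeros((len(y), 2), dtype=int)
    out[np.arange(out.shape[0]), (y == 1).astype(int)] = 1
    return out

def initialization_based_v4(y):
    out = np.empty((len(y), 2), dtype=int)
    mask = y == 1
    out[:, 0] = mask
    out[:, 1] = ~mask
    return out

rng = np.random.default_rng(0)
y = 2 * rng.integers(0, 2, size=100_000) - 1  # random +1/-1 labels

def best_ms(func, repeat=3, number=10):
    """Best average time per call, in milliseconds (hypothetical helper)."""
    times = timeit.repeat(lambda: func(y), repeat=repeat, number=number)
    return min(times) / number * 1e3

for f in (initialization_based, initialization_based_v4):
    print(f.__name__, round(best_ms(f), 3), "ms")
```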

Answer 3

You can also use sklearn.preprocessing.OneHotEncoder.

Note: it does not accept negative numbers, so we have to replace them.
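The replacement step on its own can be sketched as follows, using the Y array from the question (the name `Y_nonneg` is made up here). np.where keeps the non-negative entries and substitutes 0 for the negative ones, returning a new array and leaving Y untouched.

```python
import numpy as np

Y = np.array([[1], [-1], [-1], [1]])  # labels from the question

# Keep entries that are >= 0; replace negative entries with 0.
Y_nonneg = np.where(Y >= 0, Y, 0)
print(Y_nonneg.ravel())  # [1 0 0 1]
```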

Demo:

from sklearn.preprocessing import OneHotEncoder

# By default it returns a sparse matrix, which can be very useful for huge data sets.
# (Newer scikit-learn releases name this parameter sparse_output.)
enc = OneHotEncoder(sparse=False)

rslt = enc.fit_transform(np.where(Y>=0, Y, 0))

Result:

In [140]: rslt
Out[140]:
array([[ 0.,  1.],
       [ 1.,  0.],
       [ 1.,  0.],
       [ 0.,  1.]])

Source array:

In [141]: Y
Out[141]:
array([[ 1],
       [-1],
       [-1],
       [ 1]])

Pandas solution:

In [148]: pd.get_dummies(Y.ravel())
Out[148]:
   -1   1
0   0   1
1   1   0
2   1   0
3   0   1

How can I convert a dataset with two labels, +1 and -1, into a one-hot vector representation in a vectorized way in Python?
