I wrote a function to normalise a set of features for a machine learning algorithm. It takes a rectangular 2D numpy array features and returns its regularised version reg_features. (I am using the Boston housing price data from Scikit-learn for training.) The exact code:
import tensorflow as tf
import numpy as np
from sklearn.datasets import load_boston
from pprint import pprint

def regularise(features):
    # Regularised features:
    reg_features = np.zeros(features.shape)
    for x in range(len(features)):
        for y in range(len(features[x])):
            reg_features[x][y] = (features[x][y] - np.mean(features[:, y])) / np.std(features[:, y])
    return reg_features
# Get the data
total_features, total_prices = load_boston(True)
# Keep 300 samples for training
train_features = regularise(total_features[:300]) # Works OK
train_prices = total_prices[:300]
# Keep 100 samples for validation
valid_features = regularise(total_features[300:400]) # Works OK
valid_prices = total_prices[300:400]
# Keep remaining samples as test set
test_features = regularise(total_features[400:]) # Does not work
test_prices = total_prices[400:]
Note that I only get this error on the last call to regularise(), the one on total_features[400:], i.e. regularise(total_features[400:]):

/Users/RohanSaxena/Documents/projects/sdc/tensor/reg.py:11: RuntimeWarning: invalid value encountered in double_scalars
  reg_features[x][y] = (features[x][y] - np.mean(features[:, y])) / np.std(features[:, y])
The rest of this code is concerned with that last call only, i.e. with total_features[400:]. To check whether one of the standard deviations is zero, I do:

for y in range(len(features[0])):
    if np.std(features[:, y]) == 0.:
        print(np.std(features[:, y]))

This prints all zeros, i.e.:

0.0
0.0
...
0.0

a total of features[0].size times. This means that the standard deviation of every column in features is zero.
Now this looks bizarre. So I print every single standard deviation, just to be sure:

for y in range(len(features[0])):
    print(np.std(features[:, y]))

and I get all non-zero values:

10.9976293017
23.3483275632
6.63216140033
....
8.00329244499

How is this possible? Just before, prefixed with that if condition, this same code gave me all zeros, and now it gives non-zero values! This makes no sense to me. Any help is appreciated.
Answer 0 (Score: 1)
It is the subset of the data, total_features[400:], that causes the problem. If you look at that data, you will see that the columns total_features[400:, 1] and total_features[400:, 3] are all 0. That breaks your code, because the mean and the standard deviation of those columns are both 0, so you end up computing 0/0.
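As a quick sanity check, here is a minimal sketch (assuming the same load_boston data as in the question) that lists which columns of that subset are constant:

import numpy as np
from sklearn.datasets import load_boston

total_features, total_prices = load_boston(True)
subset = total_features[400:]
# Columns whose standard deviation is 0 are constant and produce 0/0 in regularise;
# per the explanation above, this should report columns 1 and 3
print(np.where(subset.std(axis=0) == 0.)[0])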
You could use sklearn.preprocessing.scale instead of writing your own regularisation function. That function handles a constant column by returning a column that is all 0.

You can easily verify that scale performs the same calculation as regularise:
In [68]: test
Out[68]:
array([[ 15.,   1.,   0.],
       [  3.,   4.,   5.],
       [  6.,   7.,   8.],
       [  9.,  10.,  11.],
       [ 12.,  13.,   1.]])
In [69]: regularise(test)
Out[69]:
array([[ 1.41421356, -1.41421356, -1.20560706],
[-1.41421356, -0.70710678, 0. ],
[-0.70710678, 0. , 0.72336423],
[ 0. , 0.70710678, 1.44672847],
[ 0.70710678, 1.41421356, -0.96448564]])
In [70]: from sklearn.preprocessing import scale
In [71]: scale(test)
Out[71]:
array([[ 1.41421356, -1.41421356, -1.20560706],
[-1.41421356, -0.70710678, 0. ],
[-0.70710678, 0. , 0.72336423],
[ 0. , 0.70710678, 1.44672847],
[ 0.70710678, 1.41421356, -0.96448564]])
The following shows how the two functions handle a column of zeros:
In [72]: test[:,2] = 0
In [73]: test
Out[73]:
array([[ 15.,   1.,   0.],
       [  3.,   4.,   0.],
       [  6.,   7.,   0.],
       [  9.,  10.,   0.],
       [ 12.,  13.,   0.]])
In [74]: regularise(test)
/Users/warren/miniconda3/bin/ipython:9: RuntimeWarning: invalid value encountered in double_scalars
Out[74]:
array([[ 1.41421356, -1.41421356, nan],
[-1.41421356, -0.70710678, nan],
[-0.70710678, 0. , nan],
[ 0. , 0.70710678, nan],
[ 0.70710678, 1.41421356, nan]])
In [75]: scale(test)
Out[75]:
array([[ 1.41421356, -1.41421356, 0. ],
[-1.41421356, -0.70710678, 0. ],
[-0.70710678, 0. , 0. ],
[ 0. , 0.70710678, 0. ],
[ 0.70710678, 1.41421356, 0. ]])
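If you prefer to keep a hand-written function, one possible sketch (the name regularise_safe and the zero-std guard are illustrative, not part of the original code) that mimics scale's handling of constant columns:

import numpy as np

def regularise_safe(features):
    # Column-wise means and standard deviations
    means = features.mean(axis=0)
    stds = features.std(axis=0)
    # Treat constant columns (std == 0) as having std 1, so they come out as all
    # zeros instead of nan from 0/0
    stds = np.where(stds == 0., 1., stds)
    return (features - means) / stds

Applied to the test array above, the constant third column comes back as all zeros, matching scale.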
Answer 1 (Score: 0)
Usually when this happens, the first guess is that you are dividing the numerator by an int (rather than a float) that is larger than it, so the result comes out as 0. That is not what is happening here, though.

Sometimes the division is not carried out the way you expect (term by term) but as a vector operation. However, that is not the case here either.

The problem here is how you are referencing the data frame: reg_features[x][y]. When working with a data frame and assigning values, you want to use loc. You can read more about it here: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.loc.html
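For illustration only, a minimal sketch of label-based assignment with .loc, assuming the features were held in a pandas DataFrame (in the question they are actually a plain numpy array, so this suggestion may not apply directly):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((3, 2)), columns=["a", "b"])
# .loc selects by row and column label and supports assignment
df.loc[0, "a"] = 1.5
print(df)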