I wrote a function to normalise a set of features for a machine learning algorithm. It takes a rectangular 2D numpy array features and returns its regularised version reg_features. (I am using the Boston housing price data from Scikit-learn for training.) The exact code:
import tensorflow as tf
import numpy as np
from sklearn.datasets import load_boston
from pprint import pprint

def regularise(features):
    # Regularised features:
    reg_features = np.zeros(features.shape)
    for x in range(len(features)):
        for y in range(len(features[x])):
            reg_features[x][y] = (features[x][y] - np.mean(features[:, y])) / np.std(features[:, y])
    return reg_features
# Get the data
total_features, total_prices = load_boston(True)
# Keep 300 samples for training
train_features = regularise(total_features[:300]) # Works OK
train_prices = total_prices[:300]
# Keep 100 samples for validation
valid_features = regularise(total_features[300:400]) # Works OK
valid_prices = total_prices[300:400]
# Keep remaining samples as test set
test_features = regularise(total_features[400:]) # Does not work
test_prices = total_prices[400:]
Note that I only get this error on the last call to regularise(), the one on total_features[400:], i.e. regularise(total_features[400:]):

/Users/RohanSaxena/Documents/projects/sdc/tensor/reg.py:11: RuntimeWarning: invalid value encountered in double_scalars
  reg_features[x][y] = (features[x][y] - np.mean(features[:, y])) / np.std(features[:, y])
The rest of this code is concerned with that last call only, i.e. with total_features[400:]. To check whether one of the standard deviations is zero, I do:

for y in range(len(features[0])):
    if np.std(features[:, y]) == 0.:
        print(np.std(features[:, y]))

This prints all zeros, i.e.:

0.0
0.0
...
0.0

a total of features[0].size times. This means that the standard deviation of every column in features is zero.
Now this looks bizarre. So I print every single standard deviation, just to be sure:

for y in range(len(features[0])):
    print(np.std(features[:, y]))

and I get all non-zero values:

10.9976293017
23.3483275632
6.63216140033
....
8.00329244499

How is this possible? Just before, prefixed with that if condition, this same code gave me all zeros, and now it gives non-zero values! This makes no sense to me. Any help is appreciated.
Answer 0 (Score: 1)
It is the subset of the data, total_features[400:], that causes the problem. If you look at that data, you will see that the columns total_features[400:, 1] and total_features[400:, 3] are all 0. That breaks your code, because the mean and the standard deviation of those columns are both 0, so you end up computing 0/0.
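As a quick sanity check, here is a minimal sketch (assuming the same load_boston data as in the question) that lists which columns of that subset are constant:

import numpy as np
from sklearn.datasets import load_boston

total_features, total_prices = load_boston(True)
subset = total_features[400:]
# Columns whose standard deviation is 0 are constant and produce 0/0 in regularise;
# per the explanation above, this should report columns 1 and 3
print(np.where(subset.std(axis=0) == 0.)[0])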
You could use sklearn.preprocessing.scale instead of writing your own regularisation function. That function handles a constant column by returning a column that is all 0.

You can easily verify that scale performs the same calculation as regularise:
In [68]: test
Out[68]:
array([[ 15.,   1.,   0.],
       [  3.,   4.,   5.],
       [  6.,   7.,   8.],
       [  9.,  10.,  11.],
       [ 12.,  13.,   1.]])
In [69]: regularise(test)
Out[69]:
array([[ 1.41421356, -1.41421356, -1.20560706],
[-1.41421356, -0.70710678, 0. ],
[-0.70710678, 0. , 0.72336423],
[ 0. , 0.70710678, 1.44672847],
[ 0.70710678, 1.41421356, -0.96448564]])
In [70]: from sklearn.preprocessing import scale
In [71]: scale(test)
Out[71]:
array([[ 1.41421356, -1.41421356, -1.20560706],
[-1.41421356, -0.70710678, 0. ],
[-0.70710678, 0. , 0.72336423],
[ 0. , 0.70710678, 1.44672847],
[ 0.70710678, 1.41421356, -0.96448564]])
The following shows how the two functions handle a column of zeros:
In [72]: test[:,2] = 0
In [73]: test
Out[73]:
array([[ 15.,   1.,   0.],
       [  3.,   4.,   0.],
       [  6.,   7.,   0.],
       [  9.,  10.,   0.],
       [ 12.,  13.,   0.]])
In [74]: regularise(test)
/Users/warren/miniconda3/bin/ipython:9: RuntimeWarning: invalid value encountered in double_scalars
Out[74]:
array([[ 1.41421356, -1.41421356, nan],
[-1.41421356, -0.70710678, nan],
[-0.70710678, 0. , nan],
[ 0. , 0.70710678, nan],
[ 0.70710678, 1.41421356, nan]])
In [75]: scale(test)
Out[75]:
array([[ 1.41421356, -1.41421356, 0. ],
[-1.41421356, -0.70710678, 0. ],
[-0.70710678, 0. , 0. ],
[ 0. , 0.70710678, 0. ],
[ 0.70710678, 1.41421356, 0. ]])
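If you prefer to keep a hand-written function, one possible sketch (the name regularise_safe and the zero-std guard are illustrative, not part of the original code) that mimics scale's handling of constant columns:

import numpy as np

def regularise_safe(features):
    # Column-wise means and standard deviations
    means = features.mean(axis=0)
    stds = features.std(axis=0)
    # Treat constant columns (std == 0) as having std 1, so they come out as all
    # zeros instead of nan from 0/0
    stds = np.where(stds == 0., 1., stds)
    return (features - means) / stds

Applied to the test array above, the constant third column comes back as all zeros, matching scale.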
Answer 1 (Score: 0)
Usually when this happens, the first guess is that you are dividing the numerator by an int (rather than a float) that is larger than it, so the result comes out as 0. That is not what is happening here, though.

Sometimes the division is not carried out the way you expect (term by term) but as a vector operation. However, that is not the case here either.

The problem here is how you are referencing the data frame: reg_features[x][y]. When working with a data frame and assigning values, you want to use loc. You can read more about it here: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.loc.html
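For illustration only, a minimal sketch of label-based assignment with .loc, assuming the features were held in a pandas DataFrame (in the question they are actually a plain numpy array, so this suggestion may not apply directly):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((3, 2)), columns=["a", "b"])
# .loc selects by row and column label and supports assignment
df.loc[0, "a"] = 1.5
print(df)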