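I am trying to run a linear regression with SageMaker. My matrix contains some null (NaN) values, so the Linear Learner algorithm fails. Is there anything I can do to make the algorithm handle null values?
Matrix data below: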
array([[0.0000000e+00, 0.0000000e+00, 0.0000000e+00, ..., 0.0000000e+00,
1.7883900e+05, 9.6533337e+00],
[0.0000000e+00, 0.0000000e+00, 0.0000000e+00, ..., 0.0000000e+00,
4.9014000e+04, 1.3181389e+01],
[0.0000000e+00, 0.0000000e+00, 0.0000000e+00, ..., 0.0000000e+00,
1.2483900e+05, 1.1561944e+01],
...,
[0.0000000e+00, 0.0000000e+00, 0.0000000e+00, ..., 0.0000000e+00,
4.7306000e+04, 1.8681944e+01],
[0.0000000e+00, 0.0000000e+00, 0.0000000e+00, ..., 0.0000000e+00,
1.3530000e+04, 1.1964444e+01],
[0.0000000e+00, nan, nan, ..., 0.0000000e+00,
8.4100000e+03, 1.8925833e+01]], dtype=float32)
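import boto3
import sagemaker
from sagemaker import get_execution_role

role = get_execution_role()

# containers, output_location, sess and s3_train_data are not defined in this
# snippet; they are assumed to come from earlier in the notebook.
linear = sagemaker.estimator.Estimator(containers[boto3.Session().region_name],
                                       role,
                                       train_instance_count=1,
                                       train_instance_type='ml.c4.xlarge',
                                       output_path=output_location,
                                       sagemaker_session=sess)

# Model parameters
linear.set_hyperparameters(feature_dim=25,
                           predictor_type='regressor',
                           normalize_data=False)

# Start the training job on the 'train' channel
linear.fit({'train': s3_train_data})

# Deploy the trained model to a real-time endpoint
linear_predictor = linear.deploy(initial_instance_count=1,
                                 instance_type='ml.m4.xlarge')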
Output:
2019-08-16 12:40:21 Starting - Starting the training job...
2019-08-16 12:40:24 Starting - Launching requested ML instances......
2019-08-16 12:41:23 Starting - Preparing the instances for training......
2019-08-16 12:42:34 Downloading - Downloading input data...
2019-08-16 12:43:15 Training - Training image download completed. Training in progress.
2019-08-16 12:43:15 Uploading - Uploading generated training model
2019-08-16 12:43:15 Failed - Training job failed
UnexpectedStatusException: Error for Training job linear-learner-2019-08-16-12-40-21-312: Failed. Reason: ClientError: Unable to read data channel 'train'. Found missing (NaN) values. Please remove any missing (NaN) values in the input data. (caused by MXNetError)
Caused by: [12:43:11] /opt/brazil-pkg-cache/packages/AIAlgorithmsCppLibs/AIAlgorithmsCppLibs-2.0.1649.0/AL2012/generic-flavor/src/src/aialgs/io/iterator_base.cpp:103: (Input Error) (NaN) NaN value encountered in the dataset.
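Answer 0 (score: 0)
Unfortunately, you will have to clean out any rows that contain missing values before SageMaker can work with the data.
Since your data looks like continuous values, your best option is to remove those rows.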
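A minimal sketch of removing those rows with NumPy before uploading the training data. The small `data` array here is a hypothetical stand-in for the matrix in the question, not the real dataset:

```python
import numpy as np

# Hypothetical stand-in for the question's matrix (float32, some NaN entries)
data = np.array([
    [0.0, 0.0, 178839.0, 9.65],
    [0.0, np.nan, 8410.0, 18.93],
    [0.0, 0.0, 49014.0, 13.18],
], dtype=np.float32)

# True for every row that contains at least one NaN
row_has_nan = np.isnan(data).any(axis=1)

# Keep only the complete rows
clean = data[~row_has_nan]

print(f"dropped {row_has_nan.sum()} of {len(data)} rows")
```

Dropping rows leaves the number of columns unchanged, so `feature_dim` stays the same. It is worth checking how many rows are lost before training on the cleaned matrix.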
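If the null values are meaningful in your data, you could try breaking those columns down into discrete values so that the null value becomes one of the categories. Whether this works depends on the data; it is not recommended if those columns vary widely.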
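One way to read this suggestion is to bin the column and one-hot encode the bins, with an extra indicator for missing values. The sketch below assumes that interpretation; the helper name `bin_with_missing` and the bin edges are hypothetical and would need to be chosen for the actual data:

```python
import numpy as np

def bin_with_missing(column, edges):
    """One-hot encode a column into bins, plus one extra slot for NaN.

    Hypothetical helper: `edges` are bin boundaries you choose yourself.
    """
    column = np.asarray(column, dtype=np.float32)
    missing = np.isnan(column)

    # Bin the non-missing values; NaN entries get a placeholder for now
    idx = np.digitize(np.where(missing, 0.0, column), edges)

    # Regular bins are 0..len(edges); the last index is reserved for NaN
    n_bins = len(edges) + 1
    idx[missing] = n_bins

    one_hot = np.zeros((column.size, n_bins + 1), dtype=np.float32)
    one_hot[np.arange(column.size), idx] = 1.0
    return one_hot

encoded = bin_with_missing([1.0, np.nan, 5.0, 12.0], edges=[3.0, 10.0])
```

Each encoded column replaces one original column with several indicator columns, so the `feature_dim` hyperparameter has to be updated to match the new column count.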