I'm running into a problem with a tree-based algorithm in Python. Here is my train function:
import os
import pickle
import sys
import traceback

import pandas as pd
from sklearn import tree

# training_path, channel_name, model_path and output_path are defined
# elsewhere in the script (not shown here).

# The function to execute the training.
def train():
    print('Starting the training.')
    try:
        # Take the set of files and read them all into a single pandas dataframe
        input_files = [os.path.join(training_path, file) for file in os.listdir(training_path)]
        if len(input_files) == 0:
            raise ValueError(('There are no files in {}.\n' +
                              'This usually indicates that the channel ({}) was incorrectly specified,\n' +
                              'the data specification in S3 was incorrectly specified or the role specified\n' +
                              'does not have permission to access the data.').format(training_path, channel_name))
        raw_data = [pd.read_csv(file, header=0) for file in input_files]
        train_data = pd.concat(raw_data)

        # Labels are in the first column (.iloc replaces .ix, which newer pandas no longer provides)
        train_y = train_data.iloc[:, 0]
        train_X = train_data.iloc[:, 1:]

        # Now use scikit-learn's decision tree classifier to train the model.
        clf = tree.DecisionTreeClassifier(max_leaf_nodes=50)
        clf = clf.fit(train_X, train_y)

        # Save the model
        with open(os.path.join(model_path, 'tree-model.pkl'), 'wb') as out:
            pickle.dump(clf, out, protocol=0)
        print('Training complete.')
    except Exception as e:
        # Write out an error file. This will be returned as the failureReason in the
        # DescribeTrainingJob result.
        trc = traceback.format_exc()
        with open(os.path.join(output_path, 'failure'), 'w') as s:
            s.write('Exception during training: ' + str(e) + '\n' + trc)
        # Printing this causes the exception to be in the training job logs, as well.
        print('Exception during training: ' + str(e) + '\n' + trc, file=sys.stderr)
        # A non-zero exit code causes the training job to be marked as Failed.
        sys.exit(255)

if __name__ == '__main__':
    train()

    # A zero exit code causes the job to be marked as Succeeded.
    sys.exit(0)
I trained on this dataset:
   setosa  5.1  3.5  1.4  0.2
0  setosa  4.9  3.0  1.4  0.2
1  setosa  4.7  3.2  1.3  0.2
2  setosa  4.6  3.1  1.5  0.2
3  setosa  5.0  3.6  1.4  0.2
4  setosa  5.4  3.9  1.7  0.4
5  setosa  4.6  3.4  1.4  0.3
But when I try to predict values using test data:
import itertools
a = [20*i for i in range(3)]
b = [10+i for i in range(10)]
indices = [i+j for i,j in itertools.product(a,b)]
test_data=shape.iloc[indices[:-1]]
test_X=test_data.iloc[:,1:]
test_y=test_data.iloc[:,0]
test_X.values
array([[4.8, 3.4, 1.6, 0.2],
[4.8, 3. , 1.4, 0.1],
[4.3, 3. , 1.1, 0.1],
[5.8, 4. , 1.2, 0.2],
[5.7, 4.4, 1.5, 0.4],
[5.4, 3.9, 1.3, 0.4],
[5.1, 3.5, 1.4, 0.3],
[5.7, 3.8, 1.7, 0.3],
[5.1, 3.8, 1.5, 0.3],
[5.4, 3.4, 1.7, 0.2],
[5.4, 3.4, 1.5, 0.4],
[5.2, 4.1, 1.5, 0.1],
[5.5, 4.2, 1.4, 0.2],
[4.9, 3.1, 1.5, 0.2],
[5. , 3.2, 1.2, 0.2],
[5.5, 3.5, 1.3, 0.2],
[4.9, 3.6, 1.4, 0.1],
[4.4, 3. , 1.3, 0.2],
[5.1, 3.4, 1.5, 0.2],
[5. , 3.5, 1.3, 0.3],
[6.4, 3.2, 4.5, 1.5],
[6.9, 3.1, 4.9, 1.5],
[5.5, 2.3, 4. , 1.3],
[6.5, 2.8, 4.6, 1.5],
[5.7, 2.8, 4.5, 1.3],
[6.3, 3.3, 4.7, 1.6],
[4.9, 2.4, 3.3, 1. ],
[6.6, 2.9, 4.6, 1.3],
[5.2, 2.7, 3.9, 1.4]])
I get this error message:
ValueError: Number of features of the model must match the input. Model n_features is 4 and input n_features is 3
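This is strange: I can see 4 features in both the test data and the training data, so I don't understand why only 3 are being recognized.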
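The column counts can be checked directly. The following is only a minimal sketch, not the actual training or serving code: it assumes a small hypothetical DataFrame built from three rows of the dataset above, and the helper names `split_label_and_features` and `n_features` are made up for illustration. It uses the same label-first split as `train()`.

```python
import numpy as np
import pandas as pd


def split_label_and_features(frame):
    """Split a frame whose first column is the label (the layout used in train())."""
    return frame.iloc[:, 0], frame.iloc[:, 1:]


def n_features(X):
    """Return the number of feature columns in a 2-D array-like."""
    return np.asarray(X).shape[1]


# Hypothetical stand-in for the real data: three rows copied from the
# dataset shown earlier, built in memory instead of read from a file.
sample = pd.DataFrame([
    ["setosa", 5.1, 3.5, 1.4, 0.2],
    ["setosa", 4.9, 3.0, 1.4, 0.2],
    ["setosa", 4.7, 3.2, 1.3, 0.2],
])

y, X = split_label_and_features(sample)
print("feature columns:", n_features(X))
print("shape of the array that would be passed to predict:", X.values.shape)
```

Running the same check on the array that actually reaches the model's `predict` call would show whether a column is being dropped somewhere between `test_X.values` and the model.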
Can you help me solve this problem?
Thanks.