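This is my code in a Jupyter notebook. I'm confused about why I get a "bad input shape" error. The failing line from my code is given below. The code opens the dataset file and splits the data by the possible output classes: above 50K, or less than or equal to 50K. This dataset is slightly different in that each data point is a mix of numbers and strings.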
with open(input_file, 'r') as f:
    for line in f.readlines():
        if '?' in line:
            continue
        data = line[:-1].split(', ')
        if data[-1] == '<=50K' and count_lessthan50k < num_images_threshold:
            X.append(data)
            count_lessthan50k = count_lessthan50k + 1
        elif data[-1] == '>50K' and count_morethan50k < num_images_threshold:
            X.append(data)
            count_morethan50k = count_morethan50k + 1
        if count_lessthan50k >= num_images_threshold and count_morethan50k >= num_images_threshold:
            break
X = np.array(X)
This converts the string data to numeric data:
label_encoder = []
X_encoded = np.empty(X.shape)
for i, item in enumerate(X[0]):
    if item.isdigit():
        X_encoded[:, i] = X[:, i]
    else:
        label_encoder.append(preprocessing.LabelEncoder())
        X_encoded[:, i] = label_encoder[-1].fit_transform(X[:, i])
X = X_encoded[:, :-1].astype(int)
y = X_encoded[:, -1].astype(int)
Cross-validating the data:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=5)
classifier_gaussiannb = GaussianNB()
classifier_gaussiannb.fit(X_train, y_train)
y_test_pred = classifier_gaussiannb.predict(X_test)
Testing the encoding on a single data instance:
input_data = ['39', 'State-gov', '77516', 'Bachelors', '13', 'Never-married', 'Adm-clerical', 'Not-in-family', 'White', 'Male', '2174', '0', '40', 'United-States']
count = 0
input_data_encoded = [-1] * len(input_data)
for i, item in enumerate(input_data):
    if item.isdigit():
        input_data_encoded[i] = int(input_data[i])
    else:
        input_data_encoded[i] = int(label_encoder[count].transform(input_data[i]))
        count = count + 1
input_data_encoded = np.array(input_data_encoded)
I have gone through the sklearn documentation, but it didn't work for me. Any help?
Answer 0 (score: 0)
LabelEncoder's transform() expects an array-like holding all the samples to transform at once, as stated in the documentation:
Transform labels to normalized encoding.
Parameters: y : array-like of shape [n_samples]
    Target values.
If you want to pass one value at a time, you need to wrap it in a list, like this:
    else:
        input_data_encoded[i] = int(label_encoder[count].transform([input_data[i]]))
Note the extra square brackets around input_data[i].
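To see the fix in isolation, here is a minimal sketch, assuming scikit-learn and NumPy are installed. The toy column `workclass_column` and the helper `encode_single` are hypothetical names used only for illustration; they are not part of the original code.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy column standing in for one categorical feature.
workclass_column = np.array(["State-gov", "Private", "Self-emp", "Private"])

encoder = LabelEncoder()
encoder.fit(workclass_column)


def encode_single(fitted_encoder, value):
    """Encode one string with an already fitted LabelEncoder (illustrative helper)."""
    # transform() expects an array-like of shape [n_samples], so the single
    # value is wrapped in a list and the first element of the result is taken.
    return int(fitted_encoder.transform([value])[0])


code = encode_single(encoder, "State-gov")
# Round trip: inverse_transform maps the integer code back to the string.
original = encoder.inverse_transform([code])[0]
print(code, original)
```

Indexing the result with [0] before calling int() is a small variation on the line above; it avoids converting a one-element array directly to a scalar, which recent NumPy releases warn about.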