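I have been stuck on this all day. I am following this tutorial, which implements GaussianNB in sklearn on the public income dataset.
What I am trying to figure out is how to make predictions on the fly once the model has been trained.
Sample rows from the dataset: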
39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K
50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K
38, Private, 215646, HS-grad, 9, Divorced, Handlers-cleaners, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K
53, Private, 234721, 11th, 7, Married-civ-spouse, Handlers-cleaners, Husband, Black, Male, 0, 0, 40, United-States, <=50K
Their code:
# Required Python Machine learning Packages
import pandas as pd
import numpy as np
# For preprocessing the data
from sklearn import preprocessing
# To split the dataset into train and test datasets
from sklearn.model_selection import train_test_split
# To model the Gaussian Naive Bayes classifier
from sklearn.naive_bayes import GaussianNB
# To calculate the accuracy score of the model
from sklearn.metrics import accuracy_score
import os
adult_df = pd.read_csv(os.path.join(os.path.dirname(__file__), "adult.data"),
                       header=None, delimiter=' *, *', engine='python')
adult_df.columns = ['age', 'workclass', 'fnlwgt', 'education', 'education_num',
                    'marital_status', 'occupation', 'relationship',
                    'race', 'sex', 'capital_gain', 'capital_loss',
                    'hours_per_week', 'native_country', 'income']
adult_df_rev = adult_df
adult_df_rev.describe(include='all')
for value in ['workclass', 'education',
              'marital_status', 'occupation',
              'relationship', 'race', 'sex',
              'native_country', 'income']:
    # Replace the '?' placeholder with the column's most frequent value
    adult_df_rev[value] = adult_df_rev[value].replace(
        '?', adult_df_rev[value].describe()['top'])
le = preprocessing.LabelEncoder()
workclass_cat = le.fit_transform(adult_df.workclass)
education_cat = le.fit_transform(adult_df.education)
marital_cat = le.fit_transform(adult_df.marital_status)
occupation_cat = le.fit_transform(adult_df.occupation)
relationship_cat = le.fit_transform(adult_df.relationship)
race_cat = le.fit_transform(adult_df.race)
sex_cat = le.fit_transform(adult_df.sex)
native_country_cat = le.fit_transform(adult_df.native_country)
# initialize the encoded categorical columns
adult_df_rev['workclass_cat'] = workclass_cat
adult_df_rev['education_cat'] = education_cat
adult_df_rev['marital_cat'] = marital_cat
adult_df_rev['occupation_cat'] = occupation_cat
adult_df_rev['relationship_cat'] = relationship_cat
adult_df_rev['race_cat'] = race_cat
adult_df_rev['sex_cat'] = sex_cat
adult_df_rev['native_country_cat'] = native_country_cat
# drop the old categorical columns from dataframe
dummy_fields = ['workclass', 'education', 'marital_status',
                'occupation', 'relationship', 'race',
                'sex', 'native_country']
adult_df_rev = adult_df_rev.drop(dummy_fields, axis=1)
adult_df_rev = adult_df_rev.reindex(columns=['age', 'workclass_cat', 'fnlwgt', 'education_cat',
                                             'education_num', 'marital_cat', 'occupation_cat',
                                             'relationship_cat', 'race_cat', 'sex_cat', 'capital_gain',
                                             'capital_loss', 'hours_per_week', 'native_country_cat',
                                             'income'])
adult_df_rev.head(1)
num_features = ['age', 'workclass_cat', 'fnlwgt', 'education_cat', 'education_num',
                'marital_cat', 'occupation_cat', 'relationship_cat', 'race_cat',
                'sex_cat', 'capital_gain', 'capital_loss', 'hours_per_week',
                'native_country_cat']
scaled_features = {}
for each in num_features:
    # Standardize each column and keep its mean and std for later use
    mean, std = adult_df_rev[each].mean(), adult_df_rev[each].std()
    scaled_features[each] = [mean, std]
    adult_df_rev[each] = (adult_df_rev[each] - mean) / std
features = adult_df_rev.values[:, :14]
target = adult_df_rev.values[:, 14]
features_train, features_test, target_train, target_test = train_test_split(
    features, target, test_size=0.33, random_state=10)
clf = GaussianNB()
clf.fit(features_train, target_train)
target_pred = clf.predict(features_test)
print(target_pred)
print(accuracy_score(target_test, target_pred, normalize = True))
Suppose I now receive the following data from somewhere (39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K) and want to use the model to make a prediction for it. Could you show me how to do that?
I tried
clf.predict([[39, "State-gov", 77516, "Bachelors", 13, "Never-married", "Adm-clerical", "Not-in-family", "White", "Male", 2174, 0, 40, "United-States", "<=50K"]])
, but I get nothing except a lot of errors.
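For context, the call above passes raw strings and 15 values (including the income label), while the model was fitted on 14 label-encoded, standardized numeric features. A new row therefore probably needs the same preprocessing as the training data before it reaches clf.predict. The sketch below is one possible way to do that, not the tutorial's own method. It assumes one LabelEncoder is kept per categorical column (the tutorial reuses a single `le`, so the earlier mappings are overwritten) and that the training means and standard deviations are reused. The names `encoders`, `encode`, and `predict_raw` are hypothetical. The four sample rows from the question stand in for adult.data, and because all four are labelled <=50K the prediction here is trivial; the point is the preprocessing path.

```python
import pandas as pd
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import LabelEncoder

COLUMNS = ['age', 'workclass', 'fnlwgt', 'education', 'education_num',
           'marital_status', 'occupation', 'relationship', 'race', 'sex',
           'capital_gain', 'capital_loss', 'hours_per_week',
           'native_country', 'income']
CATEGORICAL = ['workclass', 'education', 'marital_status', 'occupation',
               'relationship', 'race', 'sex', 'native_country']
FEATURES = COLUMNS[:-1]  # every column except the 'income' label

# The four sample rows from the question, used here in place of adult.data.
rows = [
    [39, 'State-gov', 77516, 'Bachelors', 13, 'Never-married', 'Adm-clerical',
     'Not-in-family', 'White', 'Male', 2174, 0, 40, 'United-States', '<=50K'],
    [50, 'Self-emp-not-inc', 83311, 'Bachelors', 13, 'Married-civ-spouse',
     'Exec-managerial', 'Husband', 'White', 'Male', 0, 0, 13, 'United-States', '<=50K'],
    [38, 'Private', 215646, 'HS-grad', 9, 'Divorced', 'Handlers-cleaners',
     'Not-in-family', 'White', 'Male', 0, 0, 40, 'United-States', '<=50K'],
    [53, 'Private', 234721, '11th', 7, 'Married-civ-spouse', 'Handlers-cleaners',
     'Husband', 'Black', 'Male', 0, 0, 40, 'United-States', '<=50K'],
]
train_df = pd.DataFrame(rows, columns=COLUMNS)

# Assumption: one fitted encoder per categorical column, kept for later reuse.
encoders = {col: LabelEncoder().fit(train_df[col]) for col in CATEGORICAL}


def encode(df):
    """Return the 14 feature columns with categoricals as numeric codes."""
    out = df[FEATURES].copy()
    for col in CATEGORICAL:
        out[col] = encoders[col].transform(out[col])
    return out.astype(float)


encoded = encode(train_df)
# Training-set statistics, reused unchanged at prediction time.
means = encoded.mean()
# Guard for this tiny sample only: constant columns have a std of 0.
stds = encoded.std().replace(0, 1.0)

clf = GaussianNB()
clf.fit(((encoded - means) / stds).values, train_df['income'])


def predict_raw(raw_row):
    """Predict for one raw row of 14 feature values (no income label)."""
    new_df = pd.DataFrame([raw_row], columns=FEATURES)
    scaled = (encode(new_df) - means) / stds
    return clf.predict(scaled.values)


prediction = predict_raw([39, 'State-gov', 77516, 'Bachelors', 13,
                          'Never-married', 'Adm-clerical', 'Not-in-family',
                          'White', 'Male', 2174, 0, 40, 'United-States'])
print(prediction)
```

Note that LabelEncoder.transform raises a ValueError for a category it did not see during fitting, so a real version would need to decide how to handle unseen values.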