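I have read a lot about this particular error but haven't been able to find an answer that solves my problem. I have a dataset that I've split into train and test sets, and I'm looking to run KNeighborsClassifier. My code is below... My problem is that when I look at the dtypes of my X_train, I don't see any string-formatted columns at all. My y_train is a categorical variable. This is my first stackoverflow post, so my apologies if I've overlooked any formalities, and thank you for your help! :)
Error: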
TypeError: unorderable types: str() > float()
Dtypes:
X_train.dtypes.value_counts()
Out[54]:
int64 2035
float64 178
dtype: int64
Code:
# Import Packages
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.dummy import DummyRegressor
from sklearn.cross_validation import train_test_split, KFold
from matplotlib.ticker import FormatStrFormatter
from sklearn import cross_validation
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
import pdb
# Set Directory Path
path = "file_path"
os.chdir(path)
#Select Import File
data = 'RawData2.csv'
delim = ','
#Import Data File
df = pd.read_csv(data, sep = delim)
print (df.head())
df.columns.get_loc('Categories')
#Model
#Select/Update Features
X = df[df.columns[14:2215]]
#Get Column Index for Target Variable
df.columns.get_loc('Categories')
#Select Target and fill na's with "Small" label
y = df[df.columns[21]]
print(y.values)
y.fillna('Small')
#Training/Test Set
X_sample = X.loc[X.Var1 <1279]
X_valid = X.loc[X.Var1 > 1278]
y_sample = y.head(len(X_sample))
y_test = y.head(len(y)-len(X_sample))
X_train, X_test, y_train, y_test = train_test_split(X_sample, y_sample, test_size = 0.2)
cv = KFold(n = X_train.shape[0], n_folds = 5, random_state = 17)
print(X_train.shape, y_train.shape)
X_train.dtypes.value_counts()
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
knn = KNeighborsClassifier(n_neighbors = 5)
knn.fit(X_train, y_train)  # <-- This is where the error is flagged
accuracy_score(y_test, knn.predict(X_test))
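Answer 0 (score: 0)
Everything in sklearn is built on numpy, which works with numbers only. Categorical X and y therefore need to be encoded as numbers. For X, you can use get_dummies. For y, you can use LabelEncoder.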
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html
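A minimal sketch of the encoding described above, using a small made-up dataset in place of the question's RawData2.csv. The column names `Var2` and `Colour` and all the values are hypothetical. It also assumes that the target still contains NaN next to string labels. That may or may not match the asker's data, but `fillna` returns a new Series rather than modifying `y` in place, so the result has to be assigned back before the labels are encoded.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data standing in for the question's X and y.
X = pd.DataFrame({
    "Var1": [1, 2, 3, 4, 5, 6],
    "Var2": [0.5, 1.5, 2.5, 3.5, 4.5, 5.5],
    "Colour": ["red", "blue", "red", "blue", "red", "blue"],  # categorical feature
})
y = pd.Series(["Large", np.nan, "Small", "Large", np.nan, "Small"],
              name="Categories")

# fillna returns a new Series; assign it back so the NaN values
# (floats) no longer sit next to the string labels.
y = y.fillna("Small")

# One-hot encode the categorical feature columns. Numeric columns pass
# through unchanged; dtype=float keeps the dummy columns numeric.
X_encoded = pd.get_dummies(X, dtype=float)

# Map the string labels to integers 0..n_classes-1.
encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y)

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_encoded, y_encoded)

# Convert the numeric predictions back to the original label strings.
predicted_labels = encoder.inverse_transform(knn.predict(X_encoded))
```

Keeping the fitted `encoder` around is what makes it possible to turn the integer predictions back into readable labels with `inverse_transform`.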