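I'm new to applied machine learning and have a dataset that contains the cocoa percentage of chocolate bars. But when I pass that column to the fit() function of KNeighborsClassifier, it throws an error. This is my code: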
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
import pandas as pd
choco = pd.read_csv('flavors_of_cacao.csv')
X = choco['Cocoa']
y = choco['Name']
X_train, y_train, X_test, y_test = train_test_split(X, y, random_state=0)
knn = KNeighborsClassifier(n_neighbors = 5)
knn.fit(X_train, y_train)
That is my code. The traceback is:
Traceback (most recent call last):
  File "/home/himanshu/ML Tut-2/ML_tut2.py", line 14, in <module>
    knn.fit(X_train, y_train)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/neighbors/base.py", line 765, in fit
    X, y = check_X_y(X, y, "csr", multi_output=True)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 552, in check_X_y
    check_consistent_length(X, y)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 173, in check_consistent_length
    " samples: %r" % [int(l) for l in lengths])
ValueError: Found input variables with inconsistent numbers of samples: [1346, 449]
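Clearly, fit() expects floats in the Cocoa column, but it gets values with a '%' sign, which cannot be converted to float without being cleaned up first.
Please help me solve this.
Edit:
I have modified my CSV and removed the '%' signs from it, but now I get an error with the following code: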
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
import pandas as pd
import numpy as np
df = pd.read_csv('chocos.csv')
X = df[['Cocoa']]
y = df['Name']
X_train, y_train, X_test, y_test = train_test_split(X, y, random_state=0)
knn = KNeighborsClassifier(n_neighbors = 5)
knn.fit(X_train, y_train)
That is the new code.
For reference, the dataset is this.
Answer 0 (score: 0)
Use only the values in that column, without the percent sign:
X = [[float(val.replace('%',''))] for val in choco['Cocoa']]
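The same conversion can be done with pandas string methods. The sample counts in the traceback (1346 and 449) also appear consistent with the unpacking order in the question: train_test_split returns X_train, X_test, y_train, y_test, so the name y_train in the question would actually hold the test features. Below is a minimal sketch of both fixes. It assumes the Cocoa column holds strings such as '70%'; the small DataFrame is a hypothetical stand-in for the CSV, and n_neighbors is lowered to 3 only because the stand-in has so few rows.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical stand-in for the CSV: 'Cocoa' holds strings such as '70%'.
choco = pd.DataFrame({
    'Cocoa': ['70%', '72%', '75%', '63%', '85%', '70%', '64%', '80%'],
    'Name': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'],
})

# Strip the trailing '%' and convert to float. The double brackets keep X
# two-dimensional, with shape (n_samples, 1), as fit() expects.
choco['Cocoa'] = choco['Cocoa'].str.rstrip('%').astype(float)
X = choco[['Cocoa']]
y = choco['Name']

# train_test_split returns X_train, X_test, y_train, y_test, in that order.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_neighbors=3 only because this toy training set has six rows.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
predictions = knn.predict(X_test)
```

With the real file, replace the stand-in DataFrame with pd.read_csv('flavors_of_cacao.csv') and keep the rest unchanged.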