Edit an entire row in a .csv using Pandas and feed it to KNeighborsClassifier

Time: 2018-02-06 15:13:25

Tags: python pandas csv scikit-learn knn

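I'm new to applied machine learning and have a dataset that contains the cocoa percentage of chocolate bars. But when I feed that column to the fit() function of KNeighborsClassifier, it throws an error. My code is: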

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
import pandas as pd

choco = pd.read_csv('flavors_of_cacao.csv')

X = choco['Cocoa']
y = choco['Name']

X_train, y_train, X_test, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors = 5)
knn.fit(X_train, y_train)

It throws the following error:

Traceback (most recent call last):
  File "/home/himanshu/ML Tut-2/ML_tut2.py", line 14, in <module>
    knn.fit(X_train, y_train)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/neighbors/base.py", line 765, in fit
    X, y = check_X_y(X, y, "csr", multi_output=True)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 552, in check_X_y
    check_consistent_length(X, y)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 173, in check_consistent_length
    " samples: %r" % [int(l) for l in lengths])
ValueError: Found input variables with inconsistent numbers of samples: [1346, 449]

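Clearly, the fit() function expects floats in the Cocoa column, but it receives numbers with a '%' sign, which cannot be converted to float without first being processed.

Please help me solve this problem.

Edit: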

I have modified my CSV and removed the '%' signs from it, but now I get an error with the following code:

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
import pandas as pd
import numpy as np

df = pd.read_csv('chocos.csv')

X = df[['Cocoa']]
y = df['Name']

X_train, y_train, X_test, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors = 5)
knn.fit(X_train, y_train)

The code above is the new version.

For reference, the dataset is this

1 Answer:

Answer 0: (score: 0)

Use only the values in that column, without the percent sign:

X = [[float(val.replace('%',''))] for val in choco['Cocoa']]
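The same conversion can also be done with pandas string methods. The sketch below is only an illustration under stated assumptions: the small DataFrame is a hypothetical stand-in for the CSV (only the column names 'Cocoa' and 'Name' come from the question), and the values in it are made up. It also unpacks train_test_split in the order the function returns its results (X_train, X_test, y_train, y_test), which differs from the order used in the question and would be consistent with the "inconsistent numbers of samples: [1346, 449]" message in the traceback.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical stand-in for the CSV from the question; the values are
# invented for illustration, only the column names are taken from it.
choco = pd.DataFrame({
    'Cocoa': ['63%', '70%', '70%', '72%', '75%', '80%',
              '85%', '55%', '64%', '71%', '73%', '88%'],
    'Name': ['A', 'B', 'B', 'B', 'C', 'C',
             'C', 'A', 'A', 'B', 'B', 'C'],
})

# Strip the trailing percent sign and convert the strings to floats.
choco['Cocoa'] = choco['Cocoa'].str.rstrip('%').astype(float)

# Double brackets keep X two-dimensional, as fit() expects.
X = choco[['Cocoa']]
y = choco['Name']

# train_test_split returns the pieces in this order:
# X_train, X_test, y_train, y_test
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
predictions = knn.predict(X_test)
```

With the unpacking order from the question (X_train, y_train, X_test, y_test), the name y_train would actually hold the test portion of X, so fit() would receive two inputs of different lengths.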