scikit-learn fit() raises an error after standardizing the data

Time: 2015-01-11 19:14:08

Tags: python numpy pandas scikit-learn svm

I have been trying to do the following:

  1. Create the X features and the y dependent variable from a dataset
  2. Split the dataset
  3. Normalise the data
  4. Train using SVR from Scikit-learn

    Here is the code, using a pandas DataFrame filled with random values:

    import pandas as pd
    import numpy as np
    df = pd.DataFrame(np.random.rand(20,5), columns=["A","B","C","D", "E"])
    a = list(df.columns.values)
    a.remove("A")
    
    X = df[a]
    y = df["A"]
    
    X_train = X.iloc[0: floor(2 * len(X) /3)]
    X_test = X.iloc[floor(2 * len(X) /3):]
    y_train = y.iloc[0: floor(2 * len(y) /3)]
    y_test = y.iloc[floor(2 * len(y) /3):]
    
    # normalise
    
    from sklearn import preprocessing
    
    X_trainS = preprocessing.scale(X_train)
    X_trainN = pd.DataFrame(X_trainS, columns=a)
    
    X_testS = preprocessing.scale(X_test)
    X_testN = pd.DataFrame(X_testS, columns=a)
    
    y_trainS = preprocessing.scale(y_train)
    y_trainN = pd.DataFrame(y_trainS)
    
    y_testS = preprocessing.scale(y_test)
    y_testN = pd.DataFrame(y_testS)
    
    import sklearn
    from sklearn.svm import SVR
    
    clf = SVR(kernel='rbf', C=1e3, gamma=0.1)
    
    pred = clf.fit(X_trainN,y_trainN).predict(X_testN)
    

    This gives the following error:

      

    C:\Anaconda3\lib\site-packages\pandas\core\index.py:542: FutureWarning: slice indexers when using iloc should be integers and not floating point
      "and not floating point",FutureWarning)
    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    in ()
         34 clf = SVR(kernel='rbf', C=1e3, gamma=0.1)
         35
    ---> 36 pred = clf.fit(X_trainN,y_trainN).predict(X_testN)
         37

    C:\Anaconda3\lib\site-packages\sklearn\svm\base.py in fit(self, X, y, sample_weight)
        174
        175         seed = rnd.randint(np.iinfo('i').max)
    --> 176         fit(X, y, sample_weight, solver_type, kernel, random_seed=seed)
        177         # see comment on the other call to np.iinfo in this file
        178

    C:\Anaconda3\lib\site-packages\sklearn\svm\base.py in _dense_fit(self, X, y, sample_weight, solver_type, kernel, random_seed)
        229                 cache_size=self.cache_size, coef0=self.coef0,
        230                 gamma=self._gamma, epsilon=self.epsilon,
    --> 231                 max_iter=self.max_iter, random_seed=random_seed)
        232
        233         self._warn_from_fit_status()

    C:\Anaconda3\lib\site-packages\sklearn\svm\libsvm.pyd in sklearn.svm.libsvm.fit (sklearn\svm\libsvm.c:1864)()

    ValueError: Buffer has wrong number of dimensions (expected 1, got 2)

    I'm not sure why. Can anyone explain? I think it has something to do with converting back to a DataFrame after preprocessing.

1 answer:

Answer 0: (score: 4)

The error here lies in the df you are passing as the labels: y_trainN

Compare the y from the sample docs with the one in your code:

In [40]:

n_samples, n_features = 10, 5
np.random.seed(0)
y = np.random.randn(n_samples)
print(y)
y_trainN.values
[ 1.76405235  0.40015721  0.97873798  2.2408932   1.86755799 -0.97727788
  0.95008842 -0.15135721 -0.10321885  0.4105985 ]
Out[40]:
array([[-0.06680594],
       [ 0.23535043],
       [-1.49265082],
       [ 1.22537862],
       [-0.46499134],
       [-0.23744759],
       [ 1.40520679],
       [ 0.95882677],
       [ 1.66996413],
       [-0.37515955],
       [-0.75826444],
       [-1.45945337],
       [-0.63995369]])
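
The difference is easiest to see from the shapes. Below is a minimal sketch, assuming a 13-row single-column DataFrame as a stand-in for y_trainN; the names y_docs and y_frame and the random values are illustrative and not taken from the question:

```python
import numpy as np
import pandas as pd

# Stand-in for the docs example: a 1-D target of shape (n_samples,)
y_docs = np.random.randn(10)

# Hypothetical stand-in for y_trainN: a single-column DataFrame
y_frame = pd.DataFrame(np.random.randn(13))

print(y_docs.shape)             # (10,)    one dimension
print(y_frame.values.shape)     # (13, 1)  two dimensions
print(y_frame.squeeze().shape)  # (13,)    squeeze() returns a Series
print(y_frame[0].shape)         # (13,)    selecting column 0 also returns a Series
```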

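The y in the docs is a 1-D array, whereas y_trainN.values is a 2-D array with a single column, which is what the "expected 1, got 2" in the error refers to. So you can either select the only column in the df or call squeeze to produce a Series, and either way there is no error: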

pred = clf.fit(X_trainN,y_trainN[0]).predict(X_testN)

pred = clf.fit(X_trainN,y_trainN.squeeze()).predict(X_testN)

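One could argue that a df with only a single column should return something that can be coerced into a 1-D numpy array, or that numpy is not calling the array attribute correctly, but really you should pass a Series, or select the column from the df, as the argument.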
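
Putting this together, here is one way the question's script could look with a 1-D target. It is a sketch under a few assumptions, not a definitive version: the fixed seed, the `features` and `split` names, and the use of integer division (which also avoids the float-slice FutureWarning) are my own choices, and the preprocessing is otherwise left as in the question.

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.svm import SVR

rng = np.random.RandomState(0)  # fixed seed, assumed here only for repeatability
df = pd.DataFrame(rng.rand(20, 5), columns=["A", "B", "C", "D", "E"])

features = [c for c in df.columns if c != "A"]
X, y = df[features], df["A"]

split = 2 * len(X) // 3  # integer division keeps the iloc slice bounds as ints
X_train, X_test = X.iloc[:split], X.iloc[split:]
y_train = y.iloc[:split]

X_trainN = pd.DataFrame(preprocessing.scale(X_train), columns=features)
X_testN = pd.DataFrame(preprocessing.scale(X_test), columns=features)

# Keep the scaled target as a 1-D Series instead of a one-column DataFrame
y_trainN = pd.Series(preprocessing.scale(y_train.to_numpy()))

clf = SVR(kernel="rbf", C=1e3, gamma=0.1)
pred = clf.fit(X_trainN, y_trainN).predict(X_testN)
print(pred.shape)  # (7,)
```

Wrapping the scaled target in pd.Series rather than pd.DataFrame is the only change that matters for the ValueError; the rest mirrors the original steps.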