Question

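I am trying to classify the wine data set here - http://archive.ics.uci.edu/ml/datasets/Wine+Quality - using logistic regression (with method='bfgs' and the l1 norm), and I hit a singular matrix error (raise LinAlgError('Singular matrix')), even though the matrix has full rank [which I tested using np.linalg.matrix_rank(data[train_cols].values)].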

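This is how I came to the conclusion that some features might be linear combinations of others. To test this, I experimented with grid search / LinearSVC - and I get the error below. My code and dataset follow.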

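I can see that only 6 or 7 features are actually "independent" - I infer this by comparing x_train_new[0] with the rows of x_train (so that I can work out which columns are redundant).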

    # Train & test DATA CREATION
    from sklearn.svm import LinearSVC
    import numpy, random
    import pandas as pd
    df = pd.read_csv("https://github.com/ekta1007/Predicting_wine_quality/blob/master/wine_red_dataset.csv") #,skiprows=0, sep=',')


    df=df.dropna(axis=1,how='any') # also tried how='all' - still get NaN errors as below
    header=list(df.columns.values) # or df.columns
    X = df[df.columns - [header[-1]]] # header[-1] = ['quality'] - this is to make the code generic enough
    Y = df[header[-1]] # df['quality']
    rows = random.sample(df.index, int(len(df)*0.7)) # indexing the rows that will be picked in the train set
    x_train, y_train = X.ix[rows],Y.ix[rows] # Fetching the data frame using indexes
    x_test,y_test  = X.drop(rows),Y.drop(rows)


    # Training the classifier using linear Support Vector Classification (LinearSVC)
    clf = LinearSVC(C=0.01, penalty="l1", dual=False) #,tol=0.0001,fit_intercept=True, intercept_scaling=1)
    clf.fit(x_train, y_train)
    x_train_new = clf.fit_transform(x_train, y_train)
    #print x_train_new #works
    clf.predict(x_test) # does NOT work and gives NaN errors for some x_tests


    clf.score(x_test, y_test) # Does NOT work
    clf.coef_ # Works, but I am not sure if this is OK given the many NaNs - or do the coefficients get impacted?
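For readers on a current pandas release: the split above relies on older idioms such as `.ix`, which was removed in pandas 1.0. The following is only a sketch of an equivalent 70/30 split, run on a small made-up frame rather than the wine CSV; the toy column values and the `random_state` are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for the wine data (values are made up for illustration)
df = pd.DataFrame({
    "alcohol": [9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5],
    "pH": [3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39, 3.36, 3.35],
    "quality": [5, 5, 5, 7, 5, 5, 5, 7, 7, 5],
})

target = df.columns[-1]            # last column is the label, as in the question
X = df.drop(columns=[target])      # every column except the label
Y = df[target]

# Pick 70% of the row labels at random, then select by label with .loc
train_rows = df.sample(frac=0.7, random_state=0).index
x_train, y_train = X.loc[train_rows], Y.loc[train_rows]
x_test, y_test = X.drop(train_rows), Y.drop(train_rows)
```

Selecting by label with `.loc` keeps the features and the labels of each split on the same row index.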

    clf.predict(x_train)
    552   NaN
    209   NaN
    427   NaN
    288   NaN
    175   NaN
    427   NaN
    748     7
    552   NaN
    429   NaN
    [... and MORE]
    Name: quality, Length: 1119

    clf.predict(x_test)
    76    NaN
    287   NaN
    420     7
    812   NaN
    443     7
    420     7
    430   NaN
    373     5
    624     5
    [..and More]
    Name: quality, Length: 480

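The strange thing is that when I run clf.predict(x_train), I still see some NaNs - what am I doing wrong? After all, the model was trained on this data, so this should not happen, right?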

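Following this thread (How to fix "NaN or infinity" issue for sparse matrix in python?), I also checked that there are no nulls in my csv file (though I did relabel "quality" to just the labels 5 and 7, down from range(3,10)).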

Also, here are the data types of x_test and y_test (and the train sets):

    x_test
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 480 entries, 1 to 1596
    Data columns:
    alcohol                 480  non-null values
    chlorides               480  non-null values
    citric acid             480  non-null values
    density                 480  non-null values
    fixed acidity           480  non-null values
    free sulfur dioxide     480  non-null values
    pH                      480  non-null values
    residual sugar          480  non-null values
    sulphates               480  non-null values
    total sulfur dioxide    480  non-null values
    volatile acidity        480  non-null values
    dtypes: float64(11)

    y_test
    1     5
    10    5
    18    5
    21    5
    30    5
    31    7
    36    7
    40    5
    50    5
    52    7
    53    5
    55    5
    57    5
    60    5
    61    5
    [..And MORE]
    Name: quality, Length: 480

Finally:

    clf.score(x_test, y_test)
    Traceback (most recent call last):
      File "<pyshell#31>", line 1, in <module>
        clf.score(x_test, y_test)
      File "C:\Python27\lib\site-packages\sklearn\base.py", line 279, in score
        return accuracy_score(y, self.predict(X))
      File "C:\Python27\lib\site-packages\sklearn\metrics\metrics.py", line 742, in accuracy_score
        y_true, y_pred = check_arrays(y_true, y_pred)
      File "C:\Python27\Lib\site-packages\sklearn\utils\validation.py", line 215, in check_arrays
      File "C:\Python27\Lib\site-packages\sklearn\utils\validation.py", line 18, in _assert_all_finite
    ValueError: Array contains NaN or infinity.

    # I also explicitly checked for NaN's as here:
    for i in df.columns:
        df[i].isnull()
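A per-column loop over `isnull()` only yields boolean Series and does not cover infinities, which the error message also mentions. As a rough sketch, a check could count both NaN and infinite values per column. The helper name `summarize_non_finite` and the toy frame below are made up for illustration and are not part of the original code.

```python
import numpy as np
import pandas as pd


def summarize_non_finite(frame):
    """Count NaN and +/-inf values per column (hypothetical helper)."""
    values = frame.to_numpy(dtype=float)
    return pd.DataFrame(
        {
            "nan": np.isnan(values).sum(axis=0),
            "inf": np.isinf(values).sum(axis=0),
        },
        index=frame.columns,
    )


# Toy stand-in for the wine data, with one NaN and one inf planted on purpose
toy = pd.DataFrame({
    "alcohol": [9.4, 9.8, np.nan, 10.0],
    "pH": [3.51, np.inf, 3.26, 3.16],
    "quality": [5, 5, 7, 7],
})

report = summarize_non_finite(toy)
all_finite = bool(np.isfinite(toy.to_numpy(dtype=float)).all())
print(report)
print("all finite:", all_finite)
```

If such a check reports zero everywhere and `all_finite` is true for the inputs, the NaNs are more likely introduced downstream of the data, for example where prediction results are compared with or wrapped in pandas objects.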

Tip: please also mention whether my thought process on using LinearSVC is correct for my use case, or whether I should use grid search instead.

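Disclaimer: parts of this code were built on suggestions from similar contexts on StackOverflow and other sources - my real use case is just trying to assess whether this method is a good fit for my scenario. That's all.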

Answer 1

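This worked. All I really had to change was to use x_test.values, and likewise `.values` for the rest of the pandas DataFrames (x_train, y_train, y_test). As pointed out, the only reason was the incompatibility between pandas DataFrames and scikit-learn (which uses numpy arrays).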

    # changing your pandas DataFrame to work with scikit-learn by converting it to a numpy array
    >>> type(x_test)
    <class 'pandas.core.frame.DataFrame'>
    >>> type(x_test.values)
    <type 'numpy.ndarray'>

This hack comes from this post, http://python.dzone.com/articles/python-making-scikit-learn-and, and from @AndreasMueller, who pointed out the inconsistency.
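Put together, the `.values` fix might look roughly like the sketch below. It uses synthetic data rather than the wine CSV. The column names, the random seed and the 140/60 split are assumptions for illustration, and only the `LinearSVC` settings are taken from the question.

```python
import numpy as np
import pandas as pd
from sklearn.svm import LinearSVC

# Synthetic two-class data standing in for the relabelled wine set (labels 5 and 7)
rng = np.random.RandomState(0)
features = pd.DataFrame(rng.normal(size=(200, 4)), columns=["a", "b", "c", "d"])
labels = pd.Series(np.where(features["a"] + features["b"] > 0, 7, 5), name="quality")

# Shuffled, non-contiguous row labels, similar to a random 70% sample
train_rows = rng.permutation(len(features))[:140]
x_train, y_train = features.loc[train_rows], labels.loc[train_rows]
x_test, y_test = features.drop(train_rows), labels.drop(train_rows)

# Hand plain numpy arrays to scikit-learn, as the answer suggests
clf = LinearSVC(C=0.01, penalty="l1", dual=False)
clf.fit(x_train.values, y_train.values)

predictions = clf.predict(x_test.values)            # numpy array, no pandas index
accuracy = clf.score(x_test.values, y_test.values)
print(predictions[:10], accuracy)
```

Because the inputs are bare arrays, the predictions come back as a numpy array with no row index attached, so there is nothing for pandas to realign.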

ValueError: Array contains NaN or infinity in _assert_all_finite during LinearSVC
