Errors with PCA and Randomized Lasso

Time: 2017-06-24 17:21:28

Tags: python machine-learning scikit-learn pca feature-detection

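I have two .csv files containing tweets and a classification for each tweet: the class column holds the label (pos, neg or neutral) and the text column holds the tweet.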

Here is my code:

def prediction():
    print("Reading files...")

    #Will learn from this data set.
    train = file2SentencesArray('twitter-sanders-apple3')

    #Test dataset.
    test = file2SentencesArray('twitter-sanders-apple2')
    print("Complete!")

    print("Cleaning sentences...")
    #cleanSentences removes HTML, stop words and some characters.
    cleanTrainSentences = cleanSentences(train["text"])
    cleanTestSentences = cleanSentences(test["text"])
    print("Complete!...")

    print("Fiting sentences...")
    vectorizer = CountVectorizer(analyzer="word", tokenizer=None, preprocessor=None, stop_words=None, max_features=5000)
    trainDataFeatures = vectorizer.fit_transform(cleanTrainSentences)
    np.asarray(trainDataFeatures)

    testDataFeatures = vectorizer.transform(cleanTestSentences)
    np.asarray(testDataFeatures)

    #Getting error here.
    randomized_lasso = RandomizedLasso()
    randomized_lasso.fit_transform(trainDataFeatures, testDataFeatures)
    trainDataFeatures = randomized_lasso.transform(trainDataFeatures)

    #and here.
    #pca = decomposition.PCA(n_components=2)
    #pca.fit_transform(trainDataFeatures)
    #trainDataFeatures = pca.transform(trainDataFeatures)
    print("Complete!")

    print("Predicting...")
    forest = RandomForestClassifier(n_estimators=100)
    forest = forest.fit(trainDataFeatures, train["class"])
    result = forest.predict(testDataFeatures)
    print("Complete...")

    return result

Both Randomized Lasso and PCA throw exceptions:

PCA - PCA does not support sparse input.

Randomized Lasso - bad input shape

My trainDataFeatures looks like this:

(0, 573)   1
(0, 1411)  2
(0, 2748)  1
(0, 1073)  1
(1, 126)   1
(2, 1203)  1

1 answer:

Answer 0 (score: 0)

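The input to PCA and Randomized Lasso is not in the right format. CountVectorizer returns a SciPy sparse matrix, and calling np.asarray on it does not produce a dense numeric array (the return value is discarded as well). Replace the following two lines and try again.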

np.asarray(trainDataFeatures)
np.asarray(testDataFeatures)
# replace the above two lines with these
trainDataFeatures = trainDataFeatures.toarray()
testDataFeatures = testDataFeatures.toarray()
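
As a rough illustration of the fix above, the following sketch runs the same pipeline on a few made-up sentences, with PCA as the reduction step. The sample sentences, labels and variable names are hypothetical stand-ins for the contents of the .csv files, and random_state is added only to make the run repeatable. Two further points are worth checking in the original code, and they are suggestions rather than a confirmed diagnosis: the second argument of fit_transform is the target y, so passing testDataFeatures there is a likely source of the "bad input shape" error, and the test features must go through the same fitted transform as the training features before predict is called.

```python
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in data; the real code reads tweets from the .csv files.
train_text = ["love the new phone", "hate the battery life", "the phone is ok",
              "love the screen", "hate the price", "the update is ok"]
train_class = ["pos", "neg", "neutral", "pos", "neg", "neutral"]
test_text = ["love the battery", "hate the screen"]

vectorizer = CountVectorizer(analyzer="word", max_features=5000)
# fit_transform/transform return SciPy sparse matrices; toarray() makes them dense.
train_features = vectorizer.fit_transform(train_text).toarray()
test_features = vectorizer.transform(test_text).toarray()

# Fit PCA on the training features only, then apply the same projection to both sets.
pca = PCA(n_components=2)
train_reduced = pca.fit_transform(train_features)
test_reduced = pca.transform(test_features)

# Train on the reduced training features and predict on the reduced test features.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(train_reduced, train_class)
result = forest.predict(test_reduced)
print(train_reduced.shape, test_reduced.shape, result)
```

If the dense array is too large to hold in memory, sklearn.decomposition.TruncatedSVD accepts sparse matrices directly and could be tried in place of PCA.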