I have two .csv files containing tweets and a classification for each tweet: pos, neg, and neutral. class is the classification and text is the tweet.
Here is my code:
def prediction():
    print("Reading files...")
    # Will learn from this data set.
    train = file2SentencesArray('twitter-sanders-apple3')
    # Test dataset.
    test = file2SentencesArray('twitter-sanders-apple2')
    print("Complete!")

    print("Cleaning sentences...")
    # cleanSentences will remove html, stop words and some characters.
    cleanTrainSentences = cleanSentences(train["text"])
    cleanTestSentences = cleanSentences(test["text"])
    print("Complete!...")

    print("Fitting sentences...")
    vectorizer = CountVectorizer(analyzer="word", tokenizer=None, preprocessor=None, stop_words=None, max_features=5000)
    trainDataFeatures = vectorizer.fit_transform(cleanTrainSentences)
    np.asarray(trainDataFeatures)
    testDataFeatures = vectorizer.transform(cleanTestSentences)
    np.asarray(testDataFeatures)

    # Getting error here.
    randomized_lasso = RandomizedLasso()
    randomized_lasso.fit_transform(trainDataFeatures, testDataFeatures)
    trainDataFeatures = randomized_lasso.transform(trainDataFeatures)

    # and here.
    #pca = decomposition.PCA(n_components=2)
    #pca.fit_transform(trainDataFeatures)
    #trainDataFeatures = pca.transform(trainDataFeatures)
    print("Complete!")

    print("Predicting...")
    forest = RandomForestClassifier(n_estimators=100)
    forest = forest.fit(trainDataFeatures, train["class"])
    result = forest.predict(testDataFeatures)
    print("Complete...")
    return result
Both Randomized Lasso and PCA throw exceptions:
PCA - PCA does not support sparse input.
Randomized Lasso - bad input shape
My trainDataFeatures looks like this:
(0, 573) 1
(0, 1411) 2
(0, 2748) 1
(0, 1073) 1
(1, 126) 1
(2, 1203) 1
Answer 0 (score: 0)
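The input to PCA and Randomized Lasso is not in the right format. Both need a dense array, and np.asarray neither densifies a SciPy sparse matrix nor modifies it in place, so the two calls have no effect. Replace the following two lines and try again.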
np.asarray(trainDataFeatures)
np.asarray(testDataFeatures)
# replace the above two lines with these
trainDataFeatures = trainDataFeatures.toarray()
testDataFeatures = testDataFeatures.toarray()
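
For context, here is a minimal sketch of the dense-conversion fix applied to the PCA branch. The tweets and labels are hypothetical stand-ins for the .csv data, and the names train_text, train_class, and test_text are made up for illustration. Two further points from the question's code are worth hedging on. First, the second argument to RandomizedLasso's fit is expected to be the target values, so passing testDataFeatures there is a likely cause of "bad input shape"; that class has also been removed from recent scikit-learn releases, so it is left out of this sketch. Second, the test features have to go through the same fitted reducer as the training features, otherwise forest.predict receives a different number of columns than the forest was trained on.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in data; the real code reads these from the .csv files.
train_text = [
    "love the new phone", "hate the battery life", "the phone is okay",
    "love this screen", "hate this update", "the update is okay",
]
train_class = ["pos", "neg", "neutral", "pos", "neg", "neutral"]
test_text = ["love the battery", "hate the screen"]

vectorizer = CountVectorizer(analyzer="word", max_features=5000)
train_sparse = vectorizer.fit_transform(train_text)  # SciPy sparse matrix
test_sparse = vectorizer.transform(test_text)

# toarray() returns a new dense ndarray, so the result must be assigned.
train_dense = train_sparse.toarray()
test_dense = test_sparse.toarray()

# Fit the reducer on the training data only, then reuse it for the test data.
pca = PCA(n_components=2)
train_reduced = pca.fit_transform(train_dense)
test_reduced = pca.transform(test_dense)

# Train on the reduced features and the class labels, then predict.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(train_reduced, train_class)
result = forest.predict(test_reduced)
print(result)
```

With a 5000-column bag-of-words matrix, densifying can use a lot of memory. If that becomes a problem, sklearn.decomposition.TruncatedSVD accepts sparse input directly and may be worth trying in place of PCA.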