How do I count the total number of positive and negative sentences in a dataset?

Asked: 2018-06-13 23:33:52

Tags: python-3.x scikit-learn

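I ran my classifier on a test dataset and want to get the total number of positive sentences and negative sentences from it. How can I count the totals for the positive and negative classes?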

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Load the test reviews; each subfolder name becomes a class label
moviedirt = r'C:\Users\premier\Downloads\Reviews\test'
movie_test = load_files(moviedirt, shuffle=True)
movie_test.target_names
movie_test.data[0:10000]

# Pipeline for feature extraction and classification
pipeline = Pipeline([
    ('vect', CountVectorizer(stop_words='english')),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB(fit_prior=False)),
])
# movie_train is loaded elsewhere (not shown in this snippet)
clf = pipeline.fit(movie_train.data, movie_train.target)  # train the classifier
predict1 = clf.predict(movie_test.data)
for review, category in zip(movie_test.data, predict1):
    print('%r => %s' % (review, movie_train.target_names[category]))

This is the complete test code. Here is the output:

b"Don't hate Heather Graham because she's beautiful, hate her because she's 
fun to watch in this movie. Like the hip clothing and funky surroundings, the 
actors in this flick work well together. Casey Affleck is hysterical and 
Heather Graham literally lights up the screen. The minor characters - Goran 
Visnjic {sigh} and Patricia Velazquez are as TALENTED as they are gorgeous. 
Congratulations Miramax & Director Lisa Krueger!" => pos

b'I don\'t know how this movie has received so many positive comments. One 
can call it "artistic" and "beautifully filmed", but those things don\'t make 
up for the empty plot that was filled with sexual innuendos. I wish I had not 
wasted my time to watch this movie. Rather than being biographical, it was a 
poor excuse for promoting strange and lewd behavior. It was just another 
Hollywood attempt to convince us that that kind of life is normal and OK. 
From the very beginning I asked my self what was the point of this movie,and 
I continued watching, hoping that it would change and was quite disappointed 
that it continued in the same vein. I am so glad I did not spend the money to 
see this in a theater!' => neg

1 answer:

Answer 0 (score: 0)

import numpy as np

# Number of pos/neg samples in your training set
print(np.unique(movie_train.target, return_counts=True))

# Number of pos/neg samples in your predictions
print(np.unique(predict1, return_counts=True))
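
Each call above prints a tuple of two arrays: the distinct integer labels and how many times each occurs. A minimal sketch of turning that into named totals follows. The short `target_names` list and `predict1` array here are made-up stand-ins for `movie_train.target_names` and the real predictions, so the numbers are illustrative only.

```python
import numpy as np

# Hypothetical stand-ins for movie_train.target_names and predict1
target_names = ['neg', 'pos']
predict1 = np.array([1, 0, 1, 1, 0, 1])

# np.unique returns the sorted distinct labels and how often each occurs
labels, counts = np.unique(predict1, return_counts=True)

# Map each integer label back to its class name
totals = {target_names[label]: int(count) for label, count in zip(labels, counts)}
print(totals)  # {'neg': 2, 'pos': 4}

# Alternative: np.bincount also reports a zero for classes that were never predicted
all_counts = np.bincount(predict1, minlength=len(target_names))
```

With the real data, replacing the stand-ins with `movie_train.target_names` and `predict1` should give the predicted totals per class. Passing `movie_test.target` instead of `predict1` would count the true labels of the test set, assuming the test folder uses the same class subfolders as the training folder.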