I took some positive and negative movie reviews to understand how CountVectorizer works after preprocessing. Here is the corpus:
x = ['I like this movie awesome',
'never saw movie like this',
'This is my favorite movie',
'Script is marvellous',
'This is one of the worst movie',
'I hate the movie',
'Bad script',
'2 hours is wasted',
'Nothing new in the movie']
First I converted everything to lowercase:
X = []
for string in x:
    string = string.lower()
    X.append(string)
['i like this movie awesome',
'never saw movie like this',
'this is my favorite movie',
'script is marvellous',
'this is one of the worst movie',
'i hate the movie',
'bad script',
'2 hours is wasted',
'nothing new in the movie']
After applying CountVectorizer, I get this array:
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer()
x = vect.fit_transform(X).toarray()   # X holds the lowercased strings
print(vect.get_feature_names())
print(x)
['awesome',
'bad',
'favorite',
'hate',
'hours',
'in',
'is',
'like',
'marvellous',
'movie',
'my',
'never',
'new',
'nothing',
'of',
'one',
'saw',
'script',
'the',
'this',
'wasted',
'worst']
array([[1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0],
       [0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
       [0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1],
       [0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
       [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
       [0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0]],
      dtype=int64)
Each row of the array is one of the original documents (strings), each column is a feature (word), and each element is the count of that word in that document. You can check this: summing each column gives the total number of occurrences of that word across the corpus.
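As a quick sanity check (a minimal sketch, assuming x still holds the count array and vect the fitted vectorizer from above):

# Column sums give the total occurrences of each vocabulary word
totals = x.sum(axis=0)
for word, total in zip(vect.get_feature_names(), totals):
    print(word, total)
# e.g. 'movie' totals 6 and 'is' totals 4 over the nine documents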
But after tokenization, stemming and lemmatization:
from nltk.tokenize import word_tokenize

x = []
for string in X:
    x.append(word_tokenize(string))
[['i', 'like', 'this', 'movie', 'awesome'],
['never', 'saw', 'movie', 'like', 'this'],
['this', 'is', 'my', 'favorite', 'movie'],
['script', 'is', 'marvellous'],
['this', 'is', 'one', 'of', 'the', 'worst', 'movie'],
['i', 'hate', 'the', 'movie'],
['bad', 'script'],
['2', 'hours', 'is', 'wasted'],
['nothing', 'new', 'in', 'the', 'movie']]
Stemming and lemmatization produce similar lists of tokens.
Feeding this input to CountVectorizer raises an error, which is understandable, since fit_transform expects an iterable of strings, not lists of tokens.
Do I have to convert the inner lists back into documents (strings) again?
If so, then the benefit of preprocessing is that the reduced root words, rather than the original words, become the features for CountVectorizer.
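For reference, joining the token lists back into strings seems to be the straightforward fix (a minimal sketch, assuming x holds the token lists from above); alternatively, I believe CountVectorizer accepts a callable analyzer, which lets it consume pre-tokenized documents directly:

from sklearn.feature_extraction.text import CountVectorizer

# Option 1: join each token list back into a whitespace-separated string
docs = [' '.join(tokens) for tokens in x]
counts = CountVectorizer().fit_transform(docs).toarray()

# Option 2: pass the token lists straight in through a callable analyzer,
# so no joining is needed
vect = CountVectorizer(analyzer=lambda tokens: tokens)
counts = vect.fit_transform(x).toarray()

Note that with a callable analyzer, CountVectorizer skips its own lowercasing and token filtering, so tokens like 'i' and '2' stay in the vocabulary.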
After applying the preprocessing:
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
word_lem = WordNetLemmatizer()
x_lem = []
for document in x:                  # x holds the tokenized documents
    temp = ''
    for word in document:
        # stem first, then lemmatize; this gives forms like 'movi', 'thi'
        temp = temp + ' ' + word_lem.lemmatize(stemmer.stem(word))
    x_lem.append(temp)
x = x_lem
[' i like thi movi awesom',
' never saw movi like thi',
' thi is my favorit movi',
' script is marvel',
' thi is one of the worst movi',
' i hate the movi',
' bad script',
' 2 hour is wast',
' noth new in the movi']
This is the output I get.
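Feeding these strings back into CountVectorizer then works again (a minimal sketch, assuming x now holds the preprocessed strings from above), and the reduced forms become the features:

from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer()
counts = vect.fit_transform(x).toarray()
print(vect.get_feature_names())   # features are now 'movi', 'thi', 'marvel', ...
print(counts)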