Sklearn CountVectorizer feature elements

Date: 2020-07-10 07:10:46

Tags: python pandas data-science data-analysis countvectorizer

I'm having a problem with CountVectorizer: it merges words together. What am I missing?

Here is what I did:

# UCI SMS Spam dataset
raw = open('smsspam').read()

# Turn tabs into newlines so labels and messages alternate, then drop empty lines
corpse = raw.replace('\t', '\n').split('\n')
parsed = [i for i in corpse if i]

#Select features
labels = parsed[0::2]
msg = parsed[1::2]
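
The `replace`/`split` parsing above relies on labels and messages alternating strictly, which would shift if a message ever contained a tab. A minimal sketch of a per-line alternative, assuming each non-empty line has the form label, tab, message (the two sample lines below are made up for illustration and are not taken from the dataset):

```python
# Made-up sample in the assumed "label<TAB>message" layout
raw = "ham\tOk lar... Joking wif u oni\nspam\tFree entry in 2 a wkly comp\n"

labels, msg = [], []
for line in raw.splitlines():
    if not line:
        continue  # skip empty lines
    # Split on the first tab only, so any later tab stays inside the message
    label, text = line.split("\t", 1)
    labels.append(label)
    msg.append(text)

print(labels)
print(msg)
```

This keeps each label paired with its own message instead of depending on even/odd positions in one flat list.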

import pandas as pd
pd.set_option('display.max_colwidth', 100)
df = pd.DataFrame({'Label' : labels,'SMS' : msg})

#LEMMATIZING

import string
import re
import nltk

stopwords = nltk.corpus.stopwords.words('english')
wn = nltk.WordNetLemmatizer()
wn.lemmatize('word')

def lemmatizer(text):
    # Drop punctuation, split on non-word characters, remove stop words, lemmatize
    text = "".join([i for i in text if i not in string.punctuation])
    text = re.split(r'\W+', text)
    text = [i for i in text if i not in stopwords]
    text = [wn.lemmatize(i) for i in text]
    return text

df['Stems'] = df['SMS'].apply(lambda x: lemmatizer(x.lower()))
df.head(10)
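
One thing worth checking in `lemmatizer`: deleting punctuation before splitting joins any two words that were separated only by a punctuation mark. A small sketch of that effect, using a made-up message and leaving out the NLTK steps so it runs on its own:

```python
import re
import string

text = "free entry,call now"  # made-up message with no space after the comma

# Approach from the question: delete punctuation, then split on non-word runs
stripped = "".join(ch for ch in text if ch not in string.punctuation)
deleted_first = re.split(r"\W+", stripped)

# Alternative: split directly, so punctuation acts as a separator
split_direct = [tok for tok in re.split(r"\W+", text) if tok]

print(deleted_first)  # "entry" and "call" end up as one token
print(split_direct)   # the comma separates the two words
```

Whether this matters depends on how often messages in the data omit the space after punctuation.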

Screenshot:

https://ibb.co/fMSxFw8

#Vectorization
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(analyzer=lemmatizer)
x = cv.fit_transform(df['Stems'])

df2 = pd.DataFrame(x.toarray(), columns=cv.get_feature_names())
df2.tail()
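
A likely cause, sketched under assumptions: `CountVectorizer` calls a callable `analyzer` on each item it receives, and the items in `df['Stems']` are already token lists. `lemmatizer` therefore runs a second time, and its `"".join(...)` step glues each list into a single string. In the sketch below, `simple_analyzer` is a hypothetical stand-in for `lemmatizer` (stop-word removal and lemmatization are omitted so it runs without NLTK), and the message is made up:

```python
import re
import string

# Hypothetical stand-in for the question's lemmatizer: same join-then-split
# steps, without the NLTK stop-word filtering and lemmatization.
def simple_analyzer(text):
    text = "".join([i for i in text if i not in string.punctuation])
    return [tok for tok in re.split(r"\W+", text) if tok]

sms = "ok lar... joking wif u oni"    # made-up example message

tokens = simple_analyzer(sms)         # first pass: the input is a string
merged = simple_analyzer(tokens)      # second pass: the input is a token list

print(tokens)   # separate words
print(merged)   # a single token made of all the words concatenated
```

If this is what is happening, fitting on `df['SMS']` instead of `df['Stems']` should keep the words separate. Keeping `df['Stems']` and passing an identity analyzer such as `analyzer=lambda tokens: tokens` is another option. Note that a callable analyzer bypasses CountVectorizer's own lowercasing, so fitting on `df['SMS']` would need the text lowercased beforehand.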

Screenshot:

https://ibb.co/DKVSzdN

What am I missing here?

0 answers:

No answers yet