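I am building a spam/ham classifier. First, I read all the emails into a list.
Then I use sklearn's CountVectorizer to count the words in all the emails, which gives the following matrix: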
>> print(vector.shape)
>> print(type(vector))
>> print(vector.toarray())
(2551, 48746)
<class 'scipy.sparse.csr.csr_matrix'>
[[2 0 1 ... 0 0 0]
[2 0 1 ... 0 0 0]
[0 0 0 ... 0 0 0]
...
[0 0 0 ... 0 0 0]
[2 1 1 ... 0 0 0]
[2 0 0 ... 0 0 0]]
If I try to convert the matrix to a DataFrame, I get:
>> df_X = pd.DataFrame(vector.toarray())
   0  1  2  3  4  5  6  7  8  ...  48737  48738  48739  48740  48741  48742  48743  48744  48745
0  2  0  1  0  0  0  0  0  0  ...      0      0      0      0      0      0      0      0      0
1  2  0  1  0  0  0  0  0  0  ...      0      0      0      0      0      0      0      0      0
2  0  0  0  0  0  0  0  0  0  ...      0      0      0      0      0      0      0      0      0
3  1  0  0  0  0  0  0  0  0  ...      4      0      0      0      0      0      0      0      0
4  3  0  1  0  0  0  0  0  0  ...      0      0      0      0      0      0      0      0      0
5
The problem is that I want meaningful column names (instead of 0, 1, 2, ..., 48745).
If I run print(vectorizer.vocabulary_), I get:
>> print(vectorizer.vocabulary_)
{u'74282760403': 10172, u'makinglight': 34440, u'localizes': 33864, u'sowell': 43338, u'e4c8b2940d2': 22109, u'juob22381': 32587, u'31c6d68fa597d411b04d00e02965883bd239fb': 7072, u'20020918154734': 5469, u'spiders': 43495, u'ftrain': 24856, u'hanging': 30009, u'woody': 48041, u'000093': 18, u'1a724ef5': 4703, u'05dc347c66': 1771, u'g93ba2f21504': 28071, u'g16mteg13192': 25103, u'7f08f1c2c4': 10578, u'g974xhk18362': 28334, u'g85bc1j10899': 26181,...}
Here is the full code:
import os,glob
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
folder_path = 'easy_ham/'
files_text_arr = []
files_text_arr_y = []
for filename in glob.glob(os.path.join(folder_path, '*')):
    with open(filename, 'r') as f:
        text = f.read()
    files_text_arr.append(text)
    files_text_arr_y.append(0)
vectorizer = CountVectorizer(encoding='latin-1')
vectorizer.fit(files_text_arr)
vector = vectorizer.transform(files_text_arr)
print(vector.shape)
print(type(vector))
print(vector.toarray())
#print(vectorizer.vocabulary_)
df_X = pd.DataFrame(vector.toarray())
df_y = pd.DataFrame({'spam':files_text_arr_y})
print(df_X)
How can I change the column names to the words from the emails?
P.S. I am using the emails from this website.
Answer 0 (score: 2)
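You can use the get_feature_names() method and assign its result to the columns of the DataFrame built from the output of toarray(). (In scikit-learn 1.0 and later the method is named get_feature_names_out(); get_feature_names() was removed in 1.2.)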
from sklearn.feature_extraction.text import CountVectorizer
corpus = [
'This is the first document.',
'This document is the second document.',
'And this is the third one.',
'Is this the first document?',
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names())
print(X.toarray())
Output
[u'and', u'document', u'first', u'is', u'one', u'second', u'the', u'third', u'this']
[[0 1 1 1 0 0 1 0 1]
[0 2 0 1 0 1 1 0 1]
[1 0 0 1 1 0 1 1 1]
[0 1 1 1 0 0 1 0 1]]
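The feature names come back ordered by column index, which is the same index that vocabulary_ stores for each term in the question. A minimal sketch of that relationship follows; the dict is written by hand to mimic vocabulary_ for this small corpus (it is an assumption for illustration, not output captured from scikit-learn), and columns_from_vocabulary is a hypothetical helper name.

```python
# Hand-written stand-in for vectorizer.vocabulary_ (term -> column index).
# Assumption: these indices match the feature order printed above.
vocabulary = {
    "this": 8, "is": 3, "the": 6, "first": 2, "document": 1,
    "second": 5, "and": 0, "third": 7, "one": 4,
}


def columns_from_vocabulary(vocab):
    """Return the terms sorted by the column index they map to."""
    return [term for term, _ in sorted(vocab.items(), key=lambda kv: kv[1])]


columns = columns_from_vocabulary(vocabulary)
print(columns)
```

Sorting by the stored index is only needed when working with the raw dict; the vectorizer's own feature-name method already returns the terms in this order.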
import pandas as pd
df = pd.DataFrame(X.toarray())
df.columns = vectorizer.get_feature_names()
df
Output
   and  document  first  is  one  second  the  third  this
0    0         1      1   1    0       0    1      0     1
1    0         2      0   1    0       1    1      0     1
2    1         0      0   1    1       0    1      1     1
3    0         1      1   1    0       0    1      0     1
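For newer library versions, the same idea can be sketched so that it runs whether or not get_feature_names_out() is available. The hasattr check and the variable names feature_names, df_dense and df_sparse are illustrative choices, not part of the original answer. The sketch also shows pd.DataFrame.sparse.from_spmatrix, which may help with a matrix as large as the one in the question (2551 x 48746, about 124 million cells), since it avoids calling toarray().

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)  # scipy sparse matrix of word counts

# Newer scikit-learn releases provide get_feature_names_out();
# older ones only have get_feature_names().
if hasattr(vectorizer, "get_feature_names_out"):
    feature_names = list(vectorizer.get_feature_names_out())
else:
    feature_names = list(vectorizer.get_feature_names())

# Dense frame: fine for small corpora.
df_dense = pd.DataFrame(X.toarray(), columns=feature_names)

# Sparse-backed frame: built straight from the sparse matrix.
df_sparse = pd.DataFrame.sparse.from_spmatrix(X, columns=feature_names)

print(df_dense)
```

Passing columns= to the constructor does the same job as assigning df.columns afterwards; which one to use is a matter of taste.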