A more efficient way to get various token count statistics from an array and a list

Time: 2017-08-21 10:56:22

Tags: python arrays scikit-learn countvectorizer

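I am classifying spam from a list of email texts (stored in CSV format), but before I do that I want to get some simple statistics from the output. I use CountVectorizer from sklearn as the first step, implemented with the following code: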

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

#import data from csv

spam_data = pd.read_csv('spam.csv')
spam_data['target'] = np.where(spam_data['target']=='spam',1,0)

#split data

X_train, X_test, y_train, y_test = train_test_split(spam_data['text'], spam_data['target'], random_state=0) 

#convert 'features' to numeric and then to matrix or list
cv = CountVectorizer()
x_traincv = cv.fit_transform(X_train)
a = x_traincv.toarray()
a_list = cv.inverse_transform(a)

The output is stored either as a matrix (named "a") or as a list of arrays (named "a_list"), which looks like this:

[array(['do', 'I', 'off', 'text', 'where', 'you'], 
       dtype='<U32'),
 array(['ages', 'will', 'did', 'driving', 'have', 'hello', 'hi', 'hol', 'in', 'its', 'just', 'mate', 'message', 'nice', 'off', 'roads', 'say', 'sent', 'so', 'started', 'stay'], dtype='<U32'),      
       ...
 array(['biz', 'for', 'free', 'is', '1991', 'network', 'operator', 'service', 'the', 'visit'], dtype='<U32')]

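But I find it somewhat difficult to get simple count statistics from these outputs, such as the longest/shortest token, the average token length, and so on. How can I get these simple count statistics from the matrix or list output that I generated?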

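1 Answer:

Answer 0 (score: 2)

You can load the tokens, token counts, and token lengths into a new Pandas DataFrame and then run custom queries against it.

Here is a simple example on a toy dataset.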

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

texts = ["dog cat fish","dog cat cat","fish bird walrus monkey","bird lizard"]

cv = CountVectorizer()
cv_fit = cv.fit_transform(texts)
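# https://stackoverflow.com/a/16078639/2491761
# get_feature_names_out() requires scikit-learn >= 1.0; older releases use get_feature_names()
tokens_and_counts = list(zip(cv.get_feature_names_out(), np.asarray(cv_fit.sum(axis=0)).ravel()))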

df = pd.DataFrame(tokens_and_counts, columns=['token', 'count'])

df['length'] = df.token.str.len() # https://stackoverflow.com/a/29869577/2491761

# all the tokens with length equal to min token length:
df.loc[df['length'] == df['length'].min(), 'token']

# all the tokens with length equal to max token length:
df.loc[df['length'] == df['length'].max(), 'token']

# all tokens with length less than mean token length:
df.loc[df['length'] < df['length'].mean(), 'token']

# all tokens with length more than 1 standard deviation above the mean:
df.loc[df['length'] > df['length'].mean() + df['length'].std(), 'token']

This can easily be extended if you want to query based on the counts.
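
As a rough illustration of count-based queries, here is a minimal sketch on the same toy texts. To keep it self-contained it builds the counts with collections.Counter and a whitespace split, which is only an assumed stand-in for the CountVectorizer totals (the two agree on this toy data, but not in general). The names most_frequent, singletons, and weighted_mean_length are hypothetical and chosen just for this example.

```python
from collections import Counter

import pandas as pd

texts = ["dog cat fish", "dog cat cat", "fish bird walrus monkey", "bird lizard"]

# Assumed stand-in for the CountVectorizer totals: a plain whitespace split
# is enough for this toy data, but it is not the vectorizer's real tokenizer.
counts = Counter(token for text in texts for token in text.split())

# Same three columns as the DataFrame above: token, count, length
df = pd.DataFrame(sorted(counts.items()), columns=['token', 'count'])
df['length'] = df['token'].str.len()

# all the tokens with count equal to the max count:
most_frequent = df.loc[df['count'] == df['count'].max(), 'token'].tolist()

# all the tokens that occur exactly once in the corpus:
singletons = df.loc[df['count'] == 1, 'token'].tolist()

# mean token length over all occurrences (each length weighted by its count):
weighted_mean_length = (df['length'] * df['count']).sum() / df['count'].sum()

print(most_frequent)
print(singletons)
print(weighted_mean_length)
```

The weighted mean differs from df['length'].mean() because the latter treats every distinct token once, regardless of how often it appears.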