I'm currently trying to process the lingspam dataset by counting word occurrences across 600 files (400 emails and 200 spam messages). I've used the Porter Stemmer algorithm to reduce each word to a common stem, and I'd also like my results normalized across files for further processing, but I'm not sure how to do that.

What I have so far:

To get the output below, I need to be able to add items that may not exist in a given file, sorted in ascending order.
printing from ./../lingspam_results/spmsgb164.txt.out
[('money', 0, 'univers', 0, 'sales', 0)]
printing from ./../lingspam_results/spmsgb166.txt.out
[('money', 2, 'univers', 0, 'sales', 0)]
printing from ./../lingspam_results/spmsgb167.txt.out
[('money', 0, 'univers', 0, 'sales', 1)]
Which I then plan to convert into numpy vectors:
[0,0,0]
[2,0,0]
[0,0,0]
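The mapping from per-file word counts to those fixed-length vectors can be sketched as follows. This is my own minimal sketch, not code from the post; the helper name `file_vector` and the sample input are illustrative:

```python
import numpy as np
from collections import Counter

search_list = ['money', 'univers', 'sales']

def file_vector(words, search_list):
    # Count only the words of interest; a Counter returns 0 for missing keys,
    # so every position in the output vector is defined even for absent words.
    counts = Counter(w for w in words if w in search_list)
    return np.array([counts[w] for w in search_list])

print(file_vector(['money', 'money', 'offer'], search_list))
```

Because the vector is built by iterating over `search_list`, the column order is identical for every file, which is the property the question is after.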
Instead of:
printing from ./../lingspam_results/spmsgb165.txt.out
[]
printing from ./../lingspam_results/spmsgb166.txt.out
[('univers', 2)]
printing from ./../lingspam_results/spmsgb167.txt.out
[('sale', 1)]
How can I normalize the results from the Counter module into ascending order (while also adding to the Counter results any items from my search_list that may not be present)? I've tried the code below, which simply reads each text file and builds a list based on search_list.
import os
from collections import Counter

def parse_bag(directory, search_list):
    for (dirpath, dirnames, filenames) in os.walk(directory):
        for f in filenames:
            path = os.path.join(dirpath, f)
            count_words(path, search_list)

def count_words(filename, search_list):
    textwords = open(filename, 'r').read().split()
    filteredwords = [t for t in textwords if t in search_list]
    wordfreq = Counter(filteredwords).most_common(5)
    print("printing from " + filename)
    print(wordfreq)

search_list = ['sale', 'univers', 'money']
parse_bag("./../lingspam_results", search_list)
Thanks

Answer 0 (score: 3)

From your question, it sounds like your requirement is to have the same words, in a consistent order, across all files. This should work for you:
def count_words(filename, search_list):
    textwords = open(filename, 'r').read().split()
    filteredwords = [t for t in textwords if t in search_list]
    counter = Counter(filteredwords)
    for w in search_list:
        counter[w] += 0  # ensure every search word exists, even with count 0
    wordfreq = sorted(counter.items())
    print("printing from " + filename)
    print(wordfreq)

search_list = ['sale', 'univers', 'money']
Sample output:
printing from ./../lingspam_results/spmsgb164.txt.out
[('money', 0), ('sale', 0), ('univers', 0)]
printing from ./../lingspam_results/spmsgb166.txt.out
[('money', 2), ('sale', 0), ('univers', 0)]
printing from ./../lingspam_results/spmsgb167.txt.out
[('money', 0), ('sale', 1), ('univers', 0)]
I don't think you want to use most_common at all, since you specifically don't want each file's contents to affect the ordering or the length of the list.
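To see that difference concretely, here is a quick comparison on a hand-built counter of my own (the count mirrors the spmsgb166 example from the question):

```python
from collections import Counter

search_list = ['sale', 'univers', 'money']
c = Counter({'univers': 2})      # only one of the three search words appeared
print(c.most_common(5))          # length and order depend on the file's contents
for w in search_list:
    c[w] += 0                    # force every search word to exist
print(sorted(c.items()))         # fixed length, fixed alphabetical order
```

The first print yields only `[('univers', 2)]`, while the second always produces one entry per search word, in the same order for every file.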
Answer 1 (score: 1)

The call Counter(filteredwords) you use in your example does count all the words, just as you want. What it won't do, absent the most_common method, is give you the most frequent words. For that you have to re-process all the items in the counter so that you have a sequence of (frequency, word) tuples, and sort it:

def most_common(counter, n=5):
    freq = sorted(((value, item) for item, value in counter.items()), reverse=True)
    return [item[1] for item in freq[:n]]
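On a small hand-made counter (my own sample data, not from the thread), this ordering function behaves as follows:

```python
from collections import Counter

def most_common(counter, n=5):
    # Build (frequency, word) tuples so the sort is by count, descending.
    freq = sorted(((value, item) for item, value in counter.items()), reverse=True)
    return [item[1] for item in freq[:n]]

c = Counter(['money', 'money', 'sale'])
print(most_common(c, 2))  # ['money', 'sale']
```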
Answer 2 (score: 1)

A combination of jsbueno's and Mu Mind's answers:
def count_words_SO(filename, search_list):
    textwords = open(filename, 'r').read().split()
    filteredwords = [t for t in textwords if t in search_list]
    counter = Counter(filteredwords)
    for w in search_list:
        counter[w] += 0  # ensure every search word exists
    wordfreq = number_parse(counter)
    print("printing from " + filename)
    print(wordfreq)

def number_parse(counter, n=5):
    freq = sorted(((value, item) for item, value in counter.items()), reverse=True)
    return [item[0] for item in freq[:n]]
With a little more work from here, I'll be ready for the Neural Network. Thanks everyone :)
printing from ./../lingspam_results/spmsgb19.txt.out
[0, 0, 0]
printing from ./../lingspam_results/spmsgb2.txt.out
[4, 0, 0]
printing from ./../lingspam_results/spmsgb20.txt.out
[10, 0, 0]
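As a final step the thread only hints at, the per-file lists printed above can be stacked into a single numpy matrix to feed a network. A minimal sketch, using the three illustrative rows from the output (the variable names `rows` and `X` are my own):

```python
import numpy as np

# Per-file count vectors in a fixed word order, as printed above.
rows = [[0, 0, 0],
        [4, 0, 0],
        [10, 0, 0]]
X = np.array(rows)   # one row per file, one column per search word
print(X.shape)       # (3, 3)
```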