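In this code, when the term_frequency matrix is normalized, I create a new matrix called tf_normalize. My training data is very large, though, and this causes a memory error. Can anyone help me with how to store the normalized vectors (tf_normalize) in the first matrix I built, i.e. term_frequency?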
import math
import nltk
from nltk import stem
from nltk.corpus import stopwords

stop = stopwords.words('english')
stemmer = stem.PorterStemmer()
word_list = {}

# counts every stemmed, non-stopword token in the training data
with open('train_data.txt', 'r') as traindata:
    for line in traindata:
        words = line.split()
        for w in words:
            w = stemmer.stem(w)
            if w not in stop:
                try:
                    word_list[w] += 1
                except KeyError:
                    word_list[w] = 1
## print(word_list)
print(len(word_list.keys()))
List_of_word_list = list(word_list.keys())
## print(List_of_word_list)

# creates the tf matrix: one row per line, one column per vocabulary word
term_frequency = []
with open('train_data.txt', 'r') as traindata:
    for line in traindata:
        words = line.split()
        vocabulary = []
        for w in List_of_word_list:
            vocabulary.append(words.count(w))
        term_frequency.append(vocabulary)
## print(term_frequency)
print(len(term_frequency))

## calculates the magnitude of a vector
def magnitude(v):
    return math.sqrt(sum(v[i]*v[i] for i in range(len(v))))

## normalizes a vector (raises ZeroDivisionError for an all-zero row)
def normalize(v):
    vmag = magnitude(v)
    return [v[i]/vmag for i in range(len(v))]

tf_normalize = []
for vector in term_frequency:
    tf_normalize.append(normalize(vector))
print(tf_normalize)
for t in tf_normalize:
    print(magnitude(t))
Answer 0 (score: 0)
Late edit: you refer to your corpus as traindata and to the desktop file data.txt as traindata as well, which is confusing. I'll assume the latter is termdata.
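You build term_frequency as a list and append to it:
...
for line in traindata:
    words=line.split()
    vocabulary=[]
    for w in List_of_word_list:
        vocabulary.append(words.count(w))
    term_frequency.append(vocabulary)
...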
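So term_frequency has one entry for each line of your termdata (which you call traindata), and each entry is a list as long as your List_of_word_list. Is that what you intended? Maybe it is, but it does not make your data any more compact. My guess is that you want the total count of each corpus word in your termdata. Wouldn't it be more efficient if term_frequency were a dictionary holding counts only for the corpus words actually found in termdata, so that words absent from your termdata, whose count is zero, never appear in term_frequency? Also, shouldn't you be stemming the words in your termdata?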
Something like:
term_frequency = {}
for line in termdata:
    words = line.split()
    for w in words:
        # assuming that words from termdata should be stemmed
        stemmedword = stemmer.stem(w)
        if stemmedword in List_of_word_list:
            try:
                term_frequency[stemmedword] += 1
            except KeyError:
                term_frequency[stemmedword] = 1
Or, if you really want an entry in term_frequency for each line of termdata, keep term_frequency as a list but make each entry a dictionary of the words in that line that are also in List_of_word_list:
term_frequency = []
for line in termdata:
    line_frequency = {}
    words = line.split()
    for w in words:
        # assuming that words from termdata should be stemmed
        stemmedword = stemmer.stem(w)
        if stemmedword in List_of_word_list:
            try:
                line_frequency[stemmedword] += 1
            except KeyError:
                line_frequency[stemmedword] = 1
    term_frequency.append(line_frequency)
Normalizing with dictionaries would follow the same pattern, working only on the words stored for each line.
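As a rough illustration of normalizing per-line dictionaries without a second matrix, here is a minimal sketch. It makes several assumptions: `count_line` and `normalize_in_place` are hypothetical helper names, the toy `vocabulary` and `lines` stand in for List_of_word_list and the lines of the data file, and the identity `stem` default is only a placeholder for `stemmer.stem`. Each dictionary is overwritten with its normalized values, so no tf_normalize copy is built.

```python
import math
from collections import Counter


def count_line(line, vocabulary, stem=lambda w: w):
    """Return {word: count} for the vocabulary words that occur in one line.

    `stem` is a placeholder; pass stemmer.stem to match a stemmed vocabulary.
    """
    counts = Counter(stem(w) for w in line.split())
    return {w: c for w, c in counts.items() if w in vocabulary}


def normalize_in_place(line_frequency):
    """Scale one sparse vector to unit length, overwriting its counts."""
    mag = math.sqrt(sum(c * c for c in line_frequency.values()))
    if mag == 0:
        # Empty line or no vocabulary words: nothing to scale.
        return
    for w in line_frequency:
        line_frequency[w] /= mag


# Toy stand-ins (assumptions) for the vocabulary and the data file's lines.
vocabulary = {"cat", "dog", "fish"}
lines = ["cat dog cat", "bird bird", "fish"]

term_frequency = [count_line(line, vocabulary) for line in lines]
for line_frequency in term_frequency:
    # Overwrite the counts instead of appending to a second matrix.
    normalize_in_place(line_frequency)
```

Keeping the vocabulary in a set rather than a list makes each `in` test a hash lookup instead of a scan, which matters when the vocabulary is large. Words with a zero count are simply absent from each dictionary, so memory grows with the number of distinct words per line rather than with the vocabulary size.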
Good luck.