How to save the result in the previous matrix instead of creating a new matrix

Date: 2015-07-03 09:53:23

Tags: python

In this code, when the term_frequency matrix is normalized I create a new matrix called tf_normalize, but my training data is very large and this causes a memory error. Can anyone help me store the normalized vectors (tf_normalize) in the first matrix I made, i.e. term_frequency:

    import math

    import nltk
    from nltk import stem
    from nltk.corpus import stopwords

    stop = stopwords.words('english')
    stemmer = stem.PorterStemmer()

    # count stemmed, non-stopword tokens across the training data
    word_list = {}
    with open('train_data.txt', 'r') as traindata:
        for line in traindata:
            words = line.split()
            for w in words:
                w = stemmer.stem(w)
                if w not in stop:
                    try:
                        word_list[w] += 1
                    except KeyError:
                        word_list[w] = 1

    ##print(word_list)
    print(len(word_list.keys()))
    List_of_word_list = list(word_list.keys())
    ##print(List_of_word_list)

    # creates the tf matrix
    term_frequency = []
    with open('train_data.txt', 'r') as traindata:
        for line in traindata:
            words = line.split()
            vocabulary = []
            for w in List_of_word_list:
                vocabulary.append(words.count(w))
            term_frequency.append(vocabulary)
    ##print(term_frequency)
    print(len(term_frequency))

    ## calculates the magnitude of a vector of the term_frequency matrix
    def magnitude(v):
        return math.sqrt(sum(v[i] * v[i] for i in range(len(v))))

    ## normalizes a vector of the term_frequency matrix
    def normalize(v):
        vmag = magnitude(v)
        return [v[i] / vmag for i in range(len(v))]

    tf_normalize = []
    for vector in term_frequency:
        tf_normalize.append(normalize(vector))
    print(tf_normalize)

    for t in tf_normalize:
        print(magnitude(t))
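In other words, the goal is to overwrite each row of term_frequency with its normalized values rather than appending them to a new tf_normalize list. A minimal sketch of that in-place update (using hypothetical toy data, not the real training file):

```python
import math

# Hypothetical small matrix standing in for the real term_frequency.
term_frequency = [[3, 4, 0], [0, 0, 5]]

# Overwrite each row's entries in place; no second matrix is allocated.
for v in term_frequency:
    vmag = math.sqrt(sum(x * x for x in v))
    if vmag:  # guard against all-zero rows
        for j in range(len(v)):
            v[j] = v[j] / vmag
```

Since every row is rewritten where it already lives, memory use stays at one matrix.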

1 Answer:

Answer 0 (score: 0)

Late edit: you refer to your corpus as traindata and also to the file data.txt as traindata, which is confusing. I'll take the latter to be termdata.

You build term_frequency as a list and append to it:

    ...
    for line in traindata:
      words=line.split()
      vocabulary=[]
      for w in List_of_word_list:
         vocabulary.append(words.count(w))
      term_frequency.append(vocabulary)
    ...

So you have an entry in term_frequency for each line of your term data (which you call traindata), and that entry is a list as long as List_of_word_list. Is that what you meant? Perhaps it is, but it doesn't make your data any leaner. I would guess you want the total counts, across your corpus, of the words in your term data. Wouldn't it be more efficient if term_frequency were a dictionary holding counts only for the words from your term data that are actually found in the corpus, so that words not in your term data never appear in term_frequency at all (their count being zero)? Also, shouldn't you be stemming the words in your term data?

Something like:

    term_frequency = {}
    for line in termdata:
      words = line.split()
      for w in words:
        # assuming that words from termdata should be stemmed
        stemmedword = stemmer.stem(w)
        if stemmedword in List_of_word_list:
          try:
            term_frequency[stemmedword] += 1
          except KeyError:
            term_frequency[stemmedword] = 1

Or, if you really want one entry in term_frequency for each line of termdata, keep term_frequency as a list, but make each entry a dictionary of the words on that line that are also in List_of_word_list:

    term_frequency = []
    for line in termdata:
      line_frequency = {}
      words = line.split()
      for w in words:
        # assuming that words from termdata should be stemmed
        stemmedword = stemmer.stem(w)
        if stemmedword in List_of_word_list:
          try:
            line_frequency[stemmedword] += 1
          except KeyError:
            line_frequency[stemmedword] = 1
      term_frequency.append(line_frequency)

Normalization using these dictionaries would look similar (untested).
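A minimal sketch of that per-line normalization (my own, untested against the real data, assuming term_frequency is the list of per-line count dictionaries built above) that divides each count by its row's magnitude in place, so no second matrix is needed:

```python
import math

# Hypothetical per-line count dictionaries standing in for the real term_frequency.
term_frequency = [{'cat': 3, 'dog': 4}, {'fish': 2}]

# Normalize each row in place: divide every count by the row's Euclidean magnitude.
for line_frequency in term_frequency:
    vmag = math.sqrt(sum(c * c for c in line_frequency.values()))
    if vmag:  # skip empty lines to avoid division by zero
        for w in line_frequency:
            line_frequency[w] = line_frequency[w] / vmag
```

Because each row is overwritten rather than copied into a new structure, this also addresses the memory concern raised in the question.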
Good luck.