Storing topic model keywords in a list by maximum number of occurrences

Date: 2018-11-10 18:21:02

Tags: python python-3.x

I am doing topic modeling and use the function below to get the top keywords of each topic.

def getTopKWords(self, K):
    """
    Returns the top K discriminative words for each topic t,
    i.e. the words v for which p(v|t) is maximum.
    """
    results = []

    # Normalize each topic column of the counts so it sums to 1
    pseudocounts = np.copy(self.n_vt)
    normalizer = np.sum(pseudocounts, (0))
    pseudocounts /= normalizer[np.newaxis, :]
    for t in range(self.numTopics):
        # Indices of the K largest values in column t, largest first
        topWordIndices = pseudocounts[:, t].argsort()[-1:-(K+1):-1]
        vocab = self.vectorizer.get_feature_names()
        print(t, [vocab[i] for i in topWordIndices])
    ## Code for storing the values in a single list
    return results

The function above gives me the following output:

0 ['computer', 'laptop', 'mac', 'use', 'bought', 'like', 'warranty', 'screen', 'way', 'just']
1 ['laptop', 'computer', 'use', 'just', 'like', 'time', 'great', 'windows', 'macbook', 'months']
2 ['computer', 'great', 'laptop', 'mac', 'buy', 'just', 'macbook', 'use', 'pro', 'windows']
3 ['laptop', 'computer', 'great', 'time', 'battery', 'use', 'apple', 'love', 'just', 'work']

This is the result of the loop running 4 times, each time printing the topic index and the top keywords from vocab for that topic.
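For reference, the column normalization and the argsort slice used in the loop can be tried in isolation on a toy matrix. This is only a sketch: the count matrix `n_vt`, the vocabulary, and the helper name `top_k_words` below are made up for illustration and stand in for `self.n_vt` and the vectorizer's vocabulary.

```python
import numpy as np

# Hypothetical 5-word x 2-topic count matrix standing in for self.n_vt
n_vt = np.array([[4., 1.],
                 [2., 5.],
                 [9., 0.],
                 [1., 3.],
                 [3., 2.]])
# Made-up vocabulary; row i of n_vt belongs to vocab[i]
vocab = ["apple", "battery", "computer", "laptop", "screen"]

def top_k_words(counts, vocab, k):
    """Return, for each topic column, the k words with the largest share."""
    # Divide every column by its sum so each topic column sums to 1
    probs = counts / counts.sum(axis=0)[np.newaxis, :]
    # argsort is ascending, so [-1:-(k+1):-1] walks the last k indices backwards
    return [[vocab[i] for i in probs[:, t].argsort()[-1:-(k + 1):-1]]
            for t in range(probs.shape[1])]

top = top_k_words(n_vt, vocab, 2)
print(top)  # [['computer', 'apple'], ['battery', 'laptop']]
```

The toy counts avoid ties on purpose, since the order of tied entries in `argsort` is not something to rely on.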

Now I want the function to return a list like the following:

return   [keyword1, keyword2, keyword3, keyword4]

where keyword1/2/3/4 are the words that occur most often across the vocab lists printed for indices 0, 1, 2 and 3.

1 answer:

Answer 0: (score: 1)

You can use collections.Counter:

from collections import Counter

a = ['computer', 'laptop', 'mac', 'use', 'bought', 'like', 
     'warranty', 'screen', 'way', 'just']
b = ['laptop', 'computer', 'use', 'just', 'like', 'time', 
     'great', 'windows', 'macbook', 'months']
c = ['computer', 'great', 'laptop', 'mac', 'buy', 'just', 
     'macbook', 'use', 'pro', 'windows']
d = ['laptop', 'computer', 'great', 'time', 'battery', 'use', 
     'apple', 'love', 'just', 'work']

def get_most_common(*iterables):
    """Accepts iterables, feeds all into Counter and returns the Counter instance"""
    c = Counter()
    for it in iterables:
        c.update(it)  # adds 1 to the count of every element in this iterable
    return c

# get the most common ones 
mc = get_most_common(a,b,c,d).most_common()

# print top 4 keys
top4 = [k for k,v in mc[0:4]]
print (top4)

Output:

['computer', 'laptop', 'use', 'just']

# Inside getTopKWords, replace the loop and the return with:
some_results = []  # one list of top words per topic
for t in range(self.numTopics):
    topWordIndices = pseudocounts[:, t].argsort()[-1:-(K+1):-1]
    vocab = self.vectorizer.get_feature_names()
    topWords = [vocab[i] for i in topWordIndices]
    print(t, topWords)
    some_results.append(topWords)

# Count every word across all topics and return the 4 most frequent ones
mc = get_most_common(*some_results).most_common()
return [k for k, v in mc[0:4]]
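The same counting can probably be written more compactly with `itertools.chain`. The sketch below is one possible variant, not part of the answer above; the helper name `top_n_keywords` and its `n` parameter are made up for illustration, and the word lists are the four lists printed in the question.

```python
from collections import Counter
from itertools import chain

def top_n_keywords(word_lists, n=4):
    """Return the n words that occur most often across all word lists."""
    # Flatten the per-topic lists into one stream of words and count them
    counts = Counter(chain.from_iterable(word_lists))
    # most_common(n) yields (word, count) pairs, highest count first
    return [word for word, _ in counts.most_common(n)]

topics = [
    ['computer', 'laptop', 'mac', 'use', 'bought', 'like',
     'warranty', 'screen', 'way', 'just'],
    ['laptop', 'computer', 'use', 'just', 'like', 'time',
     'great', 'windows', 'macbook', 'months'],
    ['computer', 'great', 'laptop', 'mac', 'buy', 'just',
     'macbook', 'use', 'pro', 'windows'],
    ['laptop', 'computer', 'great', 'time', 'battery', 'use',
     'apple', 'love', 'just', 'work'],
]

result = top_n_keywords(topics)
print(result)  # ['computer', 'laptop', 'use', 'just'] on Python 3.7+
```

Note that all four of these words appear in every list, so they are tied at a count of 4. On Python 3.7 and later, `most_common` orders tied elements by first occurrence, which is why the order follows the first list.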