A faster way to find PMI using CountVectorizer

Time: 2016-06-04 15:37:24

Tags: python numpy nlp scikit-learn

First, I build a term-document matrix, i.e. a matrix whose dimensions are terms by number of documents.

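To find the PMI, I count the documents that contain the bigram "this is" and the documents that contain each of its individual words, "this" and "is", and then compute the score as (4.0 * math.log(1.0*bigramLength,2))/(1.0*math.log(word0Length,2)*math.log(1.0*word1Length,2)).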
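To make the arithmetic concrete, here is a minimal sketch of that expression as a standalone helper. The name bigram_score and its parameter names are hypothetical, and the formula is the one from this post rather than the textbook PMI definition.

```python
import math


def bigram_score(bigram_df, word0_df, word1_df):
    """Score from this post: 4 * log2(bigram) / (log2(word0) * log2(word1)).

    Each argument is a document count (how many documents contain the item).
    The denominator is zero when either word occurs in exactly one document,
    so callers are assumed to filter such cases out beforehand.
    """
    return 4.0 * math.log2(bigram_df) / (math.log2(word0_df) * math.log2(word1_df))


# Bigram seen in 2 documents, each word seen in 4 documents:
# 4 * log2(2) / (log2(4) * log2(4)) = 4 * 1 / (2 * 2)
example = bigram_score(2, 4, 4)
```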

Is there a faster way to achieve this? I am not familiar with numpy or scikit-learn.

Note that I need to find the value for every possible bigram in the list featureBigrams.

import math
from collections import defaultdict

import numpy
from sklearn.feature_extraction.text import CountVectorizer

f4 = ['this is sentence1','not sentence1 becuase this is not sentence1','why this this is called this is sentence1, its always setence1','fourth time this is not sentene1']

# note: stop_words='english' drops common words such as 'this' and 'is' before n-grams are built
Vcount = CountVectorizer(analyzer='word',ngram_range=(1,2),stop_words='english')
countMatrix = Vcount.fit_transform(f4)

# all unigrams and bigrams, in column order
# (get_feature_names_out() in scikit-learn >= 1.0; older releases used get_feature_names())
feature_names = list(Vcount.get_feature_names_out())

# finding all bigrams
featureBigrams = [item for item in feature_names if len(item.split()) == 2]

# document term matrix
arrays = countMatrix.toarray()

# term document matrix
arrayTrans = arrays.transpose()

print(len(featureBigrams))
PMIMatrix = defaultdict(dict)
for item in featureBigrams:
    words = item.split()
    # number of documents that contain the bigram
    bigramLength = len(numpy.where(arrayTrans[feature_names.index(item)] > 0)[0])
    if bigramLength < 2:
        continue
    # number of documents that contain each of the two words
    word0Length = len(numpy.where(arrayTrans[feature_names.index(words[0])] > 0)[0])
    word1Length = len(numpy.where(arrayTrans[feature_names.index(words[1])] > 0)[0])
    try:
        PMIMatrix[words[0]][words[1]] = (4.0 * math.log(1.0*bigramLength,2))/(1.0*math.log(word0Length,2)*math.log(1.0*word1Length,2))
    except (ValueError, ZeroDivisionError):
        # report the counts that made the score undefined
        print(bigramLength, word0Length, word1Length)
0 answers:

No answers yet