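First, I build a term-document matrix, i.e. each term is represented as a vector with one dimension per document.
To compute the PMI of a bigram such as `this is`, I count the documents that contain the bigram, count the documents that contain each of its words, `this` and `is`, and then compute the score as (4.0 * math.log(1.0*bigramLength,2))/(1.0*math.log(word0Length,2)*math.log(1.0*word1Length,2)).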
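To make the formula easier to check by hand, here is a minimal sketch of the score as a standalone function. The name `bigram_score` and the sample counts are assumptions made for illustration; they are not part of the original code.

```python
import math


def bigram_score(bigram_docs, word0_docs, word1_docs):
    """Score from the formula above:
    4 * log2(df_bigram) / (log2(df_word0) * log2(df_word1)).

    Each argument is a document count (how many documents contain the term).
    Returns None when the denominator is zero, which happens when one of
    the words occurs in exactly one document (log2(1) == 0).
    """
    denom = math.log(word0_docs, 2) * math.log(word1_docs, 2)
    if denom == 0:
        return None
    return 4.0 * math.log(bigram_docs, 2) / denom


# Hypothetical counts: the bigram occurs in 4 documents, each word in 8.
print(bigram_score(4, 8, 8))
```

The `None` return is one possible way to handle the zero denominator; the loop in the question instead skips bigrams seen in fewer than two documents, which also avoids that case.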
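Is there a faster way to do this? I am not familiar with `numpy` or `scikit`.
Note that I need to do this for every bigram in the list `bigramFeatures` (`featureBigrams` in the code):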
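from sklearn.feature_extraction.text import CountVectorizer

f4 = ['this is sentence1','not sentence1 becuase this is not sentence1','why this this is called this is sentence1, its always setence1','fourth time this is not sentene1']
# note: stop_words='english' drops words such as 'this' and 'is' before the n-grams are built
Vcount = CountVectorizer(analyzer='word', ngram_range=(1, 2), stop_words='english')
countMatrix = Vcount.fit_transform(f4)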
# all unigrams and bigrams (get_feature_names() was removed in scikit-learn 1.2)
feature_names = list(Vcount.get_feature_names_out())
# keep only the bigrams
featureBigrams = [item for item in feature_names if len(item.split()) == 2]
#document term matrix
arrays = countMatrix.toarray()
#term document matrix
arrayTrans = arrays.transpose()
from collections import defaultdict
import math
import numpy
print(len(featureBigrams))
# PMIMatrix[word0][word1] holds the score of the bigram "word0 word1"
PMIMatrix = defaultdict(dict)
for item in featureBigrams:
    words = item.split()
    # number of documents that contain the bigram
    bigramLength = len(numpy.where(arrayTrans[feature_names.index(item)] > 0)[0])
    if bigramLength < 2:
        continue
    # number of documents that contain each word of the bigram
    word0Length = len(numpy.where(arrayTrans[feature_names.index(words[0])] > 0)[0])
    word1Length = len(numpy.where(arrayTrans[feature_names.index(words[1])] > 0)[0])
    try:
        PMIMatrix[words[0]][words[1]] = (4.0 * math.log(1.0*bigramLength,2))/(1.0*math.log(word0Length,2)*math.log(1.0*word1Length,2))
    except (ValueError, ZeroDivisionError):
        # log of zero or a zero denominator: report the offending counts
        print(bigramLength, word0Length, word1Length)
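
One likely source of slowness in the loop above is that every bigram triggers three linear `feature_names.index(...)` searches and three separate `numpy.where` scans. A possible alternative, sketched below under assumptions, computes all document frequencies in one vectorized pass and looks them up in a dict. The function name `score_bigrams`, the `min_docs` parameter, and the toy input are hypothetical and not part of the original code; the sketch expects a dense array, not the sparse matrix returned by `fit_transform`.

```python
import math

import numpy as np


def score_bigrams(counts, feature_names, min_docs=2):
    """counts: dense document-term matrix (rows = documents, columns = features).

    Returns a nested dict scores[word0][word1] using the same formula
    as the loop above.
    """
    # Document frequency of every feature in a single vectorized pass.
    doc_freq = (np.asarray(counts) > 0).sum(axis=0)
    # A dict lookup replaces the repeated linear-time list.index() calls.
    df = {name: int(n) for name, n in zip(feature_names, doc_freq)}

    scores = {}
    for name in feature_names:
        words = name.split()
        if len(words) != 2 or df[name] < min_docs:
            continue
        w0, w1 = words
        if w0 not in df or w1 not in df:
            continue
        denom = math.log(df[w0], 2) * math.log(df[w1], 2)
        if denom == 0:
            continue  # a word seen in exactly one document gives log2(1) == 0
        scores.setdefault(w0, {})[w1] = 4.0 * math.log(df[name], 2) / denom
    return scores


# Toy input (hypothetical): 3 documents, features "good", "good movie", "movie".
names = ["good", "good movie", "movie"]
counts = [[1, 1, 1],
          [2, 1, 1],
          [0, 0, 1]]
print(score_bigrams(counts, names))
```

With the variables from the code above, the call would be `score_bigrams(arrays, feature_names)`.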