PMI of the elements of two lists

Time: 2016-11-06 15:51:34

Tags: python similarity

I want to compute pointwise mutual information (PMI) scores for the elements of two lists. Suppose we have

ListA = "Hi there, This is only a test message. Please enjoy the weather in the park."
ListB = "work, bank, tree, weather, sun"

How can I compute the PMI score for every pair: (work, Hi), (work, there), (work, This) ... (sun, park)?

I can compute the PMI of bigrams within a single list:

import math
import nltk
from nltk.util import ngrams

def pmi(word1, word2, unigram_freq, bigram_count, unigram_total, bigram_total):
    # PMI(w1, w2) = log2( P(w1, w2) / (P(w1) * P(w2)) )
    prob_word1 = unigram_freq[word1] / float(unigram_total)
    prob_word2 = unigram_freq[word2] / float(unigram_total)
    prob_word1_word2 = bigram_count / float(bigram_total)
    return math.log(prob_word1_word2 / (prob_word1 * prob_word2), 2)

tokens = ListA.split()  # ListA is a string, so split it into words before counting
n1_freq = nltk.FreqDist(tokens)             # unigram counts
n2_freq = nltk.FreqDist(ngrams(tokens, 2))  # bigram counts

# Compute the totals once instead of on every call
unigram_total = sum(n1_freq.values())
bigram_total = sum(n2_freq.values())

output_pmi = "test.txt"
with open(output_pmi, "w") as out:
    for (w1, w2), freq in n2_freq.most_common(1000):
        score = pmi(w1, w2, n1_freq, freq, unigram_total, bigram_total)
        out.write("%s\t%s\t%f\n" % (w1, w2, score))
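
For reference, the pmi function above follows the standard definition of pointwise mutual information for a bigram, with probabilities estimated from relative frequencies. The notation is introduced here only for illustration: c(.) is a count, N_1 is the total number of unigrams and N_2 the total number of bigrams.

```latex
\mathrm{PMI}(w_1, w_2) = \log_2 \frac{P(w_1, w_2)}{P(w_1)\,P(w_2)},
\qquad
P(w_i) = \frac{c(w_i)}{N_1},
\quad
P(w_1, w_2) = \frac{c(w_1, w_2)}{N_2}
```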

I'm having trouble computing the bigram PMI for pairs taken from ListA and ListB. I would really appreciate it if someone could help. Thanks a lot!

(The two lists are, of course, just a simplified example of my actual task.)

1 answer:

Answer 0 (score: 1)

If you are looking for all combinations of the two lists:

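import itertools
import string

ListA = "Hi there, This is only a test message. Please enjoy the weather in the park."
ListB = "work, bank, tree, weather, sun"
# Split on whitespace and strip surrounding punctuation, so "work," becomes "work"
WordsA = [w.strip(string.punctuation) for w in ListA.split()]
WordsB = [w.strip(string.punctuation) for w in ListB.split()]
#print(WordsA, "\n\n", WordsB)              # Uncomment to inspect the two word lists
c = list(itertools.product(WordsA, WordsB))  # every (word from A, word from B) pair
print(c)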
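
Enumerating the pairs is only the first step: a PMI score also needs co-occurrence counts, and words drawn from two separate lists never form adjacent bigrams. One possible way forward is to estimate the probabilities from a reference corpus. The sketch below is not from the question; it assumes that "co-occurrence" means "both words appear in the same document", that matching is case-insensitive, and that a corpus is available. The names tokenize, cross_list_pmi, and corpus are hypothetical, and the four corpus sentences are made up purely for illustration.

```python
import itertools
import math
import string
from collections import Counter

def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    words = (w.strip(string.punctuation).lower() for w in text.split())
    return [w for w in words if w]

def cross_list_pmi(words_a, words_b, documents):
    """PMI for every (b, a) pair, using document-level co-occurrence.

    P(w) is the fraction of documents containing w, and P(a, b) is the
    fraction containing both. Pairs that never co-occur map to None,
    because their PMI would be minus infinity.
    """
    doc_sets = [set(tokenize(d)) for d in documents]
    n_docs = float(len(doc_sets))
    vocab_a, vocab_b = set(words_a), set(words_b)

    word_count = Counter()  # number of documents containing each word
    pair_count = Counter()  # number of documents containing both words
    for doc in doc_sets:
        word_count.update(doc)
        for b, a in itertools.product(doc & vocab_b, doc & vocab_a):
            pair_count[(b, a)] += 1

    scores = {}
    for b, a in itertools.product(words_b, words_a):
        joint = pair_count[(b, a)]
        if joint == 0:
            scores[(b, a)] = None
            continue
        p_joint = joint / n_docs
        p_a = word_count[a] / n_docs
        p_b = word_count[b] / n_docs
        scores[(b, a)] = math.log(p_joint / (p_a * p_b), 2)
    return scores

# Hypothetical toy corpus, invented for this example only
corpus = [
    "Please enjoy the weather in the park.",
    "The sun is out and the weather is nice.",
    "There is a tree in the park.",
    "I walk to work past the bank.",
]

ListA = "Hi there, This is only a test message. Please enjoy the weather in the park."
ListB = "work, bank, tree, weather, sun"

scores = cross_list_pmi(tokenize(ListA), tokenize(ListB), corpus)
for pair in [("tree", "park"), ("weather", "park"), ("sun", "park")]:
    print(pair, scores[pair])
```

With a corpus this small the numbers mean little; the point is the shape of the computation. Other definitions of co-occurrence (same sentence, or a sliding window of k words) fit the same structure by changing how doc_sets is built.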