Question

My goal is to compute the PMI of the following text: a = 'When the defendant and his lawyer walked into the court, some of the victim supporters turned their backs on him'

Formula: PMI-IR(w1, w2) = log2( p(w1 & w2) / (p(w1) * p(w2)) ), where p = probability and w = word
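
As a quick sanity check of the formula on toy numbers, here is a minimal sketch. The helper name `pmi` and the probabilities are made up for illustration and do not come from the text above.

```python
import math

def pmi(p_joint, p_w1, p_w2):
    """log2 of the joint probability over the product of the marginals."""
    return math.log2(p_joint / (p_w1 * p_w2))

# Joint probability equals the product of the marginals: no association.
independent = pmi(0.25, 0.5, 0.5)

# Joint probability is twice the product of the marginals.
associated = pmi(0.5, 0.5, 0.5)
```

A PMI of 0 means the two words co-occur exactly as often as independence predicts; positive values mean they co-occur more often than that.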

My attempt:
>>> from nltk import bigrams
>>> import collections
>>> a1=a.split()    
>>> a2=collections.Counter(a1)
>>> a3=collections.Counter(bigrams(a1))
>>> a4=sum([a2[x]for x in a2])
>>> a5=sum([a3[x]for x in a3])
>>> a6={x:float(a2[x])/a4 for x in a2} # word probabilities(w1 and w2)
>>> a7={x:float(a3[x])/a5 for x in a3} # joint probabilities (w1&w2)
>>> for x in a6:
    k={x:round(log(a7[b]/(a6[x] * a6[y]),2),4) for b in a7 for y in a6 if x and y in b}
    u.append(k)
>>> u
[{'and': 4.3959}, {'on': 4.3959}, {'his': 4.3959}, {'When': 4.3959}.....}]

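The results I get seem wrong, for two reasons: (1) I want one big dictionary, but I get many small dictionaries, one per item. (2) The probabilities may not be plugged into the equation correctly, as this is my first attempt at this problem.

Any suggestions? Thanks.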

Answer 1

I'm not an NLP expert, but your equation looks fine. The implementation has a subtle bug. Consider the following precedence deep dive:

"""Precedence deep dive"""
'hi' and True         # True: any non-empty string is truthy, so the result is the right operand
'hi' and False        # False
b = ('hi','bob')
'hi' and 'bob' in b   # True, BUT not because 'hi' is in b!
'hia' and 'bob' in b  # True: parsed as 'hia' and ('bob' in b)
result2 = 'bob' in b
'hia' and result2     # True; shows the precedence more clearly
'hi' and 'boba' in b  # False

# each string needs its own membership test against b
'hi' in b and 'bob' in b    # True
'hia' in b and 'bob' in b   # False
'hi' in b and 'boba' in b   # False: same as before, but now each string is checked separately

Note the difference between the results u and v below: u is computed with the wrong precedence, v with the correct one.

from nltk import bigrams
import collections
import math  # needed for math.log below

a= """When the defendant and his lawyer walked into the court, some of the victim supporters turned their backs on him.  if we have more data then it will be more interesting because we have more chance to repeat bigrams. After some of the victim supporters turned their backs then a subset of the victim supporters turned around and left the court."""

a1=a.split()                         # whitespace tokenization
a2=collections.Counter(a1)           # unigram counts

a3=collections.Counter(bigrams(a1))  # bigram counts
a4=sum([a2[x]for x in a2])           # total number of unigrams
a5=sum([a3[x]for x in a3])           # total number of bigrams
a6={x:float(a2[x])/a4 for x in a2} # word probabilities (w1 and w2)
a7={x:float(a3[x])/a5 for x in a3} # joint probabilities (w1&w2)
u = {}
v = {}
for x in a6:
  # every entry of k has the same key x, so only the last matching value is kept
  # wrong: the condition parses as `x and (y in b)`, so x is never tested against b
  k={x:round(math.log((a7[b]/(a6[x] * a6[y])),2),4) for b in a7 for y in a6 if x and y in b}
  u[x] = k[x]
  # right: both x and y must be members of the bigram b
  k={x:round(math.log((a7[b]/(a6[x] * a6[y])),2),4) for b in a7 for y in a6 if x in b and y in b}
  v[x] = k[x]

u['the']
v['the']
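
If the goal is one big dictionary, an alternative is to key the result by bigram and look up each word's probability directly, which avoids the membership tests altogether. The following is only a sketch under a few assumptions: the function name `bigram_pmi` is hypothetical, tokenization is a plain whitespace split as in the question, and `zip(tokens, tokens[1:])` stands in for `nltk.bigrams` so the snippet has no third-party dependency.

```python
import math
from collections import Counter

def bigram_pmi(text):
    """Return {(w1, w2): PMI} for every adjacent word pair in text."""
    tokens = text.split()
    unigram_counts = Counter(tokens)
    # Pair each token with its successor to get the adjacent bigrams.
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    n_unigrams = sum(unigram_counts.values())
    n_bigrams = sum(bigram_counts.values())

    pmi = {}
    for (w1, w2), count in bigram_counts.items():
        p_joint = count / n_bigrams
        p_w1 = unigram_counts[w1] / n_unigrams
        p_w2 = unigram_counts[w2] / n_unigrams
        pmi[(w1, w2)] = round(math.log2(p_joint / (p_w1 * p_w2)), 4)
    return pmi

# Toy input; the sentence from the question can be passed in the same way.
scores = bigram_pmi("a b a b c")
```

Each bigram gets exactly one entry, so nothing is overwritten inside the loop. On a text as short as the one in the question most bigrams occur only once, so the scores mostly reflect how rare the individual words are.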
