Consider the following basis:
basis = "Each word of the text is converted as follows: move any consonant (or consonant cluster) that appears at the start of the word to the end, then append ay."
and the following words:
words = "word, text, bank, tree"
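How can I calculate the PMI value of each word in "words" against each word in "basis", using a context window of size 5 (that is, two positions before and two positions after the target word)?
I know how to calculate PMI, but I don't know how to take the context window into account.
I calculate the "normal" PMI value as follows: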
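from math import log

def PMI(ContingencyTable):
    (a, b, c, d, N) = ContingencyTable
    # add-one smoothing on every cell to avoid log(0)
    a += 1
    b += 1
    c += 1
    d += 1
    N += 4
    R_1 = a + b  # first row total
    C_1 = a + c  # first column total
    # log2(a * N / (R_1 * C_1))
    return log(float(a) / (float(R_1) * float(C_1)) * float(N), 2)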
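For reference, here is the quantity this function returns, written out. The reading of the cells is an assumption, since the code does not state it: a is taken to be the count of events where both words occur, b and c the counts where only one of them occurs, and d the count where neither does.

```latex
\[
\mathrm{PMI}
  = \log_2 \frac{a'\,N'}{(a'+b')\,(a'+c')}
  = \log_2 \frac{P(x,y)}{P(x)\,P(y)},
\qquad
a' = a+1,\quad b' = b+1,\quad c' = c+1,\quad N' = N+4,
\]
\[
P(x,y) = \frac{a'}{N'},\qquad
P(x) = \frac{a'+b'}{N'},\qquad
P(y) = \frac{a'+c'}{N'}.
\]
```

Under that reading, the open question is only what counts as an "event" once a context window is involved.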
Answer 0 (score: 0)
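I did a little searching on PMI, and it looks like there are heavy-duty packages out there that include "windowing".
In PMI, the "mutual" seems to refer to the joint probability of two different words, so you need to firm up that idea in the problem statement.
I took on the smaller problem of just generating the short windowed lists from your problem statement, mostly for my own exercise: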
def wndw(wrd_l, m_l, pre, post):
    """
    returns a list of all lists of sequential words in input wrd_l
    that are within range -pre and +post of any word in wrd_l that matches
    a word in m_l
    wrd_l = list of words
    m_l = list of words to match on
    pre, post = ints giving range of indices to include in window size
    """
    wndw_l = list()
    for i, w in enumerate(wrd_l):
        if w in m_l:
            wndw_l.append([wrd_l[i + k] for k in range(-pre, post + 1)
                           if 0 <= (i + k) < len(wrd_l)])
    return wndw_l
basis = """Each word of the text is converted as follows: move any
consonant (or consonant cluster) that appears at the start
of the word to the end, then append ay."""
words = "word, text, bank, tree"
print(*wndw(basis.split(), [x.strip() for x in words.split(',')], 2, 2),
      sep="\n")
['Each', 'word', 'of', 'the']
['of', 'the', 'text', 'is', 'converted']
['of', 'the', 'word', 'to', 'the']
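One possible way to go from windows like these to actual PMI values is to treat every (target, neighbour) pair inside a window as one co-occurrence event and estimate the probabilities from those pair counts. The sketch below does that under several assumptions of my own, none of which come from the question: the helper names `window_pairs` and `window_pmi`, lowercasing and stripping punctuation before counting, and using unsmoothed counts.

```python
import re
from collections import Counter
from math import log2


def window_pairs(tokens, half_width=2):
    """Yield (target, context) for each token and every neighbour that is
    at most half_width positions away (the target itself is skipped)."""
    for i, target in enumerate(tokens):
        lo = max(0, i - half_width)
        hi = min(len(tokens), i + half_width + 1)
        for j in range(lo, hi):
            if j != i:
                yield target, tokens[j]


def window_pmi(tokens, targets, half_width=2):
    """Return {(target, context): PMI} for the requested target words,
    with probabilities estimated from windowed pair counts."""
    pair_counts = Counter(window_pairs(tokens, half_width))
    total = sum(pair_counts.values())

    # Marginal counts: how often each word appears as target / as context.
    target_totals = Counter()
    context_totals = Counter()
    for (t, c), n in pair_counts.items():
        target_totals[t] += n
        context_totals[c] += n

    scores = {}
    for (t, c), n in pair_counts.items():
        if t in targets:
            # log2( P(t, c) / (P(t) * P(c)) ) expressed with raw counts
            scores[(t, c)] = log2(n * total / (target_totals[t] * context_totals[c]))
    return scores


basis = """Each word of the text is converted as follows: move any
consonant (or consonant cluster) that appears at the start
of the word to the end, then append ay."""
words = "word, text, bank, tree"

tokens = re.findall(r"[a-z]+", basis.lower())
targets = {w.strip() for w in words.split(",")}
scores = window_pmi(tokens, targets)

for (t, c), s in sorted(scores.items()):
    print(f"{t:5s} {c:10s} {s:7.3f}")
```

Only pairs that were actually observed get a score. "bank" and "tree" never occur in the basis, so nothing is reported for them; if you want a value for unseen pairs, you would have to add smoothing such as the +1 in your own function.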