from __future__ import division
import urllib
import json
from math import log

def hits(word1, word2=""):
    # Estimated Google hit count for word1 alone, or for word1 near word2.
    query = "http://ajax.googleapis.com/ajax/services/search/web?v=1.0&q=%s"
    if word2 == "":
        results = urllib.urlopen(query % urllib.quote_plus(word1))
    else:
        # AROUND(10): ask for pages where the two terms occur
        # within ten words of each other.
        results = urllib.urlopen(query % urllib.quote_plus(word1 + " AROUND(10) " + word2))
    json_res = json.loads(results.read())
    google_hits = int(json_res['responseData']['cursor']['estimatedResultCount'])
    return google_hits

def so(phrase):
    # Semantic orientation: log-ratio of hits near "excellent" vs. near "poor".
    num = hits(phrase, "excellent")
    den = hits(phrase, "poor")
    return log(num / den)

print so("ugly product")
I need this code to calculate Pointwise Mutual Information, which can be used to classify reviews as positive or negative. Basically I am using the technique specified in Turney (2002): http://acl.ldc.upenn.edu/P/P02/P02-1053.pdf as an example of an unsupervised classification method for sentiment analysis.
As explained in the paper, the semantic orientation of a phrase is negative if the phrase is more strongly associated with the word "poor", and positive if it is more strongly associated with the word "excellent". The code above computes the SO of a phrase. I use Google to get the hit counts and calculate the SO (since AltaVista no longer exists).
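In the paper, the SO of a phrase is defined as the difference of two PMI scores, which with search-engine hit counts reduces to (base-2 logarithm):

SO(phrase) = PMI(phrase, "excellent") - PMI(phrase, "poor")
           = log2( hits(phrase NEAR "excellent") * hits("poor") / ( hits(phrase NEAR "poor") * hits("excellent") ) )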
The values it computes are wildly erratic and do not follow any particular pattern. For example, SO("ugly product") comes out as 2.85462098541 while SO("beautiful product") is 1.71395061117, although the former is expected to be negative and the latter positive.
Is there a problem with the code? Is there an easier way to calculate the SO of a phrase (using PMI) with any Python library, say NLTK? I tried NLTK but was not able to find any explicit method that computes PMI.
Answer 0 (score: 14)
In general, computing PMI is tricky because the formula changes depending on the size of the ngram you take into account:
Mathematically, for bigrams, you can simply consider:
log(p(a,b) / ( p(a) * p(b) ))
Programmatically, assuming you have already computed all the unigram and bigram frequencies in your corpus, you can do this:
import math

def pmi(word1, word2, unigram_freq, bigram_freq):
    prob_word1 = unigram_freq[word1] / float(sum(unigram_freq.values()))
    prob_word2 = unigram_freq[word2] / float(sum(unigram_freq.values()))
    prob_word1_word2 = bigram_freq[" ".join([word1, word2])] / float(sum(bigram_freq.values()))
    return math.log(prob_word1_word2 / float(prob_word1 * prob_word2), 2)
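For example, here is a quick sketch of how those frequency tables might be built with collections.Counter; the toy corpus is made up for illustration:

from collections import Counter

tokens = "this is a foo bar sentence and this is another foo bar sentence".split()
# unigram counts, and bigram counts keyed as "word1 word2" to match pmi() above
unigram_freq = Counter(tokens)
bigram_freq = Counter(" ".join(pair) for pair in zip(tokens, tokens[1:]))

print pmi("foo", "bar", unigram_freq, bigram_freq)  # positive: "foo bar" is a strong collocation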
This is a snippet from an MWE library, but it is in its pre-development stage (https://github.com/alvations/Terminator/blob/master/mwe.py). Do note that it is meant for parallel MWE extraction, so here is how you can "hack" it to extract monolingual MWEs:
$ wget https://dl.dropboxusercontent.com/u/45771499/mwe.py
$ printf "This is a foo bar sentence .\nI need multi-word expression from this text file.\nThe text file is messed up , I know you foo bar multi-word expression thingy .\n More foo bar is needed , so that the text file is populated with some sort of foo bar bigrams to extract the multi-word expression ." > src.txt
$ printf "" > trg.txt
$ python
>>> import codecs
>>> from mwe import load_ngramfreq, extract_mwe
>>> # Compute the unigram and bigram counts.
>>> # Or, more grandly: "train a bigram 'language model'".
>>> unigram, bigram, _ , _ = load_ngramfreq('src.txt','trg.txt')
>>> sent = "This is another foo bar sentence not in the training corpus ."
>>> for threshold in range(-2, 4):
... print threshold, [mwe for mwe in extract_mwe(sent.strip().lower(), unigram, bigram, threshold)]
[OUT]:
-2 ['this is', 'is another', 'another foo', 'foo bar', 'bar sentence', 'sentence not', 'not in', 'in the', 'the training', 'training corpus', 'corpus .']
-1 ['this is', 'is another', 'another foo', 'foo bar', 'bar sentence', 'sentence not', 'not in', 'in the', 'the training', 'training corpus', 'corpus .']
0 ['this is', 'foo bar', 'bar sentence']
1 ['this is', 'foo bar', 'bar sentence']
2 ['this is', 'foo bar', 'bar sentence']
3 ['foo bar', 'bar sentence']
4 []
For more details, I found this paper to be a quick and easy introduction to MWE extraction: "Extending the Log Likelihood Measure to Improve Collocation Identification", see http://goo.gl/5ebTJJ
Answer 1 (score: 5)
The Python library DISSECT contains a few methods to compute Pointwise Mutual Information on co-occurrence matrices.
Example:
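A minimal sketch of the PPMI weighting step, assuming a co-occurrence space has already been built and pickled with DISSECT's io utilities (the pickle path below is a placeholder):

from composes.utils import io_utils
from composes.transformation.scaling.ppmi_weighting import PpmiWeighting

# load a previously saved co-occurrence space
my_space = io_utils.load("./data/out/ex01.pkl")

# re-weight the raw counts with positive pointwise mutual information (PPMI)
my_space = my_space.apply(PpmiWeighting())
print my_space.cooccurrence_matrix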
Code on GitHub for the PMI methods
Reference: Georgiana Dinu, Nghia The Pham, and Marco Baroni. 2013. DISSECT: DIStributional SEmantics Composition Toolkit. In Proceedings of the System Demonstrations of ACL 2013, Sofia, Bulgaria.
Related: Calculating pointwise mutual information between two strings
Answer 2 (score: 3)
To answer why your results are erratic: be aware that Google Search is not a reliable source of word frequencies. The frequencies reported by the engine are mere estimates, which are particularly inaccurate and possibly contradictory when querying for multiple words. This is not to bash Google, but it is simply not a usable tool for frequency counts. Your implementation may therefore be fine, but results built on such counts can still be nonsensical.
For a more in-depth discussion of this matter, read "Googleology is bad science" by Adam Kilgarriff.