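Comparing the similarity of terms/expressions using NLTK?

Time: 2013-06-01 21:38:04

Tags: python nltk

I'm trying to compare terms/expressions that are semantically related. These are not full sentences, and not necessarily single words either; for example:

'social networking service' and 'social network' are clearly closely related, but how can I quantify this using nltk?

Apparently I'm missing something even at the code level, because: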

w1 = wordnet.synsets('social network')

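returns an empty list.

Any suggestions on how to tackle this?

5 answers:

Answer 0: (score: 3)

There are several measures of semantic relatedness or similarity, but they are better defined for single words or single expressions that appear in WordNet's lexicon; as far as I know, not for compounds built from several WordNet lexical entries.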
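To make concrete why these measures are tied to single lexicon entries, here is a minimal sketch of two path-based measures computed on a tiny, hypothetical hypernym tree. The `PARENT` table and all node names are invented for illustration. Real WordNet is a much larger graph in which a synset can have several hypernyms, and NLTK exposes its own versions as `Synset.path_similarity` and `Synset.wup_similarity`. Both measures need each term to be a node in the hierarchy, which a free-form compound is not.

```python
# Hypothetical toy hypernym tree: child -> parent (None marks the root).
PARENT = {
    "entity": None,
    "animal": "entity",
    "dog": "animal",
    "cat": "animal",
    "artifact": "entity",
    "network": "artifact",
}


def ancestors(node):
    """Return the chain from node up to the root, node first."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def depth(node):
    """Depth of a node, counting the root as depth 1."""
    return len(ancestors(node))


def lowest_common_subsumer(a, b):
    """Deepest node that is an ancestor of (or equal to) both a and b."""
    seen = set(ancestors(a))
    for node in ancestors(b):  # walks upward, so the first hit is the deepest
        if node in seen:
            return node
    raise ValueError("no common ancestor")


def path_similarity(a, b):
    """1 / (number of edges on the shortest path between a and b, plus 1)."""
    lcs = lowest_common_subsumer(a, b)
    edges = (depth(a) - depth(lcs)) + (depth(b) - depth(lcs))
    return 1.0 / (edges + 1)


def wup_similarity(a, b):
    """Wu-Palmer: 2 * depth(lcs) / (depth(a) + depth(b))."""
    lcs = lowest_common_subsumer(a, b)
    return 2.0 * depth(lcs) / (depth(a) + depth(b))


print(path_similarity("dog", "cat"), wup_similarity("dog", "cat"))
print(path_similarity("dog", "network"), wup_similarity("dog", "network"))
```

In this toy tree, "dog" and "cat" share the parent "animal", so they score higher than "dog" and "network", which only meet at the root.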

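Here is a nice web implementation of a number of WordNet-based similarity measures

If you're interested, there is further reading on interpreting compounds using WordNet similarity (though not on evaluating the similarity of compounds):

Answer 1: (score: 2)

Here is a solution you can use.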

     w1 = wordnet.synsets('social')
     w2 = wordnet.synsets('network')

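w1 and w2 will each hold a list of synsets. Compute the similarity between every synset in w1 and every synset in w2. The pair with the highest similarity gives you the combined sense (which is what you are looking for).

Here is the complete code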

from nltk.corpus import wordnet
x = 'social'
y = 'network'
xsyn = wordnet.synsets(x)
# xsyn
#[Synset('sociable.n.01'), Synset('social.a.01'), Synset('social.a.02'),   
#Synset('social.a.03'), Synset('social.s.04'), Synset('social.s.05'),   
#Synset('social.s.06')]

ysyn = wordnet.synsets(y)
#ysyn
#[Synset('network.n.01'), Synset('network.n.02'), Synset('net.n.06'), 
#Synset('network.n.04'), Synset('network.n.05'), Synset('network.v.01')]

xlen = len(xsyn)
ylen = len(ysyn)

import numpy
simindex = numpy.zeros((xlen, ylen))

# Written for Python 3 and NLTK 3, where pos() is a method on Synset.
def relative_matrix(asyn, bsyn, simindex):
    """Fill simindex[i, j] with the Wu-Palmer similarity of asyn[i] and bsyn[j]."""
    for i, a in enumerate(asyn):
        for j, b in enumerate(bsyn):
            # Compare only within one part of speech (noun-noun, verb-verb),
            # not noun-verb or adjective-noun.
            if a.pos() != b.pos():
                continue
            score = a.wup_similarity(b)
            # wup_similarity can return None when no common ancestor is found.
            if score is not None and simindex[i, j] < score:
                simindex[i, j] = score

relative_matrix(xsyn, ysyn, simindex)
print(simindex)
'''
array([[ 0.46153846,  0.125     ,  0.13333333,  0.125     ,  0.125     ,
     0.        ],
   [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
     0.        ],
   [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
     0.        ],
   [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
     0.        ],
   [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
     0.        ],
   [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
     0.        ],
   [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
     0.        ]])
'''
#xsyn[0].definition()
#'a party of people assembled to promote sociability and communal activity'
#ysyn[0].definition()
#'an interconnected system of things or people'

As you can see, simindex[0,0] holds the maximum value, 0.46153846, so xsyn[0] and ysyn[0] seem to best describe w1 = wordnet.synsets('social network'); you can check their definitions above.
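Rather than reading the maximum off the printed matrix by eye, the best-scoring pair can be located programmatically. The sketch below is one possible way to do it; `best_pair` is a hypothetical helper name, not part of NLTK or numpy. It accepts any 2-D sequence of rows, so it works on nested lists as well as on the numpy array built above. With numpy, `numpy.unravel_index(numpy.argmax(simindex), simindex.shape)` would be an alternative.

```python
def best_pair(matrix):
    """Return (row, column, score) of the largest entry in a 2-D matrix."""
    best = None
    for i, row in enumerate(matrix):
        for j, score in enumerate(row):
            # Strict comparison keeps the first occurrence on ties.
            if best is None or score > best[2]:
                best = (i, j, float(score))
    if best is None:
        raise ValueError("matrix is empty")
    return best


# First two rows of the simindex matrix shown in the answer.
simindex = [
    [0.46153846, 0.125, 0.13333333, 0.125, 0.125, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
]

i, j, score = best_pair(simindex)
print(i, j, score)
```

The returned indices can then be used to look up the matching synsets, for example xsyn[i] and ysyn[j].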

Answer 2: (score: 1)

https://www.mashape.com/amtera/esa-semantic-relatedness

This is a web API for computing the semantic relatedness between pairs of words or text excerpts.

Answer 3: (score: 1)

import difflib

sm = difflib.SequenceMatcher(None)

sm.set_seq2('Social network')
#SequenceMatcher computes and caches detailed information
#about the second sequence, so if you want to compare one
#sequence against many sequences, use set_seq2() to set
#the commonly used sequence once and call set_seq1()
#repeatedly, once for each of the other sequences.
# (the doc)

for x in ('Social networking service',
          'Social working service',
          'Social ocean',
          'Atlantic ocean',
          'Atlantic and arctic oceans'):
    sm.set_seq1(x)
    # Python 3 print; rounded to 12 places to match the results listed below
    print(x, round(sm.ratio(), 12))

Result

Social networking service 0.717948717949
Social working service 0.611111111111
Social ocean 0.615384615385
Atlantic ocean 0.214285714286
Atlantic and arctic oceans 0.15
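It may help to see where these numbers come from. The difflib documentation defines `ratio()` as 2.0*M/T, where M is the number of matched characters and T is the combined length of both sequences. The sketch below recomputes that value from `get_matching_blocks()`; `manual_ratio` is a hypothetical helper name used only for illustration. Because the score counts shared characters, it measures spelling overlap rather than meaning, which is why 'Social ocean' can score slightly above 'Social working service' in the results above.

```python
import difflib


def manual_ratio(a, b):
    """Recompute SequenceMatcher.ratio() as 2*M/T from the matching blocks."""
    sm = difflib.SequenceMatcher(None, a, b)
    # Each block is (start in a, start in b, size); the last one is a size-0 sentinel.
    matched = sum(block.size for block in sm.get_matching_blocks())
    total = len(a) + len(b)
    # Two empty sequences are treated as identical, as difflib does.
    return 2.0 * matched / total if total else 1.0


for phrase in ('Social networking service', 'Social ocean', 'Atlantic ocean'):
    sm = difflib.SequenceMatcher(None, phrase, 'Social network')
    print(phrase, manual_ratio(phrase, 'Social network'), sm.ratio())
```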

Answer 4: (score: 1)

You may need a WSD (word sense disambiguation) module that returns WordNet Synset objects from NLTK. If so, take a look at: https://github.com/alvations/pywsd

$ wget https://github.com/alvations/pywsd/archive/master.zip
$ unzip master.zip
$ cd pywsd-master/
$ ls
baseline.py  cosine.py  lesk.py  README.md  similarity.py  test_wsd.py
$ python
>>> from similarity import max_similarity
>>> sent = 'I went to the bank to deposit my money'
>>> sim_choice = "lin" # Using Lin's (1998) similarity measure.
>>> print "Context:", sent
>>> print "Similarity:", sim_choice 
>>> answer = max_similarity(sent, 'bank', sim_choice)
>>> print "Sense:", answer
>>> print "Definition", answer.definition

[OUT]:

Context: I went to the bank to deposit my money
Similarity: lch
Sense: Synset('bank.n.09')
Definition a building in which the business of banking transacted
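For readers unfamiliar with WSD, the sketch below illustrates the general idea with a simplified gloss-overlap (Lesk-style) heuristic: pick the sense whose definition shares the most words with the sentence. This is not pywsd's implementation and not the max_similarity method used above. The sense labels, glosses, and stopword list are hypothetical toy data; a real system would take its glosses from WordNet.

```python
# Hypothetical toy sense inventory: sense label -> gloss.
TOY_SENSES = {
    "bank.river": "sloping land beside a body of water such as a river",
    "bank.finance": "an institution that accepts money deposits and lends money",
}

# Small illustrative stopword list (an assumption, not a standard resource).
STOPWORDS = {"a", "an", "the", "to", "of", "and", "i", "my", "that", "as", "such"}


def content_words(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return {word for word in text.lower().split() if word not in STOPWORDS}


def simplified_lesk(sentence, senses):
    """Return the sense label whose gloss overlaps most with the sentence."""
    context = content_words(sentence)
    # On ties, max() keeps the first sense in iteration order.
    return max(senses, key=lambda label: len(context & content_words(senses[label])))


sent = 'I went to the bank to deposit my money'
print(simplified_lesk(sent, TOY_SENSES))
```

Here the word "money" appears in both the sentence and the financial gloss, while the river gloss shares no content words, so the financial sense is chosen.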