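I'm trying to compare terms/expressions that are semantically related. They are not full sentences, and not necessarily single words either. For example,
'social networking service' and 'social network' are clearly closely related, but how can I quantify this using nltk?
Obviously I'm missing something, since even this code: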
w1 = wordnet.synsets('social network')
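returns an empty list.
Any suggestions on how to tackle this?
Answer 0 (score: 3)
There are several measures of semantic relatedness or similarity, but they are better defined for single words or single expressions in WordNet's lexicon, not, as far as I know, for compounds built from WordNet lexical entries.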
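Here is a nice web implementation of many WordNet-based similarity measures
If you are interested, here is some further reading on interpreting compounds using WordNet similarity (although it does not evaluate the similarity of compounds):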
Answer 1 (score: 2)
Here is a solution you can use.
w1 = wordnet.synsets('social')
w2 = wordnet.synsets('network')
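w1 and w2 will each hold a list of synsets. Compute the similarity between every synset of w1 and every synset of w2. The pair with the highest similarity gives you the combined sense (which is what you are looking for).
Here is the full code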
import numpy
from nltk.corpus import wordnet

x = 'social'
y = 'network'
xsyn = wordnet.synsets(x)
# xsyn
# [Synset('sociable.n.01'), Synset('social.a.01'), Synset('social.a.02'),
#  Synset('social.a.03'), Synset('social.s.04'), Synset('social.s.05'),
#  Synset('social.s.06')]
ysyn = wordnet.synsets(y)
# ysyn
# [Synset('network.n.01'), Synset('network.n.02'), Synset('net.n.06'),
#  Synset('network.n.04'), Synset('network.n.05'), Synset('network.v.01')]

# One row per sense of x, one column per sense of y
simindex = numpy.zeros((len(xsyn), len(ysyn)))

def relative_matrix(asyn, bsyn, simindex):
    # Fill simindex with the Wu-Palmer similarity of every synset pair
    for i, a in enumerate(asyn):
        for j, b in enumerate(bsyn):
            # Compare noun with noun, verb with verb; skip noun-verb or adjective-noun
            # (pos() is a method in NLTK 3; it was an attribute in older releases)
            if a.pos() != b.pos():
                continue
            score = a.wup_similarity(b)
            # wup_similarity can return None when the two synsets share no ancestor
            if score is not None and score > simindex[i, j]:
                simindex[i, j] = score

relative_matrix(xsyn, ysyn, simindex)
print(simindex)
'''
array([[ 0.46153846, 0.125 , 0.13333333, 0.125 , 0.125 ,
0. ],
[ 0. , 0. , 0. , 0. , 0. ,
0. ],
[ 0. , 0. , 0. , 0. , 0. ,
0. ],
[ 0. , 0. , 0. , 0. , 0. ,
0. ],
[ 0. , 0. , 0. , 0. , 0. ,
0. ],
[ 0. , 0. , 0. , 0. , 0. ,
0. ],
[ 0. , 0. , 0. , 0. , 0. ,
0. ]])
'''
# xsyn[0].definition()
# 'a party of people assembled to promote sociability and communal activity'
# ysyn[0].definition()
# 'an interconnected system of things or people'
As you can see, simindex[0,0] is the maximum value, 0.46153846, so xsyn[0] and ysyn[0] seem to best describe w1 = wordnet.synsets('social network')
You can check the definitions above.
Answer 2 (score: 1)
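The maximum can also be read off the matrix programmatically rather than by eye. A minimal sketch, assuming the scores are held as a plain list of rows; the values are copied from the simindex output above, and `best_pair` and `simindex_rows` are names introduced here for illustration only:

```python
def best_pair(matrix):
    """Return (row, column, score) of the largest entry in a list of rows."""
    best = (0, 0, matrix[0][0])
    for i, row in enumerate(matrix):
        for j, score in enumerate(row):
            if score > best[2]:
                best = (i, j, score)
    return best

# Scores copied from the simindex output shown above
# (rows: senses of 'social', columns: senses of 'network').
simindex_rows = [
    [0.46153846, 0.125, 0.13333333, 0.125, 0.125, 0.0],
] + [[0.0] * 6 for _ in range(6)]

i, j, score = best_pair(simindex_rows)
print(i, j, score)  # 0 0 0.46153846
```

The returned indices can then be used to look up xsyn[i] and ysyn[j]. With the numpy array itself, numpy.unravel_index(simindex.argmax(), simindex.shape) should give the same row and column.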
https://www.mashape.com/amtera/esa-semantic-relatedness
This is a web API for computing semantic relatedness between pairs of words or text excerpts.
Answer 3 (score: 1)
import difflib
sm = difflib.SequenceMatcher(None)
sm.set_seq2('Social network')
#SequenceMatcher computes and caches detailed information
#about the second sequence, so if you want to compare one
#sequence against many sequences, use set_seq2() to set
#the commonly used sequence once and call set_seq1()
#repeatedly, once for each of the other sequences.
# (the doc)
for x in ('Social networking service',
          'Social working service',
          'Social ocean',
          'Atlantic ocean',
          'Atlantic and arctic oceans'):
    sm.set_seq1(x)
    # %.12g prints 12 significant digits, matching the results listed below
    print(x, '%.12g' % sm.ratio())
Result
Social networking service 0.717948717949
Social working service 0.611111111111
Social ocean 0.615384615385
Atlantic ocean 0.214285714286
Atlantic and arctic oceans 0.15
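If the goal is only to pick the closest strings from a list, the standard library's difflib.get_close_matches wraps the same SequenceMatcher ratio and returns the candidates above a cutoff, best first. A small sketch, with the candidate list taken from the loop above; the limit of three results and the cutoff of 0.6 are values chosen here purely for illustration:

```python
import difflib

candidates = [
    'Social networking service',
    'Social working service',
    'Social ocean',
    'Atlantic ocean',
    'Atlantic and arctic oceans',
]

# Keep at most three candidates whose ratio against the query is >= 0.6,
# ordered from most to least similar.
matches = difflib.get_close_matches('Social network', candidates, n=3, cutoff=0.6)
print(matches)
# ['Social networking service', 'Social ocean', 'Social working service']
```

Note that 'Social ocean' ranks above 'Social working service' here: the ratio is computed on characters only, so it measures string overlap rather than meaning.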
Answer 4 (score: 1)
Perhaps you need a WSD (word sense disambiguation) module that returns WordNet Synset objects from NLTK. If so, take a look at: https://github.com/alvations/pywsd
$ wget https://github.com/alvations/pywsd/archive/master.zip
$ unzip master.zip
$ cd pywsd-master/
$ ls
baseline.py cosine.py lesk.py README.md similarity.py test_wsd.py
$ python
>>> from similarity import max_similarity
>>> sent = 'I went to the bank to deposit my money'
>>> sim_choice = "lin" # Using Lin's (1998) similarity measure.
>>> print "Context:", sent
>>> print "Similarity:", sim_choice
>>> answer = max_similarity(sent, 'bank', sim_choice)
>>> print "Sense:", answer
>>> print "Definition", answer.definition
[OUT]:
Context: I went to the bank to deposit my money
Similarity: lin
Sense: Synset('bank.n.09')
Definition a building in which the business of banking transacted