NLTK: Lesk returns different results for the same input

Time: 2015-02-10 20:19:40

Tags: python-3.x nltk wordnet disambiguation

I am using the Lesk algorithm to get SynSets from text, but I get different results for the same input. Is this a "feature" of the Lesk algorithm, or am I doing something wrong? Here is the code I am using:

    self.SynSets = []
    sentences = sent_tokenize("Python is a widely used general-purpose, high-level programming language.\
        Its design philosophy emphasizes code readability, and its syntax allows programmers to express concepts in fewer lines of code than would be possible in languages such as C++ or Java.\
        The language provides constructs intended to enable clear programs on both a small and large scale.\
        Python supports multiple programming paradigms, including object-oriented, imperative and functional programming or procedural styles.\
        ")
    stopwordsList = stopwords.words('english')
    self.sentNum = 0
    for sentence in sentences:
        raw_tokens = word_tokenize(sentence)
        # drop stopwords and purely numeric tokens (the length filter is commented out)
        final_tokens = [token.lower() for token in raw_tokens
                        if token not in stopwordsList
                        #and len(token) > 3
                        and not token.isdigit()]
        for token in final_tokens:
            synset = wsd.lesk(sentence, token)
            if synset is not None:
                self.SynSets.append(synset)

    self.SynSets = set(self.SynSets)
    self.WriteSynSets()
    return self
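
Note that `wsd.lesk` appears to treat its first argument simply as an iterable of context items (as far as I can tell it just builds `set(context_sentence)` internally), so passing the raw sentence string turns the context into a set of characters rather than words, and most senses then tie on gloss overlap. A minimal sketch of the difference, assuming a standard NLTK install with WordNet downloaded:

    from nltk.wsd import lesk

    sent = "Python is a widely used general-purpose, high-level programming language."
    # raw string: the context becomes a set of characters
    print(lesk(sent, "language"))
    # tokenized list: the context becomes a set of words (the intended usage)
    print(lesk(sent.split(), "language"))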

In the output I get these results (the first 3 results from 2 different runs):

Synset('allow.v.09')   Synset('code.n.03')   Synset('coffee.n.01') 
------------
Synset('allow.v.09')   Synset('argumentation.n.02')   Synset('boastfully.r.01')  

If there is another (more stable) way to get synsets, I would be very grateful for your help.

Thanks in advance.


EDITED

As an additional example, here is the complete script, which I ran twice:

import nltk
from nltk.tokenize import sent_tokenize
from nltk import word_tokenize
from nltk import wsd
from nltk.corpus import stopwords

SynSets = []
sentences = sent_tokenize("Python is a widely used general-purpose, high-level programming language.\
    Its design philosophy emphasizes code readability, and its syntax allows programmers to express concepts in fewer lines of code than would be possible in languages such as C++ or Java.\
    The language provides constructs intended to enable clear programs on both a small and large scale.\
    Python supports multiple programming paradigms, including object-oriented, imperative and functional programming or procedural styles.\
    ")
stopwordsList = stopwords.words('english')

for sentence in sentences:
    raw_tokens = word_tokenize(sentence)  # alternatively: WordPunctTokenizer().tokenize(sentence)
    # drop stopwords and purely numeric tokens (the length filter is commented out)
    final_tokens = [token.lower() for token in raw_tokens
                    if token not in stopwordsList
                    #and len(token) > 3
                    and not token.isdigit()]
    for token in final_tokens:
        synset = wsd.lesk(sentence, token)
        if synset is not None:
            SynSets.append(synset)


SynSets = sorted(set(SynSets))
with open("synsets.txt", "a") as file:
    file.write("\n-------------------\n")
    for synset in SynSets:
        file.write("{}   ".format(synset))
# the with-statement closes the file automatically

I got these results (the first 4 synsets written to the file on each of 2 runs of the program):

  • Synset('allow.v.04')   Synset('boastfully.r.01')   Synset('clear.v.11')   Synset('code.n.02')

  • Synset('boastfully.r.01')   Synset('clear.v.19')   Synset('code.n.01')   Synset('design.n.04')

Solution: I found what my problem was. After reinstalling Python 2.7, all the problems went away. So: don't use Python 3.x with the Lesk algorithm.
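
A plausible explanation, though it is an assumption on my part rather than something stated in the thread: Python 3.3+ turns hash randomization on by default, so the iteration order of sets and dicts of strings changes from one interpreter run to the next, and any tie-breaking inside lesk that walks an unordered collection (as well as `set(SynSets)` here) can then come out differently on every run. Python 2.7 ships with hash randomization off, which would explain why reinstalling it "fixed" the issue. A minimal sketch of the effect, and a way to test the hypothesis without changing interpreters:

    # set iteration order of strings varies across Python 3 runs
    # because of hash randomization (on by default since Python 3.3)
    tokens = {"python", "code", "language", "design", "syntax"}
    print(list(tokens))  # order may differ on every run of the interpreter

    # disabling randomization should make the runs reproducible:
    #   PYTHONHASHSEED=0 python3 script.py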

1 Answer:

Answer 0 (score: 2):

The latest version of NLTK includes the Lesk algorithm in its wsd module:

>>> from nltk.wsd import lesk
>>> from nltk import sent_tokenize, word_tokenize
>>> text = "Python is a widely used general-purpose, high-level programming language. Its design philosophy emphasizes code readability, and its syntax allows programmers to express concepts in fewer lines of code than would be possible in languages such as C++ or Java. The language provides constructs intended to enable clear programs on both a small and large scale. Python supports multiple programming paradigms, including object-oriented, imperative and functional programming or procedural styles."
>>> for sent in sent_tokenize(text):
...     for word in word_tokenize(sent):
...             print word, lesk(sent, word), sent

[OUT]:

Python Synset('python.n.02') Python is a widely used general-purpose, high-level programming language.
is Synset('be.v.08') Python is a widely used general-purpose, high-level programming language.
a Synset('angstrom.n.01') Python is a widely used general-purpose, high-level programming language.
widely Synset('wide.r.04') Python is a widely used general-purpose, high-level programming language.
used Synset('use.v.01') Python is a widely used general-purpose, high-level programming language.
general-purpose None Python is a widely used general-purpose, high-level programming language.
, None Python is a widely used general-purpose, high-level programming language.

Also, try disambiguate() from pywsd:

>>> from pywsd import disambiguate
>>> from nltk import sent_tokenize
>>> text = "Python is a widely used general-purpose, high-level programming language. Its design philosophy emphasizes code readability, and its syntax allows programmers to express concepts in fewer lines of code than would be possible in languages such as C++ or Java. The language provides constructs intended to enable clear programs on both a small and large scale. Python supports multiple programming paradigms, including object-oriented, imperative and functional programming or procedural styles."
>>> for sent in sent_tokenize(text):
...     print disambiguate(sent, prefersNone=True)
...

[OUT]:

[('Python', Synset('python.n.02')), ('is', None), ('a', None), ('widely', Synset('widely.r.03')), ('used', Synset('used.a.01')), ('general-purpose', None), (',', None), ('high-level', None), ('programming', Synset('scheduling.n.01')), ('language', Synset('terminology.n.01')), ('.', None)]
[('Its', None), ('design', Synset('purpose.n.01')), ('philosophy', Synset('philosophy.n.03')), ('emphasizes', Synset('stress.v.01')), ('code', Synset('code.n.03')), ('readability', Synset('readability.n.01')), (',', None), ('and', None), ('its', None), ('syntax', Synset('syntax.n.03')), ('allows', Synset('let.v.01')), ('programmers', Synset('programmer.n.01')), ('to', None), ('express', Synset('express.n.03')), ('concepts', Synset('concept.n.01')), ('in', None), ('fewer', None), ('lines', Synset('wrinkle.n.01')), ('of', None), ('code', Synset('code.n.03')), ('than', None), ('would', None), ('be', None), ('possible', Synset('potential.a.01')), ('in', None), ('languages', Synset('linguistic_process.n.02')), ('such', None), ('as', None), ('C++', None), ('or', None), ('Java', Synset('java.n.03')), ('.', None)]
[('The', None), ('language', Synset('language.n.01')), ('provides', Synset('provide.v.06')), ('constructs', Synset('concept.n.01')), ('intended', Synset('mean.v.03')), ('to', None), ('enable', None), ('clear', Synset('open.n.01')), ('programs', Synset('program.n.08')), ('on', None), ('both', None), ('a', None), ('small', Synset('small.a.01')), ('and', None), ('large', Synset('large.a.01')), ('scale', Synset('scale.n.10')), ('.', None)]
[('Python', Synset('python.n.02')), ('supports', Synset('support.n.11')), ('multiple', None), ('programming', Synset('program.v.02')), ('paradigms', Synset('substitution_class.n.01')), (',', None), ('including', Synset('include.v.03')), ('object-oriented', None), (',', None), ('imperative', Synset('imperative.a.02')), ('and', None), ('functional', Synset('functional.a.01')), ('programming', Synset('scheduling.n.01')), ('or', None), ('procedural', Synset('procedural.a.01')), ('styles', Synset('vogue.n.01')), ('.', None)]

You can also verify that lesk returns the same synsets when called repeatedly on the same input within a single run (no AssertionError should be raised):

from nltk.wsd import lesk
from nltk import sent_tokenize, word_tokenize
text = "Python is a widely used general-purpose, high-level programming language. Its design philosophy emphasizes code readability, and its syntax allows programmers to express concepts in fewer lines of code than would be possible in languages such as C++ or Java. The language provides constructs intended to enable clear programs on both a small and large scale. Python supports multiple programming paradigms, including object-oriented, imperative and functional programming or procedural styles."

for sent in sent_tokenize(text):
    # disambiguate every word in the sentence once...
    lst = [lesk(sent, word) for word in word_tokenize(sent)]
    # ...then 10 more times; each pass must return exactly the same synsets
    for i in range(10):
        lst2 = [lesk(sent, word) for word in word_tokenize(sent)]
        assert lst2 == lst

They are not perfect, but they are close to an accurate implementation of Lesk.
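
For reference, the core of what an accurate Lesk implementation computes is just the overlap between the context words and each sense's dictionary gloss. Here is a minimal simplified-Lesk sketch (my own illustration, not NLTK's or pywsd's actual code), assuming WordNet is downloaded:

from nltk.corpus import wordnet as wn

def simple_lesk(context_words, word):
    # pick the sense whose gloss shares the most words with the context
    context = set(w.lower() for w in context_words)
    best, best_overlap = None, -1
    for ss in wn.synsets(word):
        overlap = len(context & set(ss.definition().lower().split()))
        if overlap > best_overlap:
            best, best_overlap = ss, overlap
    return best

print(simple_lesk("Python is a programming language".split(), "python"))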


EDITED

To make sure that the results are the same on every run, there should be no STDOUT (i.e. no failed assertions) when you run this:

import nltk
from nltk.tokenize import sent_tokenize
from nltk import word_tokenize
from nltk import wsd
from nltk.corpus import stopwords

def run():
    SynSets = []
    sentences = sent_tokenize("Python is a widely used general-purpose, high-level programming language.\
        Its design philosophy emphasizes code readability, and its syntax allows programmers to express concepts in fewer lines of code than would be possible in languages such as C++ or Java.\
        The language provides constructs intended to enable clear programs on both a small and large scale.\
        Python supports multiple programming paradigms, including object-oriented, imperative and functional programming or procedural styles.\
        ")
    stopwordsList = stopwords.words('english')

    for sentence in sentences:
        raw_tokens = word_tokenize(sentence)  # alternatively: WordPunctTokenizer().tokenize(sentence)
        # drop stopwords and purely numeric tokens (the length filter is commented out)
        final_tokens = [token.lower() for token in raw_tokens
                        if token not in stopwordsList
                        #and len(token) > 3
                        and not token.isdigit()]
        for token in final_tokens:
            synset = wsd.lesk(sentence, token)
            if synset is not None:
                SynSets.append(synset)
    return sorted(set(SynSets))

run1 = run()

for i in range(10):
    assert run1 == run()

I ran the OP's code 10 times, and the results were the same every time.
