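I am using the Lesk algorithm to get synsets from a text, but I get different results for the same input. Is this a "feature" of the Lesk algorithm, or am I doing something wrong? Here is the code I am using: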
self.SynSets = []
sentences = sent_tokenize("Python is a widely used general-purpose, high-level programming language.\
    Its design philosophy emphasizes code readability, and its syntax allows programmers to express concepts in fewer lines of code than would be possible in languages such as C++ or Java.\
    The language provides constructs intended to enable clear programs on both a small and large scale.\
    Python supports multiple programming paradigms, including object-oriented, imperative and functional programming or procedural styles.\
    ")
stopwordsList = stopwords.words('english')
self.sentNum = 0
for sentence in sentences:
    raw_tokens = word_tokenize(sentence)
    final_tokens = [token.lower() for token in raw_tokens
                    if token not in stopwordsList
                    # and (len(token) > 3)
                    and not token.isdigit()]
    for token in final_tokens:
        synset = wsd.lesk(sentence, token)
        if synset is not None:
            self.SynSets.append(synset)
self.SynSets = set(self.SynSets)
self.WriteSynSets()
return self
In the output I get the following (the first 3 results from 2 different runs):
Synset('allow.v.09') Synset('code.n.03') Synset('coffee.n.01')
------------
Synset('allow.v.09') Synset('argumentation.n.02') Synset('boastfully.r.01')
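If there is another (more stable) way to get synsets, I would be grateful for your help.
Thanks in advance.
EDITED
As a further example, here is the complete script, which I ran twice: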
import nltk
from nltk.tokenize import sent_tokenize
from nltk import word_tokenize
from nltk import wsd
from nltk.corpus import stopwords
SynSets = []
sentences = sent_tokenize("Python is a widely used general-purpose, high-level programming language.\
    Its design philosophy emphasizes code readability, and its syntax allows programmers to express concepts in fewer lines of code than would be possible in languages such as C++ or Java.\
    The language provides constructs intended to enable clear programs on both a small and large scale.\
    Python supports multiple programming paradigms, including object-oriented, imperative and functional programming or procedural styles.\
    ")
stopwordsList = stopwords.words('english')
for sentence in sentences:
    raw_tokens = word_tokenize(sentence)  # WordPunctTokenizer().tokenize(sentence)
    # removing stopwords and words smaller than 3 characters
    final_tokens = [token.lower() for token in raw_tokens
                    if token not in stopwordsList
                    # and (len(token) > 3)
                    and not token.isdigit()]
    for token in final_tokens:
        synset = wsd.lesk(sentence, token)
        if synset is not None:
            SynSets.append(synset)
SynSets = set(SynSets)
SynSets = sorted(SynSets)
# the with block closes the file, so no explicit close() is needed
with open("synsets.txt", "a") as file:
    file.write("\n-------------------\n")
    for synset in SynSets:
        file.write("{} ".format(synset))
I got these results (the first 4 synsets written to the file on each of the 2 runs):
Synset('allow.v.04') Synset('boastfully.r.01') Synset('clear.v.11') Synset('code.n.02')
Synset('boastfully.r.01') Synset('clear.v.19') Synset('code.n.01') Synset('design.n.04')
SOLUTION: I found out what my problem was. After reinstalling Python 2.7, all the problems disappeared. So, do not use Python 3.x with the Lesk algorithm.
Answer 0 (score: 2)
The latest version of NLTK has a wsd function that implements the Lesk algorithm:
>>> from nltk.wsd import lesk
>>> from nltk import sent_tokenize, word_tokenize
>>> text = "Python is a widely used general-purpose, high-level programming language. Its design philosophy emphasizes code readability, and its syntax allows programmers to express concepts in fewer lines of code than would be possible in languages such as C++ or Java. The language provides constructs intended to enable clear programs on both a small and large scale. Python supports multiple programming paradigms, including object-oriented, imperative and functional programming or procedural styles."
>>> for sent in sent_tokenize(text):
...     for word in word_tokenize(sent):
...         print(word, lesk(sent, word), sent)
...
[OUT]:
Python Synset('python.n.02') Python is a widely used general-purpose, high-level programming language.
is Synset('be.v.08') Python is a widely used general-purpose, high-level programming language.
a Synset('angstrom.n.01') Python is a widely used general-purpose, high-level programming language.
widely Synset('wide.r.04') Python is a widely used general-purpose, high-level programming language.
used Synset('use.v.01') Python is a widely used general-purpose, high-level programming language.
general-purpose None Python is a widely used general-purpose, high-level programming language.
, None Python is a widely used general-purpose, high-level programming language.
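To make it clearer what a Lesk-style function computes, here is a minimal sketch of the simplified Lesk idea: pick the sense whose gloss shares the most words with the context. It does not use NLTK or WordNet. The sense names and glosses in `TOY_SENSES` are invented for illustration, and `simplified_lesk` is a hypothetical helper, not NLTK's implementation. Candidates are visited in sorted order so that ties are always broken the same way.

```python
# Toy sense inventory: names and glosses are made up for this example.
TOY_SENSES = {
    "python": {
        "python.snake": "large nonvenomous snake that kills prey by constriction",
        "python.language": "high-level programming language used for general-purpose code",
    },
}


def simplified_lesk(context_tokens, word, senses=TOY_SENSES):
    """Return the sense name whose gloss overlaps most with the context tokens."""
    context = {tok.lower() for tok in context_tokens}
    candidates = senses.get(word.lower(), {})
    best_name, best_overlap = None, -1
    # Sorted iteration gives a fixed tie-breaking order across runs.
    for name in sorted(candidates):
        overlap = len(context & set(candidates[name].split()))
        if overlap > best_overlap:
            best_name, best_overlap = name, overlap
    return best_name


sentence = "Python is a widely used general-purpose, high-level programming language."
# Crude whitespace tokenization, enough for the sketch.
tokens = sentence.replace(",", " ").replace(".", " ").split()

print(simplified_lesk(tokens, "Python"))  # python.language
print(simplified_lesk([], "python"))      # tie at zero overlap: python.language
```

Note that the context here is a list of tokens. NLTK's documentation describes the first argument of `lesk` as an iterable of words, so it may be worth passing `word_tokenize(sent)` rather than the raw string; iterating over a raw string yields single characters, which leaves little to overlap with and makes ties more likely.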
Also, try disambiguate() from pywsd:
>>> from pywsd import disambiguate
>>> from nltk import sent_tokenize
>>> text = "Python is a widely used general-purpose, high-level programming language. Its design philosophy emphasizes code readability, and its syntax allows programmers to express concepts in fewer lines of code than would be possible in languages such as C++ or Java. The language provides constructs intended to enable clear programs on both a small and large scale. Python supports multiple programming paradigms, including object-oriented, imperative and functional programming or procedural styles."
>>> for sent in sent_tokenize(text):
...     print(disambiguate(sent, prefersNone=True))
...
[OUT]:
[('Python', Synset('python.n.02')), ('is', None), ('a', None), ('widely', Synset('widely.r.03')), ('used', Synset('used.a.01')), ('general-purpose', None), (',', None), ('high-level', None), ('programming', Synset('scheduling.n.01')), ('language', Synset('terminology.n.01')), ('.', None)]
[('Its', None), ('design', Synset('purpose.n.01')), ('philosophy', Synset('philosophy.n.03')), ('emphasizes', Synset('stress.v.01')), ('code', Synset('code.n.03')), ('readability', Synset('readability.n.01')), (',', None), ('and', None), ('its', None), ('syntax', Synset('syntax.n.03')), ('allows', Synset('let.v.01')), ('programmers', Synset('programmer.n.01')), ('to', None), ('express', Synset('express.n.03')), ('concepts', Synset('concept.n.01')), ('in', None), ('fewer', None), ('lines', Synset('wrinkle.n.01')), ('of', None), ('code', Synset('code.n.03')), ('than', None), ('would', None), ('be', None), ('possible', Synset('potential.a.01')), ('in', None), ('languages', Synset('linguistic_process.n.02')), ('such', None), ('as', None), ('C++', None), ('or', None), ('Java', Synset('java.n.03')), ('.', None)]
[('The', None), ('language', Synset('language.n.01')), ('provides', Synset('provide.v.06')), ('constructs', Synset('concept.n.01')), ('intended', Synset('mean.v.03')), ('to', None), ('enable', None), ('clear', Synset('open.n.01')), ('programs', Synset('program.n.08')), ('on', None), ('both', None), ('a', None), ('small', Synset('small.a.01')), ('and', None), ('large', Synset('large.a.01')), ('scale', Synset('scale.n.10')), ('.', None)]
[('Python', Synset('python.n.02')), ('supports', Synset('support.n.11')), ('multiple', None), ('programming', Synset('program.v.02')), ('paradigms', Synset('substitution_class.n.01')), (',', None), ('including', Synset('include.v.03')), ('object-oriented', None), (',', None), ('imperative', Synset('imperative.a.02')), ('and', None), ('functional', Synset('functional.a.01')), ('programming', Synset('scheduling.n.01')), ('or', None), ('procedural', Synset('procedural.a.01')), ('styles', Synset('vogue.n.01')), ('.', None)]
They are not perfect, but they are close to an accurate implementation of Lesk.
EDITED
To make sure the results are the same every time you run it, there should be no output on STDOUT when you do this:
from nltk.wsd import lesk
from nltk import sent_tokenize, word_tokenize
text = "Python is a widely used general-purpose, high-level programming language. Its design philosophy emphasizes code readability, and its syntax allows programmers to express concepts in fewer lines of code than would be possible in languages such as C++ or Java. The language provides constructs intended to enable clear programs on both a small and large scale. Python supports multiple programming paradigms, including object-oriented, imperative and functional programming or procedural styles."
for sent in sent_tokenize(text):
    # reference result for this sentence
    lst = []
    for word in word_tokenize(sent):
        lst.append(lesk(sent, word))
    # repeat 10 times and compare against the reference
    for i in range(10):
        lst2 = []
        for word in word_tokenize(sent):
            lst2.append(lesk(sent, word))
        assert lst2 == lst
The OP's script, wrapped in a function and checked the same way:
import nltk
from nltk.tokenize import sent_tokenize
from nltk import word_tokenize
from nltk import wsd
from nltk.corpus import stopwords
def run():
    SynSets = []
    sentences = sent_tokenize("Python is a widely used general-purpose, high-level programming language.\
    Its design philosophy emphasizes code readability, and its syntax allows programmers to express concepts in fewer lines of code than would be possible in languages such as C++ or Java.\
    The language provides constructs intended to enable clear programs on both a small and large scale.\
    Python supports multiple programming paradigms, including object-oriented, imperative and functional programming or procedural styles.\
    ")
    stopwordsList = stopwords.words('english')
    for sentence in sentences:
        raw_tokens = word_tokenize(sentence)  # WordPunctTokenizer().tokenize(sentence)
        # removing stopwords and words smaller than 3 characters
        final_tokens = [token.lower() for token in raw_tokens
                        if token not in stopwordsList
                        # and (len(token) > 3)
                        and not token.isdigit()]
        for token in final_tokens:
            synset = wsd.lesk(sentence, token)
            if synset is not None:
                SynSets.append(synset)
    return sorted(set(SynSets))
run1 = run()
for i in range(10):
    assert run1 == run()
I ran the OP's code 10 times as well, and it gives the same results.
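One caveat about repeating a check inside a single interpreter session: it cannot reveal differences that only appear between separate program launches, which is what the question describes. A possible explanation, offered here as an assumption and not something the thread confirms, is string hash randomization: it is on by default in Python 3.3 and later and off by default in Python 2.7, and it can change the iteration order of sets and similar hash-based containers from one launch to the next. The sketch below uses only the standard library. The word list and the `order_under_seed` helper are made up for illustration. It starts fresh interpreters with different `PYTHONHASHSEED` values and prints the set order each one sees.

```python
import os
import subprocess
import sys

# Child program: print the iteration order of a small set of strings.
CHILD = "print(list({'allow', 'code', 'design', 'clear', 'language'}))"


def order_under_seed(seed):
    """Run CHILD in a fresh interpreter with PYTHONHASHSEED set to `seed`."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", CHILD],
        env=env, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


# Different seeds may (but need not) produce different orders.
orders = {seed: order_under_seed(seed) for seed in (1, 2, 3)}
for seed, order in orders.items():
    print(seed, order)

# The same seed in a new process reproduces the same order.
repeat = order_under_seed(1)
print(repeat == orders[1])  # True
```

If this is indeed the cause, launching the script with a fixed seed (for example `PYTHONHASHSEED=0`) should make separate runs comparable. Any code that picks a "best" sense among tied candidates would still need an explicit tie-breaking rule to be stable without that setting.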