I am trying to get synonyms for some words, but the lookup fails for certain words. Here is the code:
from pattern.en import wordnet as wn
def foo():
    ss = 'man'
    s = wn.synsets(ss)[0]
    name = [item for item in [str(x) for x in s.synonyms]]
    print name
foo()
If I try words such as "pregnant" or "ugly", I get this error:
IndexError: list index out of range
What could be the problem?
Answer 0 (score: 1)
There seems to be a discrepancy between the NLTK WordNet interface and the Pattern WordNet interface. The official Princeton WordNet lists 13 synsets for "man", see http://wordnetweb.princeton.edu/perl/webwn?s=man&sub=Search+WordNet&o2=&o0=1&o8=1&o1=1&o7=&o5=&o9=&o6=&o3=&o4=&h=
NLTK returns all 13, while Pattern returns only 11 by default:
>>> from nltk.corpus import wordnet as wn
>>> from pattern.en import wordnet as pwn
>>> wn.synsets('man')
[Synset('man.n.01'), Synset('serviceman.n.01'), Synset('man.n.03'), Synset('homo.n.02'), Synset('man.n.05'), Synset('man.n.06'), Synset('valet.n.01'), Synset('man.n.08'), Synset('man.n.09'), Synset('man.n.10'), Synset('world.n.08'), Synset('man.v.01'), Synset('man.v.02')]
>>> pwn.synsets('man')
[Synset(u'man'), Synset(u'serviceman'), Synset(u'man'), Synset(u'homo'), Synset(u'man'), Synset(u'man'), Synset(u'valet'), Synset(u'man'), Synset(u'Man'), Synset(u'man'), Synset(u'world')]
>>> len(wn.synsets('man'))
13
>>> len(pwn.synsets('man'))
11
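Looking at the pattern code, the difference seems to be related to the default POS being set to noun (from https://github.com/clips/pattern/blob/master/pattern/text/en/wordnet/init.py#L93).
Q: Is the pattern.en wordnet limited?
A: No. The 2 missing synsets are found once the verb POS is requested. There is a "gotcha" with the POS parameter, though: the pattern library rejects strings such as 'n' or 'NOUN' and accepts tags such as 'nn' and 'vb':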
>>> pwn.synsets('man', pos='n')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/pattern/text/en/wordnet/__init__.py", line 109, in synsets
    raise TypeError, "part of speech must be NOUN, VERB, ADJECTIVE or ADVERB, not %s" % repr(pos)
TypeError: part of speech must be NOUN, VERB, ADJECTIVE or ADVERB, not 'n'
>>> pwn.synsets('man', pos='NOUN')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/pattern/text/en/wordnet/__init__.py", line 109, in synsets
    raise TypeError, "part of speech must be NOUN, VERB, ADJECTIVE or ADVERB, not %s" % repr(pos)
TypeError: part of speech must be NOUN, VERB, ADJECTIVE or ADVERB, not 'noun'
>>> pwn.synsets('man', pos='nn')
[Synset(u'man'), Synset(u'serviceman'), Synset(u'man'), Synset(u'homo'), Synset(u'man'), Synset(u'man'), Synset(u'valet'), Synset(u'man'), Synset(u'Man'), Synset(u'man'), Synset(u'world')]
>>> pwn.synsets('man', pos='vb')
[Synset(u'man'), Synset(u'man')]
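One way to sidestep the gotcha shown in the session above is to translate NLTK-style one-letter POS codes into the two-letter tags that Pattern accepted in this answer. The following is only a minimal sketch: the helper name `to_pattern_pos` and the mapping table are illustrative assumptions and are not part of either library. Only the tags 'nn', 'vb', 'jj' and 'rb' come from the sessions in this answer.

```python
# Hypothetical helper, not part of pattern or nltk.
# Maps NLTK-style POS letters to the tag strings Pattern accepted in this answer.
NLTK_TO_PATTERN_POS = {
    'n': 'nn',  # noun
    'v': 'vb',  # verb
    'a': 'jj',  # adjective
    's': 'jj',  # satellite adjective (NLTK-specific), treated as adjective here
    'r': 'rb',  # adverb
}


def to_pattern_pos(nltk_pos):
    """Return the Pattern-style tag for an NLTK-style POS letter."""
    try:
        return NLTK_TO_PATTERN_POS[nltk_pos.lower()]
    except KeyError:
        raise ValueError("unknown POS code: %r" % (nltk_pos,))


print(to_pattern_pos('n'))
print(to_pattern_pos('v'))
```

The returned tag could then be passed as the pos argument, as in the pos='nn' and pos='vb' calls above.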
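Q: So why am I getting the odd IndexError?
A: Given the checks above, NLTK and Pattern use the same Princeton WordNet 3.0, so the data itself should not be the problem. When using the WordNet API in pattern, you need to specify the POS if it is not a noun; otherwise the lookup can return an empty list, and indexing it with [0] raises IndexError. For example:
>>> from pattern.en import wordnet as wn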
>>> wn.synsets('pregnant', pos='jj')
[Synset(u'pregnant'), Synset(u'meaning'), Synset(u'fraught')]
>>> wn.synsets('pregnant')
[]
>>> wn.synsets('quickly', pos='rb')
[Synset(u'quickly'), Synset(u'promptly'), Synset(u'cursorily')]
>>> wn.synsets('quickly')
[]
>>> wn.synsets('run', pos='nn')
[Synset(u'run'), Synset(u'test'), Synset(u'footrace'), Synset(u'streak'), Synset(u'run'), Synset(u'run'), Synset(u'run'), Synset(u'run'), Synset(u'run'), Synset(u'run'), Synset(u'rivulet'), Synset(u'political campaign'), Synset(u'run'), Synset(u'discharge'), Synset(u'run'), Synset(u'run')]
>>> wn.synsets('run', pos='vb')
[Synset(u'run'), Synset(u'scat'), Synset(u'run'), Synset(u'operate'), Synset(u'run'), Synset(u'run'), Synset(u'function'), Synset(u'range'), Synset(u'campaign'), Synset(u'play'), Synset(u'run'), Synset(u'tend'), Synset(u'run'), Synset(u'run'), Synset(u'run'), Synset(u'run'), Synset(u'prevail'), Synset(u'run'), Synset(u'run'), Synset(u'carry'), Synset(u'run'), Synset(u'guide'), Synset(u'run'), Synset(u'run'), Synset(u'run'), Synset(u'run'), Synset(u'run'), Synset(u'run'), Synset(u'run'), Synset(u'run'), Synset(u'run'), Synset(u'run'), Synset(u'run'), Synset(u'run'), Synset(u'ply'), Synset(u'hunt'), Synset(u'race'), Synset(u'move'), Synset(u'melt'), Synset(u'ladder'), Synset(u'run')]
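The empty lists in the session above are what trigger the IndexError from the question: wn.synsets(ss)[0] indexes into []. A minimal sketch of a guard follows. Note the assumptions: `first_synset` is a hypothetical helper, and `fake_synsets` is a stand-in built from a few results shown in this answer so that the example runs without Pattern installed. In real use you would pass the synsets function from pattern.en.wordnet instead.

```python
def first_synset(lookup, word, pos_tags=('nn', 'vb', 'jj', 'rb')):
    """Try each POS tag in turn; return the first synset found, or None."""
    for pos in pos_tags:
        results = lookup(word, pos=pos)
        if results:  # an empty list means no synsets for this POS
            return results[0]
    return None


# Stand-in for the real lookup (an assumption for illustration only),
# filled with a few results shown in this answer.
_FAKE_DATA = {
    ('pregnant', 'jj'): ['pregnant', 'meaning', 'fraught'],
    ('quickly', 'rb'): ['quickly', 'promptly', 'cursorily'],
}


def fake_synsets(word, pos='nn'):
    """Mimic the observed behaviour: unknown (word, pos) pairs give []."""
    return _FAKE_DATA.get((word, pos), [])


print(first_synset(fake_synsets, 'pregnant'))
print(first_synset(fake_synsets, 'nosuchword'))
```

Returning None instead of indexing blindly lets the caller decide what to do when a word has no synsets under any POS.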
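If specifying the POS does not help, something may have gone wrong while downloading or installing pattern; try reinstalling:
pip install -U pattern
Q: Is pattern faster than nltk for wordnet access?
A: As for speed, both nltk and pattern store the synsets in a dictionary for lookup, so I think retrieval from the dictionary is equivalent. There may be some overhead when loading nltk and the wordnet corpus, so at most that is where any difference would show up.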
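To check the speed question by measurement rather than by reasoning, repeated lookups can be timed with the standard timeit module. This is a rough sketch under stated assumptions: `time_lookups` is a hypothetical helper, and a plain dict stands in for the real synsets function so the example runs on its own. To compare the libraries you would pass each library's lookup function instead, after one warm-up call so that corpus loading is not counted.

```python
import timeit


def time_lookups(lookup, words, repeat=1000):
    """Return total seconds spent calling lookup(word) for every word, `repeat` times."""
    def run():
        for w in words:
            lookup(w)
    return timeit.timeit(run, number=repeat)


# Stand-in lookup backed by a plain dict (illustration only).
table = {'man': ['man', 'serviceman'], 'run': ['run', 'test']}
elapsed = time_lookups(lambda w: table.get(w, []), ['man', 'run', 'pregnant'])
print("%.6f seconds" % elapsed)
```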