I have a set of sense keys, e.g. "long%3:00:02::", from SemCor + OMSTI. How can I get the glosses? Is there a mapping file? Or can I use NLTK WordNet?
Answer 0 (score: 3)
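import re
from nltk.corpus import wordnet as wn

# lemma % ss_type:lex_filenum:lex_id:head_word:head_id
sense_key_regex = r"(.*)\%(.*):(.*):(.*):(.*):(.*)"
# ss_type digit -> part-of-speech letter used in NLTK synset names
synset_types = {1:'n', 2:'v', 3:'a', 4:'r', 5:'s'}

def synset_from_sense_key(sense_key):
    lemma, ss_type, lex_num, lex_id, head_word, head_id = re.match(sense_key_regex, sense_key).groups()
    # Caveat: this treats lex_id as the sense number in the synset name, which is not guaranteed to hold.
    ss_idx = '.'.join([lemma, synset_types[int(ss_type)], lex_id])
    return wn.synset(ss_idx)

x = "long%3:00:02::"
synset_from_sense_key(x)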
NLTK has a rather obtuse function for this. However, it does not parse the sense key itself; it reads from the data_file_map files (e.g. "data.adj", "data.noun", etc.): https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L1355
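For reference, here is a minimal sketch of the built-in route. It assumes the function meant above is `lemma_from_key`, which NLTK's WordNet reader exposes; `Lemma.synset()` and `Synset.definition()` then give the synset and its gloss. The helper name `gloss_from_sense_key` and its `reader` parameter are my own, added only so the lookup can be exercised without the corpus installed.

```python
def gloss_from_sense_key(sense_key, reader=None):
    """Return the gloss (definition) of the synset a sense key belongs to.

    `reader` is a hypothetical injection point; by default the NLTK WordNet
    corpus reader is used, which needs nltk and the 'wordnet' corpus.
    """
    if reader is None:
        # Imported lazily so the module loads even without nltk installed.
        from nltk.corpus import wordnet as reader
    lemma = reader.lemma_from_key(sense_key)  # Lemma object for this key
    return lemma.synset().definition()        # gloss text of its synset
```

With NLTK and the WordNet corpus available, `gloss_from_sense_key("long%3:00:02::")` should return the gloss directly, which is what the question asks for. Depending on the NLTK version, a key may fail to resolve, so wrapping the call in error handling is probably sensible.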
Since NLTK already gives us an API that mere mortals can understand, and https://wordnet.princeton.edu/wordnet/man/senseidx.5WN.html provides some guidance:
A sense_key is represented as:
lemma % lex_sense
where lex_sense is encoded as:
ss_type:lex_filenum:lex_id:head_word:head_id
(yada, yada...)
The synset type is encoded as follows:
1 NOUN
2 VERB
3 ADJECTIVE
4 ADVERB
5 ADJECTIVE SATELLITE
We can do this with a regular expression (https://regex101.com/r/9KlVK7/1/):
>>> import re
>>> sense_key_regex = r"(.*)\%(.*):(.*):(.*):(.*):(.*)"
>>> x = "long%3:00:02::"
>>> re.match(sense_key_regex, x)
<_sre.SRE_Match object at 0x10061ad78>
>>> re.match(sense_key_regex, x).groups()
('long', '3', '00', '02', '', '')
>>> lemma, ss_type, lex_num, lex_id, head_word, head_id = re.match(sense_key_regex, x).groups()
>>> synset_types = {1:'n', 2:'v', 3:'a', 4:'r', 5:'s'}
>>> '.'.join([lemma, synset_types[int(ss_type)], lex_id])
'long.a.02'
Voilà, you get an NLTK Synset() object from the sense key =)
>>> from nltk.corpus import wordnet as wn
>>> idx = '.'.join([lemma, synset_types[int(ss_type)], lex_id])
>>> wn.synset(idx)
Synset('long.a.02')
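The all-wildcard pattern above accepts almost any string that has a `%` and four colons. Below is a sketch of a stricter parser for the same field layout. The names `SenseKey`, `SENSE_KEY_RE` and `parse_sense_key` are mine, and the fixed field widths (one digit for ss_type, two for lex_filenum and lex_id) are my reading of the senseidx page, so treat them as assumptions.

```python
import re
from typing import NamedTuple

# ss_type digit -> part-of-speech letter, as in the table quoted above
SS_TYPES = {1: "n", 2: "v", 3: "a", 4: "r", 5: "s"}

# lemma % ss_type : lex_filenum : lex_id : head_word : head_id
SENSE_KEY_RE = re.compile(
    r"^(?P<lemma>.+)%(?P<ss_type>[1-5]):(?P<lex_filenum>\d{2}):"
    r"(?P<lex_id>\d{2}):(?P<head_word>[^:]*):(?P<head_id>\d{0,2})$"
)

class SenseKey(NamedTuple):
    lemma: str
    pos: str
    lex_filenum: int
    lex_id: int
    head_word: str  # empty unless the sense is an adjective satellite
    head_id: str    # kept as text because it is usually empty

def parse_sense_key(sense_key):
    """Split a sense key into named fields, raising ValueError if malformed."""
    match = SENSE_KEY_RE.match(sense_key.strip())
    if match is None:
        raise ValueError(f"not a valid sense key: {sense_key!r}")
    return SenseKey(
        lemma=match["lemma"],
        pos=SS_TYPES[int(match["ss_type"])],
        lex_filenum=int(match["lex_filenum"]),
        lex_id=int(match["lex_id"]),
        head_word=match["head_word"],
        head_id=match["head_id"],
    )
```

Note that `lex_id` only distinguishes senses within a lexicographer file, so it is not necessarily the sense number that appears in an NLTK synset name such as 'long.a.02'. Parsing the key this way is useful for validation and for reading the part of speech, but resolving the key through WordNet itself is the safer way to reach the synset.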
Answer 1 (score: 0)
I solved this by downloading the glosstag files from http://wordnet.princeton.edu/glosstag.shtml and building my own mapping dictionary from the files in WordNet-3.0\glosstag\merged.
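A rough sketch of how such a mapping dictionary might be built. The element names used here (`synset`, `sk` for a sense key, `orig` for the original gloss text) are assumptions about the layout of the merged XML files and should be checked against the actual download; the function name `build_gloss_map` is hypothetical too.

```python
import xml.etree.ElementTree as ET

def build_gloss_map(xml_text, synset_tag="synset", key_tag="sk", gloss_tag="orig"):
    """Map every sense key found in the XML to the gloss of its synset.

    The default tag names are assumptions about the glosstag layout,
    not something confirmed by the answer above.
    """
    root = ET.fromstring(xml_text)
    gloss_map = {}
    for synset in root.iter(synset_tag):
        gloss_el = synset.find(f".//{gloss_tag}")
        if gloss_el is None:
            continue  # no gloss recorded for this synset
        gloss = "".join(gloss_el.itertext()).strip()
        # Every sense key listed under the synset shares the same gloss.
        for key_el in synset.iter(key_tag):
            if key_el.text:
                gloss_map[key_el.text.strip()] = gloss
    return gloss_map
```

For the real files, which are large, reading each one with `ET.parse(path).getroot()` or `ET.iterparse` is likely a better fit than loading the text into a string first.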