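I need to convert the WordNet database files (noun.shape, noun.state, verb.cognition, etc.) from their custom extensions to .txt, so that I can more easily extract the nouns, verbs, adjectives and adverbs in their own categories. In other words, under "DATABASE FILES ONLY" you will find the files I am looking for; unfortunately, they have .STATE or .SHAPE extensions. They are readable in Notepad, but I need a list of all the items in these files, without the definitions in parentheses.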
Answer 0 (score: 1)
If you are only using WordNet as a dictionary, you can try the Open Multilingual WordNet; see http://compling.hss.ntu.edu.sg/omw/
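import os
from nltk.corpus import wordnet as wn

# Read an Open Multilingual WordNet .tab file into a dict.
# option="ss": synset ID -> words; anything else: word -> synset IDs.
# Several values for the same key are joined with ";".
def readWNfile(wnfile, option="ss"):
    mapping = {}  # not named "wn", to avoid shadowing the NLTK import
    with open(wnfile, encoding="utf8") as reader:
        for l in reader:
            if l.startswith("#") or not l.strip():  # skip header and blank lines
                continue
            fields = l.rstrip("\n").split("\t")
            if option == "ss":
                k, v = fields[0], fields[2]  # synset ID as key, word as value
            else:
                k, v = fields[2], fields[0]  # word as key, synset ID as value
            if k in mapping:
                mapping[k] += ";" + v
            else:
                mapping[k] = v
    return mapping

# Download and unpack the Malay (zsm) wordnet if it is not already present
if not os.path.exists('msa/wn-data-zsm.tab'):
    os.system('wget http://compling.hss.ntu.edu.sg/omw/wns/zsm.zip')
    os.system('unzip zsm.zip')

msa_wn = readWNfile('msa/wn-data-zsm.tab')

# Key each English synset by its "offset-pos" ID, e.g. "00001740-n".
# In NLTK 3, offset() and pos() are methods, not attributes.
eng_wn_keys = {str(i.offset()).zfill(8) + '-' + i.pos(): i for i in wn.all_synsets()}

for i in set(eng_wn_keys).intersection(msa_wn.keys()):
    print(eng_wn_keys[i], msa_wn[i])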
In the meantime, hang on for a while, because the NLTK developers will soon be putting together an Open Multilingual Wordnet API; see line 1048 of https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py
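For the goal stated in the question (one plain word list per lexicographer file such as noun.shape or verb.cognition, with no definitions), one possible route is to skip the raw database files and group NLTK's synsets by their `lexname()`. The sketch below is not part of the answer above: the helper names `words_by_lexname`, `write_word_lists` and `export_from_nltk`, and the output directory name, are made up for illustration, and it assumes NLTK 3 with the `wordnet` corpus already downloaded.

```python
import os
from collections import defaultdict


def words_by_lexname(synsets):
    """Group lemma names by lexicographer file name (e.g. 'noun.shape').

    `synsets` is any iterable of objects offering lexname() and
    lemma_names(), such as the synsets yielded by NLTK's wn.all_synsets().
    """
    groups = defaultdict(set)
    for synset in synsets:
        groups[synset.lexname()].update(synset.lemma_names())
    # Sort each group so the output files are stable between runs.
    return {name: sorted(words) for name, words in groups.items()}


def write_word_lists(groups, out_dir):
    """Write one '<lexname>.txt' file per group, one word per line."""
    os.makedirs(out_dir, exist_ok=True)
    for name, words in groups.items():
        path = os.path.join(out_dir, name + ".txt")
        with open(path, "w", encoding="utf8") as f:
            f.write("\n".join(words) + "\n")


def export_from_nltk(out_dir="wordnet_lists"):
    """Hypothetical convenience wrapper; assumes nltk.download('wordnet') was run."""
    from nltk.corpus import wordnet as wn  # imported lazily so the helpers work without NLTK
    write_word_lists(words_by_lexname(wn.all_synsets()), out_dir)
```

Calling `export_from_nltk()` should leave files such as `noun.shape.txt` and `verb.cognition.txt` in the output directory. Multi-word entries come back from `lemma_names()` with underscores in place of spaces, so replace those if you need plain phrases.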