Converting WordNet files to .txt

Time: 2014-05-27 19:54:42

Tags: file text nlp wordnet

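I need to convert the WordNet database files (noun.shape, noun.state, verb.cognition, etc.) from their custom extensions to .txt, so that I can more easily extract their nouns, verbs, adjectives, and adverbs by their custom categories. In other words, the "DATABASE FILES ONLY" download contains the files I am looking for, but unfortunately they have extensions such as .STATE or .SHAPE. They are readable in Notepad, but I need a list of all the items in these files, without the definitions in parentheses.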

1 answer:

Answer 0 (score: 1)

If you are only using WordNet as a dictionary, you could try the Open Multilingual WordNet; see http://compling.hss.ntu.edu.sg/omw/

import os, codecs

from nltk.corpus import wordnet as wn

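# Read an Open Multilingual WordNet .tab file into a dict.
# option="ss": synset ID -> words; anything else: word -> synset IDs.
# Multiple values for the same key are joined with ";".
def readWNfile(wnfile, option="ss"):
  entries = {}
  with codecs.open(wnfile, "r", "utf8") as reader:
    for l in reader:
      if l[0] == "#": continue  # skip the header/comment line
      fields = l.rstrip("\r\n").split("\t")
      if len(fields) < 3: continue  # skip blank or malformed lines
      if option == "ss":
        k, v = fields[0], fields[2]  # synset ID as key, word as value
      else:
        k, v = fields[2], fields[0]  # word as key, synset ID as value
      if k in entries:
        entries[k] = entries[k] + ";" + v
      else:
        entries[k] = v
  return entries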

if not os.path.exists('msa/wn-data-zsm.tab'):
    os.system('wget http://compling.hss.ntu.edu.sg/omw/wns/zsm.zip')
    os.system('unzip zsm.zip')

msa_wn = readWNfile('msa/wn-data-zsm.tab')
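# In NLTK 3, offset() and pos() are methods; build IDs such as "00001740-n".
eng_wn_keys = {str(i.offset()).zfill(8) + '-' + i.pos(): i for i in wn.all_synsets()}

# Print each English synset next to the Malay words that share its ID.
for i in set(eng_wn_keys).intersection(msa_wn.keys()):
    print(eng_wn_keys[i], msa_wn[i])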
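
To make the mapping built by the reader easier to picture, here is a minimal, self-contained sketch that parses a tiny in-memory sample instead of the downloaded file. The sample rows, the placeholder words, and the `read_tab` helper are made up for illustration; the only thing taken from the script above is the assumption that column 1 holds the synset ID and column 3 holds the word. It collects values in lists rather than a ";"-joined string, which is a design choice of this sketch, not of the original script.

```python
import io
from collections import defaultdict

# Hypothetical three-column sample: synset ID, entry type, word.
# The rows and the placeholder words are invented for illustration.
SAMPLE_TAB = (
    "# Sample WordNet\tzsm\thttp://example.org\tsome-license\n"
    "00001740-n\tlemma\tword_a\n"
    "00001740-n\tlemma\tword_b\n"
    "00002137-n\tlemma\tword_c\n"
)

def read_tab(lines, by_synset=True):
    """Group words by synset ID (or synset IDs by word), skipping '#' lines."""
    table = defaultdict(list)
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # header, comment, or blank line
        fields = line.rstrip("\r\n").split("\t")
        synset_id, word = fields[0], fields[2]
        if by_synset:
            table[synset_id].append(word)
        else:
            table[word].append(synset_id)
    return dict(table)

by_synset = read_tab(io.StringIO(SAMPLE_TAB))
by_word = read_tab(io.StringIO(SAMPLE_TAB), by_synset=False)
print(by_synset["00001740-n"])  # ['word_a', 'word_b']
print(by_word["word_c"])        # ['00002137-n']
```

Keeping lists avoids having to split on ";" later if a word itself ever contains that character.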

In the meantime, hang tight: the NLTK developers will soon be putting together an Open Multilingual Wordnet API; see line 1048 of https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py
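
For the original goal (one plain word list per category such as noun.shape, with no definitions), a possible route is to skip parsing the raw files and group NLTK's synsets by their lexicographer file name instead. The sketch below assumes NLTK 3, where `Synset.lexname()` returns names like `noun.shape` and `Synset.lemma_names()` returns the words. The helper names `words_by_lexname`, `write_word_lists`, and `export_all`, and the `wordlists` output directory, are hypothetical.

```python
import os
from collections import defaultdict

def words_by_lexname(synsets):
    """Map each lexicographer file name to a sorted list of its lemma names."""
    groups = defaultdict(set)
    for synset in synsets:
        groups[synset.lexname()].update(synset.lemma_names())
    return {name: sorted(words) for name, words in groups.items()}

def write_word_lists(groups, out_dir):
    """Write one <lexname>.txt file per group, one word per line."""
    os.makedirs(out_dir, exist_ok=True)
    for name, words in groups.items():
        path = os.path.join(out_dir, name + ".txt")
        with open(path, "w", encoding="utf8") as handle:
            handle.write("\n".join(words) + "\n")

def export_all(out_dir="wordlists"):
    """Hypothetical entry point; needs NLTK and its WordNet corpus installed."""
    from nltk.corpus import wordnet as wn  # imported lazily on purpose
    write_word_lists(words_by_lexname(wn.all_synsets()), out_dir)
```

Calling `export_all()` should then produce files such as `wordlists/noun.shape.txt`. Multi-word entries keep WordNet's underscores (for example `round_shape`), so replace them with spaces if needed.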