How can I use NLTK WordNet to check for incomplete words in Python?

Time: 2014-03-11 14:50:07

Tags: python nltk wordnet

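I have a set of words:

  {corporal, dog, cat, distingus, Company, phone, authority, vhicule, seats, lightweight, rules, resident, expertise}

I want to compute the semantic similarity between every pair of words in this set. I have a problem:

  1. Some of the words are incomplete, such as "vhicule". How can I ignore these words?

  2. Sample code, taken from "Python: Passing variables into Wordnet Synsets methods in NLTK":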

    import nltk.corpus as corpus
    import itertools as IT

    if __name__=="__main__":
        wordnet = corpus.wordnet
        list1 = ["apple", "honey", "drinks", "flowers", "paper"]
        list2 = ["pear", "shell", "movie", "fire", "tree"]

        for word1, word2 in IT.product(list1, list2):
            # Take the first synset of each word; this raises IndexError
            # when WordNet has no entry for the word.
            wordFromList1 = wordnet.synsets(word1)[0]
            wordFromList2 = wordnet.synsets(word2)[0]
            # In NLTK 3, Synset.name is a method, so it is called here.
            print('{w1}, {w2}: {s}'.format(
                w1 = wordFromList1.name(),
                w2 = wordFromList2.name(),
                s = wordFromList1.wup_similarity(wordFromList2)))
    

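    Suppose I add "vhicule" to either list. I then get the following error:

    IndexError: list index out of range

    How can I use this error to ignore words that do not exist in the database?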

1 answer:

Answer 0: (score: 3)

You can check whether nltk.corpus.wordnet.synsets(i) returns a non-empty list of synsets:

>>> from nltk.corpus import wordnet as wn
>>> x = [i.strip() for i in """corporal, dog, cat, distingus, Company, phone, authority, vhicule, seats, lightweight, rules, resident, expertise""".lower().split(",")]
>>> x
['corporal', 'dog', 'cat', 'distingus', 'company', 'phone', 'authority', 'vhicule', 'seats', 'lightweight', 'rules', 'resident', 'expertise']
>>> y = [i for i in x if len(wn.synsets(i)) > 0]
>>> y
['corporal', 'dog', 'cat', 'company', 'phone', 'authority', 'seats', 'lightweight', 'rules', 'resident', 'expertise']

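A more concise way is to test wn.synsets(i) directly: it returns an empty list (not None) for unknown words, and an empty list is falsy: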

>>> from nltk.corpus import wordnet as wn
>>> x = [i.strip() for i in """corporal, dog, cat, distingus, Company, phone, authority, vhicule, seats, lightweight, rules, resident, expertise""".lower().split(",")]
>>> x
['corporal', 'dog', 'cat', 'distingus', 'company', 'phone', 'authority', 'vhicule', 'seats', 'lightweight', 'rules', 'resident', 'expertise']
>>> [i for i in x if wn.synsets(i)]
['corporal', 'dog', 'cat', 'company', 'phone', 'authority', 'seats', 'lightweight', 'rules', 'resident', 'expertise']
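
To connect this back to the question, here is a minimal sketch of how the IndexError itself could be used to skip unknown words while computing the pairwise similarities. The helper names first_synset and pairwise_similarity are hypothetical, not part of NLTK, and the lookup parameter is assumed to be a function such as wn.synsets that returns a (possibly empty) list of synsets. Like the question's code, the sketch only looks at the first synset of each word.

```python
import itertools


def first_synset(word, lookup):
    """Return the first synset for word, or None if lookup finds nothing.

    `lookup` is assumed to behave like nltk.corpus.wordnet.synsets:
    it takes a word and returns a list, which is empty for unknown words.
    """
    try:
        return lookup(word)[0]
    except IndexError:  # empty list: the word is not in the database
        return None


def pairwise_similarity(words, lookup):
    """Return (word1, word2, score) for each unordered pair of known words."""
    # Look each word up once and drop the ones that have no synset.
    found = [(w, first_synset(w, lookup)) for w in words]
    known = [(w, s) for w, s in found if s is not None]
    return [
        (w1, w2, s1.wup_similarity(s2))
        for (w1, s1), (w2, s2) in itertools.combinations(known, 2)
    ]


# Intended use with NLTK (requires the WordNet corpus to be installed):
#   from nltk.corpus import wordnet as wn
#   for w1, w2, score in pairwise_similarity(["dog", "cat", "vhicule"], wn.synsets):
#       print(w1, w2, score)
```

itertools.combinations is used instead of itertools.product because here all the words come from a single set, so each unordered pair is scored once. Note that the score may be None for some pairs, for example when the two first synsets have different parts of speech.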