Words missing from the NLTK vocabulary - Python

Date: 2016-01-13 01:08:37

Tags: python nltk

I'm testing the vocabulary of the NLTK package. I used the following code and expected every line to print True:

import nltk

english_vocab = set(w.lower() for w in nltk.corpus.words.words())

print ('answered' in english_vocab)
print ('unanswered' in english_vocab)
print ('altered' in english_vocab)
print ('alter' in english_vocab)
print ('looks' in english_vocab)
print ('look' in english_vocab)

But my results are shown below. Many words are missing, or rather certain inflected forms of the words are missing. Am I missing something?

False
True
False
True
False
True

2 Answers:

Answer 0 (score: 3)

Indeed, that corpus is not an exhaustive list of all English words, but a collection of texts. A more appropriate way to tell whether a word is a valid English word is to use WordNet:

from nltk.corpus import wordnet as wn

print(wn.synsets('answered'))
# [Synset('answer.v.01'), Synset('answer.v.02'), Synset('answer.v.03'), Synset('answer.v.04'), Synset('answer.v.05'), Synset('answer.v.06'), Synset('suffice.v.01'), Synset('answer.v.08'), Synset('answer.v.09'), Synset('answer.v.10')]

print(wn.synsets('unanswered'))
# [Synset('unanswered.s.01')]

print(wn.synsets('notaword'))
# []
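As a minimal sketch of this idea (the helper name is_english_word is mine, not part of the answer), you could wrap the check in a small function; synsets() applies WordNet's built-in morphological analysis, which is why inflected forms like "answered" are found:

from nltk.corpus import wordnet as wn

def is_english_word(word):
    # Treat a word as valid English if WordNet has at least one synset for it.
    return len(wn.synsets(word)) > 0

# The forms from the question should all be recognized this way.
for w in ['answered', 'unanswered', 'altered', 'alter', 'looks', 'look']:
    print(w, is_english_word(w))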

Answer 1 (score: 2)

NLTK corpora do not actually store every word; they are defined as "large bodies of text".

For example, you are using the words corpus, and we can check its description with the readme() method:

>>> print(nltk.corpus.words.readme())
Wordlists

en: English, http://en.wikipedia.org/wiki/Words_(Unix)
en-basic: 850 English words: C.K. Ogden in The ABC of Basic English (1932)

The Unix words list is not exhaustive, so it may indeed be missing some words. Corpora are by nature incomplete (hence the emphasis on natural language).

That said, you may want to try a corpus derived from a dictionary, such as brown:

>>> print(nltk.corpus.brown.readme())
BROWN CORPUS

A Standard Corpus of Present-Day Edited American English, for use with Digital Computers.

by W. N. Francis and H. Kucera (1964)
Department of Linguistics, Brown University
Providence, Rhode Island, USA

Revised 1971, Revised and Amplified 1979

http://www.hit.uib.no/icame/brown/bcm.html

Distributed with the permission of the copyright holder, redistribution permitted.
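As a rough sketch of that suggestion, you could build a vocabulary set from the Brown corpus the same way the question does with the words corpus; whether a given form is found depends on which words actually occur in the corpus text (this assumes the brown corpus has been downloaded, e.g. via nltk.download('brown')):

import nltk

# Collect every lowercased token that appears in the Brown corpus.
brown_vocab = set(w.lower() for w in nltk.corpus.brown.words())

# Check the same words as in the question; inflected forms may or may not occur.
for w in ['answered', 'unanswered', 'altered', 'alter', 'looks', 'look']:
    print(w, w in brown_vocab)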