I have a short function that checks whether a word is a real word by comparing it against the WordNet corpus from the Natural Language Toolkit. I call this function from a thread that validates a txt file. When I run my code, the first time the function is called it throws an AttributeError with the message

"'WordNetCorpusReader' object has no attribute '_LazyCorpusLoader__args'"

When I pause execution, the same line of code does not raise the error, so I assume the error is caused by the corpus not yet being loaded at the time of my first call. I tried forcing the corpus to load with nltk.wordnet.ensure_loaded(), but I still get the same error.

Here is my function:
from nltk.corpus import wordnet as wn
from nltk.corpus import stopwords
from nltk.corpus.reader.wordnet import WordNetError
import sys

cachedStopWords = stopwords.words("english")

def is_good_word(word):
    word = word.strip()
    if len(word) <= 2:
        return 0
    if word in cachedStopWords:
        return 0
    try:
        wn.ensure_loaded()
        if len(wn.lemmas(str(word), lang='en')) == 0:
            return 0
    except WordNetError as e:
        print "WordNetError on concept {}".format(word)
    except AttributeError as e:
        print "Attribute error on concept {}: {}".format(word, e.message)
    except:
        print "Unexpected error on concept {}: {}".format(word, sys.exc_info()[0])
    else:
        return 1
    return 1

print (is_good_word('dog')) #Does NOT throw error
If the print statement is in the same file at global scope, no error is thrown. However, if I call the function from my threads, the error is raised. Below is a minimal example that reproduces the error. I have tested it, and on my machine it gives the following output:
Attribute error on concept dog: 'WordNetCorpusReader' object has no attribute '_LazyCorpusLoader__args'
Attribute error on concept dog: 'WordNetCorpusReader' object has no attribute '_LazyCorpusLoader__args'
Attribute error on concept dog: 'WordNetCorpusReader' object has no attribute '_LazyCorpusLoader__args'
Attribute error on concept dog: 'WordNetCorpusReader' object has no attribute '_LazyCorpusLoader__args'
Attribute error on concept dog: 'WordNetCorpusReader' object has no attribute '_LazyCorpusLoader__args'
Attribute error on concept dog: 'WordNetCorpusReader' object has no attribute '_LazyCorpusLoader__args'
Attribute error on concept dog: 'WordNetCorpusReader' object has no attribute '_LazyCorpusLoader__args'
Attribute error on concept dog: 'WordNetCorpusReader' object has no attribute '_LazyCorpusLoader__args'
Attribute error on concept dog: 'WordNetCorpusReader' object has no attribute '_LazyCorpusLoader__args'
Minimal example:
import time
import threading
from filter_tag import is_good_word

class ProcessMetaThread(threading.Thread):
    def __init__(self):
        threading.Thread.__init__(self)

    def run(self):
        is_good_word('dog') #Throws error

def process_meta(numberOfThreads):
    threadsList = []
    for i in range(numberOfThreads):
        t = ProcessMetaThread()
        t.setDaemon(True)
        t.start()
        threadsList.append(t)

    numComplete = 0
    while numComplete < numberOfThreads:
        # Iterate over the active processes
        for processNum in range(0, numberOfThreads):
            # If a process actually exists
            if threadsList != None:
                # If the process is finished
                if not threadsList[processNum] == None:
                    if not threadsList[processNum].is_alive():
                        numComplete += 1
                        threadsList[processNum] = None
        time.sleep(5)

    print 'Processes Finished'

if __name__ == '__main__':
    process_meta(10)
Answer 0 (score: 15)
I ran your code and got the same error. See below for a working solution. Here is the explanation:

LazyCorpusLoader is a proxy object that stands in for the corpus object before the corpus is loaded. (This keeps NLTK from loading huge corpora into memory before you need them.) The first time this proxy object is accessed, however, it becomes the corpus you want to load. That is, the LazyCorpusLoader proxy object converts its __dict__ and __class__ into the __dict__ and __class__ of the corpus you are loading.
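As a rough illustration of that trick, here is a minimal sketch (the LazyLoader and DummyReader classes below are made up for illustration and are not NLTK's actual code):

class LazyLoader(object):
    """Simplified stand-in for nltk.corpus.util.LazyCorpusLoader."""
    def __init__(self, reader_cls, *args, **kwargs):
        self.__reader_cls = reader_cls   # stored under mangled names, e.g. _LazyLoader__reader_cls
        self.__args = args
        self.__kwargs = kwargs

    def __load(self):
        # Build the real reader, then *become* it by swapping __dict__ and __class__.
        corpus = self.__reader_cls(*self.__args, **self.__kwargs)
        self.__dict__ = corpus.__dict__
        self.__class__ = corpus.__class__

    def __getattr__(self, attr):
        self.__load()
        # After __load(), self is effectively the real reader, so this resolves there.
        return getattr(self, attr)

class DummyReader(object):
    def lemmas(self, word):
        return [word]

wn_like = LazyLoader(DummyReader)
print(wn_like.lemmas('dog'))   # first access triggers __load() and the class swap
print(wn_like.__class__)       # now DummyReader, no longer LazyLoader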
If you compare your code with the errors above, you can see that you received 9 errors while trying to create 10 instances of your class. The first conversion of the LazyCorpusLoader proxy object into a WordNetCorpusReader object succeeded. It was triggered the first time you accessed wordnet:
First thread
from nltk.corpus import wordnet as wn

def is_good_word(word):
    ...
    wn.ensure_loaded() # `LazyCorpusLoader` conversion into `WordNetCorpusReader` starts
Second thread
However, when you start running the is_good_word function in your second thread, the first thread has not yet finished converting the LazyCorpusLoader proxy object into a WordNetCorpusReader. wn is still a LazyCorpusLoader proxy object, so it starts the __load process again. But by the time it tries to convert its __class__ and __dict__ into those of a WordNetCorpusReader object, the first thread has already converted the LazyCorpusLoader proxy object into a WordNetCorpusReader. My guess is that you are hitting the error on the line commented below:
class LazyCorpusLoader(object):
    ...
    def __load(self):
        ...
        corpus = self.__reader_cls(root, *self.__args, **self.__kwargs) # load corpus
        ...
        # self.__args == self._LazyCorpusLoader__args
        args, kwargs = self.__args, self.__kwargs # most likely the line throwing the error
Once the first thread has converted the LazyCorpusLoader proxy object into a WordNetCorpusReader object, the mangled names no longer work. A WordNetCorpusReader object will not have LazyCorpusLoader anywhere in its mangled names. (self.__args is equivalent to self._LazyCorpusLoader__args only while the object is a LazyCorpusLoader object.) Hence you get the following error:

AttributeError: 'WordNetCorpusReader' object has no attribute '_LazyCorpusLoader__args'
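As a side note on why the mangled-name lookup fails, the same kind of AttributeError can be reproduced with a minimal, NLTK-free sketch (the Loader and Reader classes below are made up for illustration):

class Loader(object):
    def __init__(self):
        self.__args = ('wordnet',)   # name-mangled to self._Loader__args

    def load(self):
        # Code compiled inside Loader always looks up the mangled name '_Loader__args'.
        return self.__args

class Reader(object):
    def __init__(self):
        self.data = 'real corpus reader'

obj = Loader()
load = obj.load                      # a bound method, like a thread caught mid-__load
print(load())                        # works: ('wordnet',)

# Simulate the proxy conversion performed by the other thread.
reader = Reader()
obj.__dict__ = reader.__dict__
obj.__class__ = Reader

try:
    print(load())                    # still asks for '_Loader__args', which is gone now
except AttributeError as e:
    print(e)                         # 'Reader' object has no attribute '_Loader__args'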
Solution

To work around this problem, access the wn object before entering your threads. Here is the appropriately modified code:
from nltk.corpus import wordnet as wn
from nltk.corpus import stopwords
from nltk.corpus.reader.wordnet import WordNetError
import sys
import time
import threading

cachedStopWords = stopwords.words("english")

def is_good_word(word):
    word = word.strip()
    if len(word) <= 2:
        return 0
    if word in cachedStopWords:
        return 0
    try:
        if len(wn.lemmas(str(word), lang='en')) == 0: # no longer the first access of wn
            return 0
    except WordNetError as e:
        print("WordNetError on concept {}".format(word))
    except AttributeError as e:
        print("Attribute error on concept {}: {}".format(word, e.message))
    except:
        print("Unexpected error on concept {}: {}".format(word, sys.exc_info()[0]))
    else:
        return 1
    return 1

class ProcessMetaThread(threading.Thread):
    def __init__(self):
        threading.Thread.__init__(self)

    def run(self):
        is_good_word('dog')

def process_meta(numberOfThreads):
    print wn.__class__ # <class 'nltk.corpus.util.LazyCorpusLoader'>
    wn.ensure_loaded() # first access to wn transforms it
    print wn.__class__ # <class 'nltk.corpus.reader.wordnet.WordNetCorpusReader'>

    threadsList = []
    for i in range(numberOfThreads):
        start = time.clock()
        t = ProcessMetaThread()
        print time.clock() - start
        t.setDaemon(True)
        t.start()
        threadsList.append(t)

    numComplete = 0
    while numComplete < numberOfThreads:
        # Iterate over the active processes
        for processNum in range(0, numberOfThreads):
            # If a process actually exists
            if threadsList != None:
                # If the process is finished
                if not threadsList[processNum] == None:
                    if not threadsList[processNum].is_alive():
                        numComplete += 1
                        threadsList[processNum] = None
        time.sleep(5)

    print('Processes Finished')

if __name__ == '__main__':
    process_meta(10)
I tested the code above and did not receive any errors.
Answer 1 (score: 0)
I recently ran into this problem while trying to use WordNet synsets, and I found another workaround. I am running an application with FastAPI and uvicorn that needs to handle thousands of requests per second. I tried many different solutions, but in the end the best approach was to move the WordNet synsets into a separate dictionary. It adds about 5 seconds to server startup time (and does not consume that much memory), but of course reading the data this way is extremely fast.
from nltk.corpus import wordnet
from itertools import chain

# Build a plain dict mapping every lemma in WordNet to its synsets up front.
synsets = {}
lemmas_in_wordnet = set(chain(*[x.lemma_names() for x in wordnet.all_synsets()]))
for lemma in lemmas_in_wordnet:
    synsets[lemma] = wordnet.synsets(lemma)

def get_synsets(word):
    try:
        return synsets[word.lower()]
    except KeyError:
        return []
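Lookups are then ordinary dictionary accesses, for example (a hypothetical call against the dictionary built above):

print(get_synsets('dog')[:3])    # first few Synset objects for 'dog'
print(get_synsets('notaword'))   # [] -- unknown words fall back to an empty list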
Performance
timeit("synsets['word']", globals=globals(), number=1000000)
0.03934049699999953
timeit("wordnet.synsets('word')", globals=globals(), number=1000000)
11.193742591000001