Question

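I have a short function that checks whether a word is a real word by comparing it against the WordNet corpus from the Natural Language Toolkit (NLTK). I call this function from a thread that validates txt files. When I run my code, the first call to the function throws an AttributeError with the message

"'WordNetCorpusReader' object has no attribute '_LazyCorpusLoader__args'"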

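When I pause execution, the same line of code does not throw an error, so I assume the error occurs because the corpus has not yet been loaded on my first call.

I tried using nltk.wordnet.ensure_loaded() to force the corpus to load, but I still get the same error.

Here is my function: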

from nltk.corpus import wordnet as wn
from nltk.corpus import stopwords
from nltk.corpus.reader.wordnet import WordNetError
import sys

cachedStopWords = stopwords.words("english")

def is_good_word(word):
    word = word.strip()
    if len(word) <= 2:
        return 0
    if word in cachedStopWords:
        return 0
    try:
        wn.ensure_loaded()
        if len(wn.lemmas(str(word), lang='en')) == 0:
            return 0
    except WordNetError as e:
        print "WordNetError on concept {}".format(word)
    except AttributeError as e:
        print "Attribute error on concept {}: {}".format(word, e.message)
    except:
        print "Unexpected error on concept {}: {}".format(word, sys.exc_info()[0])
    else:
        return 1
    return 1

print (is_good_word('dog')) #Does NOT throw error

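If I put the print statement at global scope in the same file, no error is thrown. If I call the function from my thread, however, it is. Below is a minimal example that reproduces the error. I have tested it, and on my machine it gives the output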

Attribute error on concept dog: 'WordNetCorpusReader' object has no attribute '_LazyCorpusLoader__args'
Attribute error on concept dog: 'WordNetCorpusReader' object has no attribute '_LazyCorpusLoader__args'
Attribute error on concept dog: 'WordNetCorpusReader' object has no attribute '_LazyCorpusLoader__args'
Attribute error on concept dog: 'WordNetCorpusReader' object has no attribute '_LazyCorpusLoader__args'
Attribute error on concept dog: 'WordNetCorpusReader' object has no attribute '_LazyCorpusLoader__args'
Attribute error on concept dog: 'WordNetCorpusReader' object has no attribute '_LazyCorpusLoader__args'
Attribute error on concept dog: 'WordNetCorpusReader' object has no attribute '_LazyCorpusLoader__args'
Attribute error on concept dog: 'WordNetCorpusReader' object has no attribute '_LazyCorpusLoader__args'
Attribute error on concept dog: 'WordNetCorpusReader' object has no attribute '_LazyCorpusLoader__args'

Minimal example:

import time
import threading
from filter_tag import is_good_word

class ProcessMetaThread(threading.Thread):
    def __init__(self):
        threading.Thread.__init__(self)

    def run(self):
        is_good_word('dog') #Throws error


def process_meta(numberOfThreads):

    threadsList = []
    for i in range(numberOfThreads):
        t = ProcessMetaThread()
        t.setDaemon(True)
        t.start()
        threadsList.append(t)

    numComplete = 0
    while numComplete < numberOfThreads:
        # Iterate over the active processes
        for processNum in range(0, numberOfThreads):
            # If a process actually exists
            if threadsList != None:
                # If the process is finished
                if not threadsList[processNum] == None:
                    if not threadsList[processNum].is_alive():
                        numComplete += 1
                        threadsList[processNum] = None
        time.sleep(5)

    print 'Processes Finished'


if __name__ == '__main__':
    process_meta(10)

Answer 1

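I ran your code and got the same error. See below for a working solution. Here is the explanation:

LazyCorpusLoader is a proxy object that stands in for a corpus object before the corpus is loaded. (This prevents NLTK from loading massive corpora into memory before you need them.) The first time this proxy object is accessed, however, it becomes the corpus you intend to load. That is, the LazyCorpusLoader proxy object transforms its __dict__ and __class__ into the __dict__ and __class__ of the corpus you are loading.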
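
The swap can be illustrated with a toy proxy. This is only a simplified sketch of the idea, not NLTK's actual implementation: the classes Corpus and LazyProxy are hypothetical stand-ins, and the code is written for Python 3.

```python
class Corpus(object):
    """Stand-in for a real corpus reader (hypothetical, not an NLTK class)."""

    def __init__(self, name):
        self.name = name

    def words(self):
        return ["dog", "cat"]


class LazyProxy(object):
    """Simplified proxy that turns itself into the real object on first use."""

    def __init__(self, reader_cls, *args):
        self.__reader_cls = reader_cls   # stored as _LazyProxy__reader_cls
        self.__args = args               # stored as _LazyProxy__args

    def __load(self):
        real = self.__reader_cls(*self.__args)
        # Take over the real object's state and type.
        self.__dict__ = real.__dict__
        self.__class__ = real.__class__

    def __getattr__(self, attr):
        # Only called when normal lookup fails, i.e. before the swap.
        self.__load()
        return getattr(self, attr)


corpus = LazyProxy(Corpus, "toy")
print(type(corpus).__name__)   # LazyProxy
print(corpus.words())          # first attribute access triggers the swap
print(type(corpus).__name__)   # Corpus
```

After the first access, nothing of the proxy is left on the object: its class and its instance dictionary are both those of the loaded corpus.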

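If you compare your code with the errors above, you can see that you received 9 errors when you tried to create 10 instances of your class. The first transformation of the LazyCorpusLoader proxy object into a WordNetCorpusReader object succeeded. It was triggered when you accessed wordnet for the first time:

The First Thread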

from nltk.corpus import wordnet as wn
def is_good_word(word):
    ...
    wn.ensure_loaded()  # `LazyCorpusLoader` conversion into `WordNetCorpusReader` starts

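The Second Thread

When your is_good_word function starts running in a second thread, however, your first thread has not yet completely transformed the LazyCorpusLoader proxy object into a WordNetCorpusReader. wn is still a LazyCorpusLoader proxy object, so it starts the __load process again. By the time it tries to convert its __class__ and __dict__ into those of a WordNetCorpusReader object, though, the first thread has already converted the LazyCorpusLoader proxy object into a WordNetCorpusReader. My guess is that you are hitting the error on the line marked with my comment below: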

class LazyCorpusLoader(object):
    ...
    def __load(self):
        ...
        corpus = self.__reader_cls(root, *self.__args, **self.__kwargs)  # load corpus
        ...
        # self.__args == self._LazyCorpusLoader__args
        args, kwargs  = self.__args, self.__kwargs                       # most likely the line throwing the error

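Once the first thread has transformed the LazyCorpusLoader proxy object into a WordNetCorpusReader object, the mangled names no longer work. The WordNetCorpusReader object does not have LazyCorpusLoader anywhere in its mangled names. (self.__args is equivalent to self._LazyCorpusLoader__args as long as the object is a LazyCorpusLoader object.) Thus you get the following error: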

AttributeError: 'WordNetCorpusReader' object has no attribute '_LazyCorpusLoader__args'
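
The name-mangling effect can be reproduced without threads or NLTK. The sketch below uses two hypothetical classes, Lazy and Reader, in place of LazyCorpusLoader and WordNetCorpusReader, and performs the swap by hand to imitate what the first thread has already done when the second thread resumes. It is written for Python 3 and only illustrates the mechanism; it is not the NLTK code.

```python
class Reader(object):
    """Hypothetical stand-in for WordNetCorpusReader."""
    pass


class Lazy(object):
    """Hypothetical stand-in for LazyCorpusLoader."""

    def __init__(self, *args):
        self.__args = args   # the compiler rewrites this to self._Lazy__args

    def read_args(self):
        return self.__args   # also compiled as self._Lazy__args


obj = Lazy(1, 2)
print(obj._Lazy__args)       # the mangled name is the real attribute name
print(obj.read_args())

# Imitate the finished work of the first thread: the object now has the
# reader's class and the reader's (empty) instance dictionary.
obj.__dict__ = Reader().__dict__
obj.__class__ = Reader

try:
    # The second thread is still running code compiled inside Lazy, so it
    # still looks up _Lazy__args, which the object no longer has.
    Lazy.read_args(obj)
except AttributeError as e:
    message = str(e)
    print(message)
```

The message names the new class together with the old mangled attribute, which is the same shape as the error in the question.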

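Alternative

Given this issue, you will want to access the wn object before you start your threads. Here is your code, modified accordingly: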

from nltk.corpus import wordnet as wn
from nltk.corpus import stopwords
from nltk.corpus.reader.wordnet import WordNetError
import sys
import time
import threading

cachedStopWords = stopwords.words("english")

def is_good_word(word):
    word = word.strip()
    if len(word) <= 2:
        return 0
    if word in cachedStopWords:
        return 0
    try:
        if len(wn.lemmas(str(word), lang='en')) == 0:     # no longer the first access of wn
            return 0
    except WordNetError as e:
        print("WordNetError on concept {}".format(word))
    except AttributeError as e:
        print("Attribute error on concept {}: {}".format(word, e.message))
    except:
        print("Unexpected error on concept {}: {}".format(word, sys.exc_info()[0]))
    else:
        return 1
    return 1


class ProcessMetaThread(threading.Thread):
    def __init__(self):
        threading.Thread.__init__(self)

    def run(self):
        is_good_word('dog')


def process_meta(numberOfThreads):
    print wn.__class__            # <class 'nltk.corpus.util.LazyCorpusLoader'>
    wn.ensure_loaded()            # first access to wn transforms it
    print wn.__class__            # <class 'nltk.corpus.reader.wordnet.WordNetCorpusReader'>
    threadsList = []
    for i in range(numberOfThreads):
        start = time.clock()
        t = ProcessMetaThread()
        print time.clock() - start
        t.setDaemon(True)
        t.start()
        threadsList.append(t)

    numComplete = 0
    while numComplete < numberOfThreads:
        # Iterate over the active processes
        for processNum in range(0, numberOfThreads):
            # If a process actually exists
            if threadsList != None:
                # If the process is finished
                if not threadsList[processNum] == None:
                    if not threadsList[processNum].is_alive():
                        numComplete += 1
                        threadsList[processNum] = None
        time.sleep(5)

    print('Processes Finished')


if __name__ == '__main__':
    process_meta(10)

I tested the code above and received no errors.

Answer 2

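I recently ran into this problem while trying to use wordnet synsets, and I found another way around it. I am running an application with fast-api and uvicorn that needs to handle thousands and thousands of requests per second. I tried many different solutions, but in the end the best approach was to move the wordnet synsets into a separate dictionary. It added 5 seconds to the server start-up time (and did not consume that much memory), but of course the read performance with this approach is superb.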

from nltk.corpus import wordnet
from itertools import chain

def get_synsets(word):
    # Look up the precomputed synsets; unknown words give an empty list
    return synsets.get(word.lower(), [])

synsets = {}
lemmas_in_wordnet = set(chain(*[x.lemma_names() for x in wordnet.all_synsets()]))

for lemma in lemmas_in_wordnet:
    synsets[lemma] = wordnet.synsets(lemma)

Performance

timeit("synsets['word']", globals=globals(), number=1000000)
0.03934049699999953

timeit("wordnet.synsets('word')", globals=globals(), number=1000000)
11.193742591000001

What would cause WordNetCorpusReader to have no attribute LazyCorpusLoader?
