I am trying to cluster text documents using GAAC (GAAClusterer) from NLTK. The code is below.
from nltk.corpus import PlaintextCorpusReader
from nltk.cluster import GAAClusterer
from gensim import corpora
import numpy
import gensim
import itertools
filepath=r'C:\ISSS609\Forum'
corpus=PlaintextCorpusReader(filepath,'.*')
fids=corpus.fileids()
docs=[corpus.words(f) for f in fids]
dictionary=corpora.Dictionary(docs)
vec=[dictionary.doc2bow(doc) for doc in docs]
vec2=list(itertools.chain(*vec))
vectors = [numpy.array(f) for f in vec2]
clusterer = GAAClusterer()
clusterer.cluster(vectors,False)
clusterer.dendrogram()
Here is the error I get:
Traceback (most recent call last):
  File "C:/Users/Aditya/PycharmProjects/untitled1/test2.py", line 21, in <module>
    clusterer.cluster(vectors,False)
  File "C:\Python34\lib\site-packages\nltk\cluster\gaac.py", line 41, in cluster
    return VectorSpaceClusterer.cluster(self, vectors, assign_clusters, trace)
  File "C:\Python34\lib\site-packages\nltk\cluster\util.py", line 57, in cluster
    self.cluster_vectorspace(vectors, trace)
  File "C:\Python34\lib\site-packages\nltk\cluster\gaac.py", line 52, in cluster_vectorspace
    dist = numpy.ones(dims, dtype=numpy.float)*numpy.inf
  File "C:\Python34\lib\site-packages\numpy\core\numeric.py", line 178, in ones
    a = empty(shape, dtype, order)
ValueError: array is too big.
Please suggest a workaround.
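
One possible direction, offered as a sketch rather than a confirmed fix: the traceback shows the failure while allocating the `dist` matrix, and the code above flattens every (token_id, count) pair from every document into its own 2-element vector, so the clusterer is handed far more items than there are documents. Assuming the goal is one vector per document, each bag-of-words list could be expanded into a dense vector whose length equals the vocabulary size. The names `bow_to_dense`, `toy_bows`, and `num_terms` below are hypothetical and only stand in for the real `vec` and `len(dictionary)`.

```python
def bow_to_dense(bow, num_terms):
    """Expand a sparse [(token_id, count), ...] list into a dense list of length num_terms."""
    dense = [0.0] * num_terms
    for token_id, count in bow:
        dense[token_id] = float(count)
    return dense


# Toy stand-in for `vec` in the code above: one bag-of-words list per document.
toy_bows = [
    [(0, 2), (1, 1)],
    [(1, 3), (2, 1)],
    [(0, 1), (2, 2), (3, 1)],
]
num_terms = 4  # assumption: this would be len(dictionary) in the real code

# What the posted code does: every (id, count) pair becomes a separate item.
flattened = [pair for bow in toy_bows for pair in bow]

# Alternative: one dense vector per document.
doc_vectors = [bow_to_dense(bow, num_terms) for bow in toy_bows]

print(len(flattened))    # number of (id, count) pairs across all documents
print(len(doc_vectors))  # number of documents
```

If this matches the situation, each dense list could be wrapped in `numpy.array(...)` and passed to `clusterer.cluster` in place of the flattened list. With a large vocabulary the dense vectors themselves can become big, so trimming the dictionary first may still be necessary.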