I am using this gensim tutorial to find similarities between texts. Here is the code:
from gensim import corpora, models, similarities
from gensim.models import hdpmodel, ldamodel
from itertools import izip
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
'''
documents = ["Human machine interface for lab abc computer applications",
"bags loose tea water second ingredient tastes water",
"The EPS user interface management system",
"System and human system engineering testing of EPS",
"Relation of user perceived response time to error measurement",
"The generation of random binary unordered trees",
"The intersection graph of paths in trees",
"Graph minors IV Widths of trees and well quasi ordering",
"Graph minors A survey",
"red cow butter oil"]
'''
documents = ["Human machine interface for lab abc computer applications",
"bags loose tea water second ingredient tastes water"]
# remove common words and tokenize
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]
# remove words that appear only once
all_tokens = sum(texts, [])
tokens_once = set(word for word in set(all_tokens) if all_tokens.count(word) == 1)
texts = [[word for word in text if word not in tokens_once]
         for text in texts]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
#print corpus
tfidf = models.TfidfModel(corpus)
#print tfidf
corpus_tfidf = tfidf[corpus]
#print corpus_tfidf
lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2)
lsi.print_topics(1)
lda = models.LdaModel(corpus_tfidf, id2word=dictionary, num_topics=2)
lda.print_topics(1)
corpora.MmCorpus.serialize('dict.mm', corpus)
corpus = corpora.MmCorpus('dict.mm')
#print corpus
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)
doc = "human computer interaction"
vec_bow = dictionary.doc2bow(doc.lower().split())
vec_lsi = lsi[vec_bow]
#print vec_lsi
index = similarities.MatrixSimilarity(lsi[corpus])
index.save('dict.index')
index = similarities.MatrixSimilarity.load('dict.index')
sims = index[vec_lsi]
#print list(enumerate(sims))
sims = sorted(enumerate(sims),key=lambda item: -item[1])
for sim in sims:
    print documents[sim[0]], " ==> ", sim[1]
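There are two document lists here. One has 10 texts and the other has 2; the 10-text list is commented out. If I use the 10-text list, everything works and the output is meaningful. If I use the 2-text list, I get the following error: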
/usr/lib/python2.7/dist-packages/scipy/sparse/compressed.py:122: UserWarning: indices array has non-integer dtype (float64)
% self.indices.dtype.name )
What causes this error, and how can I fix it? I am on a 64-bit machine.
Answer 0 (score: 2)
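This is probably because, once you remove the singletons, the two-text list becomes [[], ['water', 'water']], and matrix operations on matrices with dimensions 0 and 1 can cause all sorts of problems.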
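The filtering can be replayed in plain Python to see what is left. This is a minimal sketch that reuses the stop-word list and the two-text list from the question; it needs only the standard library, and it swaps the original all_tokens.count loop for collections.Counter, which is my substitution rather than part of the question's code.

```python
from collections import Counter

documents = ["Human machine interface for lab abc computer applications",
             "bags loose tea water second ingredient tastes water"]

# Same stop-word list as in the question.
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]

# Count every token across all documents, then drop tokens seen only once.
counts = Counter(word for text in texts for word in text)
texts = [[word for word in text if counts[word] > 1] for text in texts]

print(texts)  # [[], ['water', 'water']]
```

The first document loses every word, and only "water" survives in the second one because it is the only token that occurs more than once.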
Playing with your code:
>>> corpus = [dictionary.doc2bow(text) for text in texts]
>>> corpus
[[], [(0, 2)]]
>>> tfidf = models.TfidfModel(corpus)
2013-07-21 09:23:31,415 : INFO : collecting document frequencies
2013-07-21 09:23:31,415 : INFO : PROGRESS: processing document #0
2013-07-21 09:23:31,415 : INFO : calculating IDF weights for 2 documents and 1 features (1 matrix non-zeros)
>>> corpus = [[(1,)], [(0,2)]]
>>> tfidf = models.TfidfModel(corpus)
2013-07-21 09:24:16,452 : INFO : collecting document frequencies
2013-07-21 09:24:16,452 : INFO : PROGRESS: processing document #0
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/gensim/models/tfidfmodel.py", line 96, in __init__
self.initialize(corpus)
File "/usr/local/lib/python2.7/dist-packages/gensim/models/tfidfmodel.py", line 119, in initialize
for termid, _ in bow:
ValueError: need more than 1 value to unpack
>>> corpus = [[(1,3)], [(0,2)]]
>>> tfidf = models.TfidfModel(corpus)
2013-07-21 09:24:26,892 : INFO : collecting document frequencies
2013-07-21 09:24:26,892 : INFO : PROGRESS: processing document #0
2013-07-21 09:24:26,892 : INFO : calculating IDF weights for 2 documents and 2 features (2 matrix non-zeros)
>>>
As I said above, you need to make sure corpus does not contain any empty lists before calling models.TfidfModel(corpus).
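One way to do that is to drop the emptied documents before the dictionary and corpus are built. The sketch below is only an illustration in plain Python: the names kept, kept_documents and kept_texts are hypothetical, and the texts value is hard-coded to what the question's filtering produces for the two-text list.

```python
# Tokenized texts as they look after stop-word and singleton removal
# in the two-text case from the question.
texts = [[], ['water', 'water']]
documents = ["Human machine interface for lab abc computer applications",
             "bags loose tea water second ingredient tastes water"]

# Keep only documents that still contain at least one token, and keep the
# original strings aligned so similarity results can still be printed.
kept = [(doc, text) for doc, text in zip(documents, texts) if text]
kept_documents = [doc for doc, _ in kept]
kept_texts = [text for _, text in kept]

# Fail early instead of handing an empty corpus to the models.
if not kept_texts:
    raise ValueError("every document became empty after filtering")
```

kept_texts can then go through corpora.Dictionary and doc2bow exactly as in the question. With a single document and a single word left there is very little for the models to work with, so using a larger document list (as in the 10-text case that already works) is probably the more practical fix.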
Answer 1 (score: 0)
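This is not an error, it is a warning. You can ignore it.
In the second case, your query document doc ends up empty, which triggers the warning. You still get the correct answer anyway (an empty vec_lsi).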
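The query is empty because none of its words are in the dictionary, and doc2bow skips unknown tokens. The sketch below imitates that behaviour in plain Python so it runs without gensim; to_bow and token2id are hypothetical stand-ins and not gensim's own code. It also shows one possible guard that skips the similarity lookup when the query vector is empty.

```python
from collections import Counter

# Vocabulary left after filtering the two-text list: only 'water'.
token2id = {'water': 0}


def to_bow(tokens, token2id):
    """Hypothetical stand-in for doc2bow: count only known tokens."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


doc = "human computer interaction"
vec_bow = to_bow(doc.lower().split(), token2id)

# Guard: an empty query vector has nothing to compare against.
if not vec_bow:
    message = "query shares no words with the corpus; skipping similarity lookup"
else:
    message = "query vector: %r" % (vec_bow,)
print(message)
```

In the question's script, the same check could be placed right after vec_bow is computed, before lsi[vec_bow] and index[vec_lsi] are evaluated.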