Clustering text documents with h2o4gpu K-Means in Python

Time: 2018-07-30 14:29:06

Tags: python python-3.x k-means h2o4gpu

I am interested in clustering text documents with h2o4gpu. For reference, I followed this tutorial, but changed the code to use h2o4gpu.

from sklearn.feature_extraction.text import TfidfVectorizer
import h2o4gpu

documents = ["Human machine interface for lab abc computer applications",
         "A survey of user opinion of computer system response time",
         "The EPS user interface management system",
         "System and human system engineering testing of EPS",
         "Relation of user perceived response time to error measurement",
         "The generation of random binary unordered trees",
         "The intersection graph of paths in trees",
         "Graph minors IV Widths of trees and well quasi ordering",
         "Graph minors A survey"]

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)

true_k = 2
model = h2o4gpu.KMeans(n_gpus=1, n_clusters=true_k, init='k-means++',
                       max_iter=100, n_init=1)
model.fit(X)

However, when I run the code example above, I get the following error:

Traceback (most recent call last):
  File "dev.py", line 20, in <module>
    model.fit(X)
  File "/home/greg/anaconda3/lib/python3.6/site-packages/h2o4gpu/solvers/kmeans.py", line 810, in fit
    res = self.model.fit(X, y)
  File "/home/greg/anaconda3/lib/python3.6/site-packages/h2o4gpu/solvers/kmeans.py", line 303, in fit
    X_np, _, _, _, _, _ = _get_data(X, ismatrix=True)
  File "/home/greg/anaconda3/lib/python3.6/site-packages/h2o4gpu/solvers/utils.py", line 119, in _get_data
    data, ismatrix=ismatrix, dtype=dtype, order=order)
  File "/home/greg/anaconda3/lib/python3.6/site-packages/h2o4gpu/solvers/utils.py", line 79, in _to_np
    outdata = outdata.astype(dtype, copy=False, order=nporder)
ValueError: setting an array element with a sequence.

I have searched for h2o4gpu.feature_extraction.text.TfidfVectorizer, but have not found it in h2o4gpu. That said, is there a way to fix this problem?

Software versions

  • CUDA 9.0, V9.0.176

  • cuDNN 7.1.3

  • Python 3.6.4

  • h2o4gpu 0.2.0

  • Scikit-Learn 0.19.1

1 Answer:

Answer 0 (score: 1)

X = TfidfVectorizer(stop_words='english').fit_transform(documents)

returns a sparse matrix object, scipy.sparse.csr_matrix.
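A quick way to confirm this without a GPU is to inspect what the vectorizer returns. The snippet below is only a minimal sketch: the shortened `docs` list is an illustrative assumption rather than the full corpus from the question, and it relies only on scikit-learn and SciPy (h2o4gpu is not needed to run it).

```python
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative subset of the documents from the question (assumption).
docs = [
    "Human machine interface for lab abc computer applications",
    "The intersection graph of paths in trees",
    "Graph minors A survey",
]

X = TfidfVectorizer(stop_words='english').fit_transform(docs)

# The result is a SciPy sparse container, not a NumPy ndarray.
print(type(X).__name__)
print("is sparse:", sparse.issparse(X))
# One row per document; only the non-zero TF-IDF weights are stored.
print("shape:", X.shape, "stored entries:", X.nnz)
```

If `sparse.issparse(X)` is true, the matrix has to be converted before it is passed to an estimator that only accepts dense input.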

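Currently, H2O4GPU supports only dense representations for KMeans. This means you have to convert X to a 2D Python list or a 2D NumPy array, filling the missing elements with 0.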

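from sklearn.feature_extraction.text import TfidfVectorizer
import h2o4gpu

# `documents` is the list of strings defined in the question.
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)
# Convert the sparse CSR matrix to a dense 2D NumPy array (absent entries become 0).
X_dense = X.toarray()

true_k = 2
model = h2o4gpu.KMeans(n_gpus=1, n_clusters=true_k, init='k-means++',
                       max_iter=100, n_init=1)
model.fit(X_dense)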

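This should do the trick. It is not an optimal solution for NLP, since it may require considerably more memory, but sparse support for KMeans is not on our roadmap yet.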
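To get a feel for the memory cost mentioned above, one can compare the bytes held by the sparse matrix with the bytes of its dense copy before fitting. This is a rough sketch under stated assumptions: `sparse_nbytes` is a hypothetical helper written for illustration (it is not part of h2o4gpu or scikit-learn), and the short `docs` list stands in for a real corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def sparse_nbytes(m):
    """Bytes held by the three arrays backing a CSR matrix (hypothetical helper)."""
    return m.data.nbytes + m.indices.nbytes + m.indptr.nbytes


# Illustrative subset of the documents from the question (assumption).
docs = [
    "A survey of user opinion of computer system response time",
    "The generation of random binary unordered trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",
]

X = TfidfVectorizer(stop_words='english').fit_transform(docs)
X_dense = X.toarray()  # dense 2D array, zeros where the sparse matrix had no entry

n_rows, n_cols = X.shape
# Fraction of cells that actually hold a TF-IDF weight.
density = X.nnz / (n_rows * n_cols)

print(f"shape={X.shape}, density={density:.2f}")
print(f"sparse bytes={sparse_nbytes(X)}, dense bytes={X_dense.nbytes}")
```

The dense size grows with the number of documents times the vocabulary size, regardless of how many cells are zero, so on a large corpus it is worth checking this estimate against the available memory before calling `toarray()`.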