Question

I'm prepping data to run k-means in GraphLab and am running into the following error:


Here are the current data types of each column:

 tmp = data.select_columns(['a.item_id'])
 tmp['sku'] = tmp['a.item_id'].apply(lambda x: x.split(','))
 tmp = tmp.unpack('sku')

 kmeans_model = gl.kmeans.create(tmp, num_clusters=K)

 Feature 'sku.0' excluded because of its type. Kmeans features must be int, float, dict, or array.array type.
 Feature 'sku.1' excluded because of its type. Kmeans features must be int, float, dict, or array.array type.

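If I could convert the data type from str to int, I think it should work. However, working with SFrames is trickier than with the standard Python libraries. Any help getting there is appreciated.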

Answer 1

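The kmeans model does accept features in dictionary form, just not in list form. This is slightly different from what you have now, because a dictionary loses the order of the SKUs, but in terms of model quality I suspect it actually makes more sense. The key function for this is count_words, in the text analytics toolkit.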

https://dato.com/products/create/docs/generated/graphlab.text_analytics.count_words.html
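
To make the shape of a dict feature concrete, here is a plain-Python sketch of the per-row result of counting comma-delimited SKUs. It uses only collections.Counter, not GraphLab; the helper name count_skus is hypothetical, and count_words itself may differ in details (for example, it may lowercase tokens).

```python
from collections import Counter

def count_skus(item_id, delimiter=','):
    """Hypothetical helper: split a string on the delimiter and count each SKU."""
    tokens = [t.strip() for t in item_id.split(delimiter) if t.strip()]
    return dict(Counter(tokens))

rows = ['abc,xyz,cat', 'rst', 'abc,dog']
sku_counts = [count_skus(r) for r in rows]

# Each row becomes one dict, which is a feature type kmeans accepts.
for r, c in zip(rows, sku_counts):
    print('%s -> %s' % (r, c))
```

Each row ends up as a single dict of SKU counts, so rows with different numbers of SKUs no longer need separate sku.0, sku.1, ... columns.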

import graphlab as gl
sf = gl.SFrame({'item_id': ['abc,xyz,cat', 'rst', 'abc,dog']})
sf['sku_count'] = gl.text_analytics.count_words(sf['item_id'], delimiters=[','])

model = gl.kmeans.create(sf, num_clusters=2, features=['sku_count'])
print model.cluster_id  

+--------+------------+----------------+
| row_id | cluster_id |    distance    |
+--------+------------+----------------+
|   0    |     1      | 0.866025388241 |
|   1    |     0      |      0.0       |
|   2    |     1      | 0.866025388241 |
+--------+------------+----------------+
[3 rows x 3 columns]
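
The distances in this table can be checked by hand. The sketch below is my own reconstruction, not GraphLab code: it assumes each dict is treated as a sparse vector (missing keys count as zero), that a cluster center is the mean of its members, and that the reported distance is Euclidean. The function names centroid and distance are hypothetical.

```python
import math

# Dict features for the three rows (comma-delimited SKU counts).
rows = [
    {'abc': 1, 'xyz': 1, 'cat': 1},
    {'rst': 1},
    {'abc': 1, 'dog': 1},
]

def centroid(members):
    """Mean of sparse vectors; keys missing from a member count as 0."""
    keys = set().union(*members)
    n = float(len(members))
    return {k: sum(m.get(k, 0) for m in members) / n for k in keys}

def distance(a, b):
    """Euclidean distance between two sparse vectors."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0) - b.get(k, 0)) ** 2 for k in keys))

# Cluster 1 holds rows 0 and 2; cluster 0 holds only row 1.
center_1 = centroid([rows[0], rows[2]])
d0 = distance(rows[0], center_1)
d2 = distance(rows[2], center_1)
d1 = distance(rows[1], centroid([rows[1]]))

print('%.6f %.6f %.6f' % (d0, d1, d2))
```

Under these assumptions rows 0 and 2 each sit sqrt(0.75) from their shared center, which agrees with the table to about seven digits, and row 1 is its own center at distance 0.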

SFrame Kmeans - Convert to Int, Float, Dict
