Implementing k-means with mixed variables at the database level

Asked: 2017-08-31 12:25:26

Tags: r cluster-analysis k-means similarity netezza

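I have a table whose columns have different data types (e.g. ProductId, Name, size, color, class, dept). Since not all of the columns are numeric, how can I cluster similar products together? The data lives in Netezza, and because the volume is large (about 2 million rows) I would like to do the processing on the database side for speed.

I tried implementing Gower's similarity in R, but it takes a very long time. Can I use a UDF on the Netezza side?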

  
    

dput(head(PROD))

structure(list(Product_key = c("136220083", "134520094", "137520230",
"133420231", "137420204", "136520284"), SRO_score = c(2, 2, 2, 3, 3,
1), PRDF_SKU_NAME = c("1496533", "1496534", "1496537", "1496540",
"1496541", "1496542"), ATTRIB_VAL1 = c("Champion Canvas", "Champion Canvas",
"Champion Canvas", "Champion Canvas", "Champion Canvas",
"Champion Canvas"), ATTRIB_VAL2 = c("Navy Canvas", "Navy Canvas",
"Red", "Red", "Red", "Red"), ATTRIB_VAL3 = c("9.5W", "10W", "7W",
"8.5W", "9W", "9.5W"), ATTRIB_VAL4 = c("Keds", "Keds", "Keds",
"Keds", "Keds", "Keds"), ATTRIB_VAL5 = c("VULCANIZED FOOTWEAR",
"VULCANIZED FOOTWEAR", "VULCANIZED FOOTWEAR", "VULCANIZED FOOTWEAR",
"VULCANIZED FOOTWEAR", "VULCANIZED FOOTWEAR"), ATTRIB_VAL6 = c("WOMENS SPORT TRADITIONAL",
"WOMENS SPORT TRADITIONAL", "WOMENS SPORT TRADITIONAL", "WOMENS SPORT TRADITIONAL",
"WOMENS SPORT TRADITIONAL", "WOMENS SPORT TRADITIONAL"), ATTRIB_VAL7 = c("1.38 lb",
"1.38 lb", "1.38 lb", "1.38 lb", "1.38 lb", "1.38 lb"), ATTRIB_VAL8 = c("SHOES WOMENS SPORT",
"SHOES WOMENS SPORT", "SHOES WOMENS SPORT", "SHOES WOMENS SPORT",
"SHOES WOMENS SPORT", "SHOES WOMENS SPORT"), ATTRIB_VAL9 = c("WOMENS SHOES",
"WOMENS SHOES", "WOMENS SHOES", "WOMENS SHOES", "WOMENS SHOES",
"WOMENS SHOES")), .Names = c("Product_key", "SRO_score", "PRDF_SKU_NAME",
"ATTRIB_VAL1", "ATTRIB_VAL2", "ATTRIB_VAL3", "ATTRIB_VAL4", "ATTRIB_VAL5",
"ATTRIB_VAL6", "ATTRIB_VAL7", "ATTRIB_VAL8", "ATTRIB_VAL9"), row.names = c(4107L,
3927L, 4260L, 3794L, 4246L, 4140L), class = "data.frame")

1 Answer:

Answer 0 (score: 0)

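You cannot simply plug Gower similarity into k-means.

K-means also needs to compute means, and a mean is not defined for categorical attributes such as color or size.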
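To make the point concrete, here is a minimal base-R sketch of a Gower-style dissimilarity between two rows. The data frame `prod` and the helper `gower_pair` are hypothetical, loosely modelled on the columns in the question, and all columns are weighted equally (an assumption). The dissimilarity can be computed for any pair of rows, but there is no meaningful "mean color" to serve as a k-means centroid.

```r
# Hypothetical toy data modelled on the question's columns (not the real table).
prod <- data.frame(
  SRO_score = c(2, 2, 3, 1),
  color     = c("Navy Canvas", "Red", "Red", "Red"),
  size      = c("9.5W", "10W", "7W", "9.5W"),
  stringsAsFactors = TRUE
)

# Gower-style dissimilarity between rows i and j:
# numeric columns contribute |x_i - x_j| / range(x),
# categorical columns contribute 0 (match) or 1 (mismatch),
# and the result is the unweighted average over all columns.
gower_pair <- function(df, i, j) {
  parts <- vapply(names(df), function(col) {
    x <- df[[col]]
    if (is.numeric(x)) {
      rng <- diff(range(x))
      if (rng == 0) 0 else abs(x[i] - x[j]) / rng
    } else {
      as.numeric(x[i] != x[j])
    }
  }, numeric(1))
  mean(parts)
}

d12 <- gower_pair(prod, 1, 2)  # same score, different color and size
d24 <- gower_pair(prod, 2, 4)  # same color, different score and size

# The centroid step of k-means has nothing to work with for a factor:
color_mean <- suppressWarnings(mean(prod$color))  # NA
```

Pairwise dissimilarities are all that methods such as PAM or hierarchical clustering need, which is why they are the usual companions of Gower.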

The usual choice is PAM, but it scales terribly. You do not want to run it on the full data set.
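A rough back-of-the-envelope calculation shows why. PAM works on the full pairwise dissimilarity matrix, so storage grows quadratically with the number of rows. The helper `dist_bytes` below is hypothetical and assumes one 8-byte double per pair in the lower triangle; the 5,000-row sample size is only an illustration.

```r
# Bytes needed to store the lower triangle of an n x n dissimilarity matrix,
# assuming one 8-byte double per pair of rows.
dist_bytes <- function(n) n * (n - 1) / 2 * 8

full_tb   <- dist_bytes(2e6) / 1e12  # full table from the question, in terabytes
sample_mb <- dist_bytes(5e3) / 1e6   # a 5,000-row sample, in megabytes
```

Under these assumptions the full 2 million rows need roughly 16 TB just for the dissimilarities, while a 5,000-row sample needs roughly 100 MB.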

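Instead of scaling to the whole data set, first use a sample to learn what to do. Getting clustering right is hard. Expect to spend 90% of your time on preprocessing.

First find out what works. Then scale. Not the other way around.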
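As a sketch of what "sample first" could look like in R, the following uses `daisy()` and `pam()` from the `cluster` package on a random sample. Everything about the data here is an assumption: the synthetic `prod` table, the sample size of 200, and `k = 3` are placeholders to be replaced by a real sample pulled from Netezza and by values you arrive at through experimentation.

```r
library(cluster)

set.seed(1)

# Synthetic stand-in for the product table (hypothetical values).
n <- 1000
prod <- data.frame(
  SRO_score = sample(1:3, n, replace = TRUE),
  color     = factor(sample(c("Navy Canvas", "Red", "White"), n, replace = TRUE)),
  size      = factor(sample(c("7W", "8.5W", "9W", "9.5W", "10W"), n, replace = TRUE))
)

# Work on a small random sample rather than on the full table.
samp <- prod[sample(nrow(prod), 200), ]

# Gower dissimilarities handle the mix of numeric and factor columns.
d <- daisy(samp, metric = "gower")

# PAM on the precomputed dissimilarities; k = 3 is an arbitrary starting point.
fit <- pam(d, k = 3, diss = TRUE)

# Inspect the medoid products and the cluster sizes before scaling anything up.
medoid_rows   <- samp[fit$id.med, ]
cluster_sizes <- table(fit$clustering)
```

Looking at `medoid_rows` and `cluster_sizes` on a few different samples is a cheap way to check whether the preprocessing and the choice of columns produce clusters that make sense before investing in a database-side implementation.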