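I have a table whose columns have different data types (for example ProductId, Name, size, color, class, dept). Since not all of the columns are numeric, how can I cluster similar products together? The data sits in Netezza, and because it is large (about 2 million rows) I would like to do the processing on the database side for speed.
I tried implementing Gower's similarity in R, but it takes a very long time. Can I use a UDF on the Netezza side?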
dput(head(PROD))
structure(list(
  Product_key = c("136220083", "134520094", "137520230", "133420231", "137420204", "136520284"),
  SRO_score = c(2, 2, 2, 3, 3, 1),
  PRDF_SKU_NAME = c("1496533", "1496534", "1496537", "1496540", "1496541", "1496542"),
  ATTRIB_VAL1 = c("Champion Canvas", "Champion Canvas", "Champion Canvas", "Champion Canvas", "Champion Canvas", "Champion Canvas"),
  ATTRIB_VAL2 = c("Navy Canvas", "Navy Canvas", "Red", "Red", "Red", "Red"),
  ATTRIB_VAL3 = c("9.5W", "10W", "7W", "8.5W", "9W", "9.5W"),
  ATTRIB_VAL4 = c("Keds", "Keds", "Keds", "Keds", "Keds", "Keds"),
  ATTRIB_VAL5 = c("VULCANIZED FOOTWEAR", "VULCANIZED FOOTWEAR", "VULCANIZED FOOTWEAR", "VULCANIZED FOOTWEAR", "VULCANIZED FOOTWEAR", "VULCANIZED FOOTWEAR"),
  ATTRIB_VAL6 = c("WOMENS SPORT TRADITIONAL", "WOMENS SPORT TRADITIONAL", "WOMENS SPORT TRADITIONAL", "WOMENS SPORT TRADITIONAL", "WOMENS SPORT TRADITIONAL", "WOMENS SPORT TRADITIONAL"),
  ATTRIB_VAL7 = c("1.38 lb", "1.38 lb", "1.38 lb", "1.38 lb", "1.38 lb", "1.38 lb"),
  ATTRIB_VAL8 = c("SHOES WOMENS SPORT", "SHOES WOMENS SPORT", "SHOES WOMENS SPORT", "SHOES WOMENS SPORT", "SHOES WOMENS SPORT", "SHOES WOMENS SPORT"),
  ATTRIB_VAL9 = c("WOMENS SHOES", "WOMENS SHOES", "WOMENS SHOES", "WOMENS SHOES", "WOMENS SHOES", "WOMENS SHOES")),
  .Names = c("Product_key", "SRO_score", "PRDF_SKU_NAME", "ATTRIB_VAL1", "ATTRIB_VAL2", "ATTRIB_VAL3", "ATTRIB_VAL4", "ATTRIB_VAL5", "ATTRIB_VAL6", "ATTRIB_VAL7", "ATTRIB_VAL8", "ATTRIB_VAL9"),
  row.names = c(4107L, 3927L, 4260L, 3794L, 4246L, 4140L),
  class = "data.frame")
Answer 0 (score: 0)
You can't simply use k-means with Gower similarity.
K-means also needs to compute means, and a mean is not defined for categorical columns.
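To make this concrete, here is a minimal Python sketch of Gower dissimilarity on mixed records. The function name `gower_distance` and the `numeric_ranges` argument are illustrative assumptions, not part of any library. The two records are rows 1 and 6 of the sample in the question, restricted to four columns. A pairwise number like this exists for any two rows, but there is no meaningful "mean" of `Navy Canvas` and `Red`, which is what k-means would need.

```python
def gower_distance(a, b, numeric_ranges):
    """Gower dissimilarity between two records given as dicts.

    numeric_ranges maps each numeric column to its range (max - min) over
    the data; every other column is treated as categorical.
    """
    total = 0.0
    for col in a:
        if col in numeric_ranges:
            rng = numeric_ranges[col]
            # Numeric: absolute difference scaled by the column range.
            total += abs(a[col] - b[col]) / rng if rng else 0.0
        else:
            # Categorical: 0 if equal, 1 otherwise.
            total += 0.0 if a[col] == b[col] else 1.0
    return total / len(a)


row1 = {"SRO_score": 2, "ATTRIB_VAL2": "Navy Canvas",
        "ATTRIB_VAL3": "9.5W", "ATTRIB_VAL4": "Keds"}
row6 = {"SRO_score": 1, "ATTRIB_VAL2": "Red",
        "ATTRIB_VAL3": "9.5W", "ATTRIB_VAL4": "Keds"}

# SRO_score runs from 1 to 3 in the six sample rows, so its range is 2.
ranges = {"SRO_score": 2}

d = gower_distance(row1, row6, ranges)
print(d)  # 0.375 = (0.5 + 1 + 0 + 0) / 4
```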
The usual choice is PAM, but it scales terribly. You don't want to run it on the full data set.
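A rough back-of-the-envelope check of why: classic PAM works from all pairwise dissimilarities, and their number grows quadratically with the row count. The sketch below assumes each distance is stored as an 8-byte float; the helper name `pairwise_cost` is made up for illustration.

```python
def pairwise_cost(n_rows, bytes_per_distance=8):
    """Number of distinct row pairs and the bytes needed to store them."""
    pairs = n_rows * (n_rows - 1) // 2
    return pairs, pairs * bytes_per_distance


# About 2 million rows, as stated in the question.
pairs, size_bytes = pairwise_cost(2_000_000)
print(f"{pairs:,} distances, about {size_bytes / 1e12:.0f} TB")
# 1,999,999,000,000 distances, about 16 TB
```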
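Rather than scaling to the whole data set, first work on a sample to learn what to do. Getting clustering right is hard. Expect to spend 90% of your time on preprocessing.
First figure out what works. Then scale. Not the other way around.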
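As one way to try this on a sample, the Python sketch below draws a small random sample and clusters it with a simple alternating k-medoids over Gower dissimilarities. This is a simplified stand-in, not the full PAM swap procedure. The table is randomly generated placeholder data, and the names `gower`, `k_medoids`, and `table` are assumptions made for illustration.

```python
import random


def gower(a, b, ranges):
    """Gower dissimilarity for tuples; ranges[i] is None for categorical columns."""
    total = 0.0
    for x, y, rng in zip(a, b, ranges):
        if rng is None:
            total += 0.0 if x == y else 1.0
        else:
            total += abs(x - y) / rng if rng else 0.0
    return total / len(ranges)


def k_medoids(rows, k, ranges, max_iter=20, seed=0):
    """Alternate between assigning rows to medoids and re-picking medoids."""
    rnd = random.Random(seed)
    n = len(rows)
    # Full distance matrix: affordable on a sample, not on the whole table.
    dist = [[gower(rows[i], rows[j], ranges) for j in range(n)] for i in range(n)]

    def assign(meds):
        return [min(range(k), key=lambda c: dist[i][meds[c]]) for i in range(n)]

    medoids = rnd.sample(range(n), k)
    for _ in range(max_iter):
        labels = assign(medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:  # empty cluster: keep the previous medoid
                new_medoids.append(medoids[c])
                continue
            # New medoid: the member with the smallest total distance to the rest.
            new_medoids.append(
                min(members, key=lambda m: sum(dist[m][j] for j in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(medoids)


# Placeholder table: (score, color, size); values are made up.
rnd = random.Random(42)
colors = ["Red", "Navy Canvas", "White"]
sizes = ["7W", "8.5W", "9W", "9.5W", "10W"]
table = [(rnd.randint(1, 3), rnd.choice(colors), rnd.choice(sizes))
         for _ in range(5000)]

sample = rnd.sample(table, 200)  # learn on a sample first
scores = [row[0] for row in sample]
ranges = [max(scores) - min(scores), None, None]

medoids, labels = k_medoids(sample, k=3, ranges=ranges)
for c, m in enumerate(medoids):
    print(c, sample[m], labels.count(c))
```

Once the preprocessing and the choice of k look reasonable on the sample, the remaining rows can be assigned to the nearest medoid, which costs only k distance computations per row.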