Searching for similar products in a sparse matrix in R

Date: 2017-11-01 06:20:14

Tags: r sparse-matrix

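I have a sparse occurrence matrix containing product similarity data. All products appear in the same order on the x and y axes; a value of 1 means the two products are the same, and a value of 0 means they are different.

For example: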

    P1  P2  P3  P4
P1  1   1   0   0
P2  0   1   0   1
P3  0   0   1   1
P4  0   1   0   1

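In this case, P1 is similar to itself and to P2, and P2 in turn is similar to P4. So in the end P1, P2 and P4 are all the same product. I need to write something in R that assigns the same code to P1, P2 and P4: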

Product_Name  Ref_Code 
     P1          P1
     P2          P1
     P3          P3
     P4          P1

Is it possible to do this in R?

Cheers,

Dario.

2 Answers:

Answer 0: (score: 1)

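I agree with @Prem: following your logic, all of the products are the same (P3 is marked as similar to P4, which links it to the other three). I have provided a code example that uses the reshape2 package to put your products into long format. Even though your similarity measure does not produce any distinction between the products, you can take the output of melt() and sort and filter the data in different ways to get the result you want.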

library(reshape2)

data <- read.table ( text = "P1  P2  P3  P4
                          P1  1   1   0   0
                          P2  0   1   0   1
                          P3  0   0   1   1
                          P4  0   1   0   1"
                          , header = TRUE, stringsAsFactors = FALSE)


data <-cbind(rownames(data), data)
names(data)[1] <- "product1"

data.melt <- melt(data
             , id.vars = "product1"
             , measure.vars = colnames(data)[2:ncol(data)]
             , variable.name = "product2"
             , value.name = "similarity"
             ,factorsAsStrings = TRUE)

#check the output of melt, maybe the long format is suitable for your task    
data.melt

#if you split the data by your similarity and check the unique products
#in each list, you will see that they are all the same
data.split <- split(data.melt, data.melt$similarity)

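lapply(data.split, function(x) {
  #as.character() keeps the product names even if one column is a factor
  #(unlist() on a mix of character and factor columns would return factor codes)
  unique(c(as.character(x$product1), as.character(x$product2)))
})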
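The claim that all four products fall into one group can be checked with a few lines of base R. This is only a sketch, and it rests on an assumption the question does not state explicitly: that similarity is symmetric, so a 1 in either direction links two products. The names m, reach and ref_code are illustrative and not part of the answer above.

```r
# Similarity matrix from the question (rows and columns in the same order)
m <- matrix(c(1, 1, 0, 0,
              0, 1, 0, 1,
              0, 0, 1, 1,
              0, 1, 0, 1),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("P", 1:4), paste0("P", 1:4)))

# Assumption: similarity is symmetric, so combine m with its transpose
reach <- ((m + t(m)) > 0) * 1

# Transitive closure: extend chains of similar products until nothing changes
repeat {
  extended <- ((reach %*% reach) > 0) * 1
  if (all(extended == reach)) break
  reach <- extended
}

# Reference code = first product (in matrix order) linked to each product
first_linked <- apply(reach, 1, function(r) which(r > 0)[1])
ref_code <- colnames(reach)[first_linked]

result <- data.frame(Product_Name = rownames(reach),
                     Ref_Code = ref_code,
                     stringsAsFactors = FALSE)
result
```

Under the symmetry assumption every product receives the code P1, because the 1 in row P3, column P4 ties P3 to the chain P1, P2, P4. If P3 is meant to stay separate, the input matrix or the definition of similarity would have to change.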

Answer 1: (score: 0)

Another approach could be:

library(Matrix)

#sample data (to understand this approach better I have slightly modified your input data)
mat <- Matrix(data = c(1,0,0,0,0,1,1,0,1,0,0,0,1,0,0,0,1,0,1,0,0,0,1,0,1), nrow = 5, ncol = 5,
              dimnames = list(c("P1","P2","P3","P4","P5"),c("P1","P2","P3","P4","P5")),
              sparse = TRUE)
mat

#create dataframe having relationship among similar products
mat_summary <- summary(mat)
df <- data.frame(Product_Name = rownames(mat)[mat_summary$i],
                 Similar_Product_Name = colnames(mat)[mat_summary$j])
df <- df[df$Product_Name != df$Similar_Product_Name, ]
df

#clustering - to get the final result
library(igraph)
library(data.table)
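#graph_from_data_frame() and components() are the current igraph names for graph.data.frame() and clusters()
df.g <- graph_from_data_frame(df)
final_df <- setNames(setDT(as.data.frame(components(df.g)$membership), keep.rownames = TRUE)[], c('Product', 'Product_Cluster'))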
final_df

The output is:

   Product Product_Cluster
1:      P1               1
2:      P4               1
3:      P2               1
4:      P3               2
5:      P5               2
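The question asks for a Ref_Code column holding a product name rather than a cluster number. A minimal sketch of that last step follows, assuming the reference code should be the first product (in matrix order) of each cluster. It builds the graph directly from the matrix with graph_from_adjacency_matrix(), which also keeps products that are similar to nothing but themselves; the edge-list approach above would drop those after the self-pairs are filtered out. The variable names are illustrative, and a plain base R matrix is used instead of a sparse one to keep the example short.

```r
library(igraph)

# Same 5 x 5 sample data as in the answer above, as a plain base R matrix
mat <- matrix(c(1,0,0,0,0,1,1,0,1,0,0,0,1,0,0,0,1,0,1,0,0,0,1,0,1),
              nrow = 5, ncol = 5,
              dimnames = list(paste0("P", 1:5), paste0("P", 1:5)))

# Build the graph from the matrix; diag = FALSE ignores self-similarity
g <- graph_from_adjacency_matrix(mat, mode = "directed", diag = FALSE)

# Weakly connected components: edge direction is ignored when grouping
membership <- components(g, mode = "weak")$membership

# Reference code = name of the first product that shares the same cluster
products <- V(g)$name
ref_code <- products[match(membership, membership)]

result <- data.frame(Product_Name = products,
                     Ref_Code = ref_code,
                     stringsAsFactors = FALSE)
result
```

On this sample data P1, P2 and P4 receive the code P1, while P3 and P5 receive P3, which matches the two clusters shown in the output above.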