基于"伙伴关系构建群集"

时间:2015-08-18 12:02:25

标签: r

前段时间我做过类似的thread但不幸的是,我在那里使用的方法并没有给我任何有希望的结果。我想到了如何以不同的方式做到这一点。我在这里。

当然示例数据:

structure(list(Name1 = c("Mazda RX4", "Mazda RX4", "KIA Ceed", 
"Mazda RX4 Wag", "KIA Ceed", "Valiant", "KIA Classic", "Mazda RX4", 
"Dacia", "Merc 280", "Duster 360", "Merc 230"), Name2 = c("Mazda RX4 Wag", 
"Merc 230", "KIA Sport", "Merc 230", "KIA Classic", "Merc 230", 
"KIA Sport", "Merc 240D", "Mazda RX4 Wag", "Merc 450SE", "Valiant", 
"Duster 360")), .Names = c("Name1", "Name2"), class = "data.frame", row.names = c(NA, 
12L))

此数据框仅包含两列。原始数据有更多,但这次我将只关注那些列。

为了表明我想如何将这些人聚集在一起,我会提出一个理想的输出:

structure(list(Name1 = c("Mazda RX4", "Mazda RX4", "KIA Ceed", 
"Mazda RX4 Wag", "KIA Ceed", "Valiant", "KIA Classic", "Mazda RX4", 
"Dacia", "Merc 280", "Duster 360", "Merc 230"), Name2 = c("Mazda RX4 Wag", 
"Merc 230", "KIA Sport", "Merc 230", "KIA Classic", "Merc 230", 
"KIA Sport", "Merc 240D", "Mazda RX4 Wag", "Merc 450SE", "Valiant", 
"Duster 360"), cluster = c(1, 1, 2, 1, 2, 3, 2, 0, 0, 0, 3, 3
)), .Names = c("Name1", "Name2", "cluster"), row.names = c(NA, 
12L), class = "data.frame")

从输出中可以看出,我想根据合作伙伴对汽车进行集群,这可以在第二列中找到。因此,如果一排中的汽车共享相同的"伙伴"在下一栏中,他们应该聚集在一起。

它如何在表格中看起来有一点解释:

           Name1         Name2 cluster
1      Mazda RX4 Mazda RX4 Wag       1  ## Two Mazda's same cluster
2      Mazda RX4      Merc 230       1  ## First Mazda with another partner
3       KIA Ceed     KIA Sport       2  ## Ceed together with Sport
4  Mazda RX4 Wag      Merc 230       1  ## Second Mazda with the same partner
5       KIA Ceed   KIA Classic       2  ## Ceed together with Classic
6        Valiant      Merc 230       3  
7    KIA Classic     KIA Sport       2  ## And of course Classic with Sport
8      Mazda RX4     Merc 240D       0  ## First Mazda with another Merc but can't be clustered together in the cluster number 1 because the second Mazda doesn't share this "partner".
9          Dacia Mazda RX4 Wag       0  ## Similar situation but just second Mazda
10      Merc 280    Merc 450SE       0
11    Duster 360       Valiant       3
12      Merc 230    Duster 360       3  

这只是我想要实现的简单示例。当然,根据我的原始数据,可能会发生一些汽车将成为不同集群的成员。簇号可以用逗号分隔,或者如果需要可以创建另一列。 对于不能与其他任何东西聚类的对,设置0不是强制性的。它们只能形成一个单行的集群。无论如何我都不会分析它。

我希望我能够准确地解释我想要实现的目标。非常欢迎创意。

当然,我想像我在之前的帖子中所做的那样,为满足我的答案开始赏金。

2 个答案:

答案 0 :(得分:8)

从评论中,“我想创建包含至少3个不同基因的簇,并且它们彼此相互作用。”

这种描述似乎对应于图论中的集团的定义。也就是说,你似乎正在寻找所有3或更大的集团。

cliques

使用您的示例data

library(igraph)
g<-graph.data.frame(data,directed=FALSE)
(q<-cliques(g,min=3))
#> [[1]]
#> + 3/12 vertices, named:
#> [1] Mazda RX4     Mazda RX4 Wag Merc 230     
#> 
#> [[2]]
#> + 3/12 vertices, named:
#> [1] KIA Ceed    KIA Classic KIA Sport  
#> 
#> [[3]]
#> + 3/12 vertices, named:
#> [1] Valiant    Duster 360 Merc 230  

你认识到任何边缘都可以属于多个集团,所以我在每个集团创建了一个列,并带有一个用于向该集团发送信号的标志。

ind<-t(apply(data,1,function(r) sapply(q,function(i) all(as.character(r) %in% names(i)))))
(d1<-cbind(data,ind))
           Name1         Name2     1     2     3
1      Mazda RX4 Mazda RX4 Wag  TRUE FALSE FALSE
2      Mazda RX4      Merc 230  TRUE FALSE FALSE
3       KIA Ceed     KIA Sport FALSE  TRUE FALSE
4  Mazda RX4 Wag      Merc 230  TRUE FALSE FALSE
5       KIA Ceed   KIA Classic FALSE  TRUE FALSE
6        Valiant      Merc 230 FALSE FALSE  TRUE
7    KIA Classic     KIA Sport FALSE  TRUE FALSE
8      Mazda RX4     Merc 240D FALSE FALSE FALSE
9          Dacia Mazda RX4 Wag FALSE FALSE FALSE
10      Merc 280    Merc 450SE FALSE FALSE FALSE
11    Duster 360       Valiant FALSE FALSE  TRUE
12      Merc 230    Duster 360 FALSE FALSE  TRUE

或者,您可以将它们显示在data.frame的每一行的列表中。

(d2<-cbind(data,clique=I(as.list(apply(ind,1,which)))))

           Name1         Name2 clique
1      Mazda RX4 Mazda RX4 Wag      1
2      Mazda RX4      Merc 230      1
3       KIA Ceed     KIA Sport      2
4  Mazda RX4 Wag      Merc 230      1
5       KIA Ceed   KIA Classic      2
6        Valiant      Merc 230      3
7    KIA Classic     KIA Sport      2
8      Mazda RX4     Merc 240D       
9          Dacia Mazda RX4 Wag       
10      Merc 280    Merc 450SE       
11    Duster 360       Valiant      3
12      Merc 230    Duster 360      3

答案 1 :(得分:3)

可能效率低得多,但是建议/备注:

cars <- structure(list(Name1 = c("Mazda RX4", "Mazda RX4", "KIA Ceed", 
 "Mazda RX4 Wag", "KIA Ceed", "Valiant", "KIA Classic", "Mazda RX4", 
 "Dacia", "Merc 280", "Duster 360", "Merc 230"), Name2 = c("Mazda RX4 Wag", 
 "Merc 230", "KIA Sport", "Merc 230", "KIA Classic", "Merc 230", 
 "KIA Sport", "Merc 240D", "Mazda RX4 Wag", "Merc 450SE", "Valiant", 
 "Duster 360")), .Names = c("Name1", "Name2"), class = "data.frame", row.names = c(NA, 
 12L))

# Add the cluster number column to df, first row being cluster 1
cars$cluster <- c(1,rep(0,nrow(cars)-1))

# First cluster, we have to start somewhere
clusters <- list(c(paste0(cars$Name1[1]),paste0(cars$Name2[1]))) # Side note, use of paste0 for a readable output in case of factorized dataframe


# Now the ugly part, loop over the df starting at row 2
for (i in 2:nrow(cars)) {
  # Get the cars name in a more easy variable name
  c1 <- paste0(cars$Name1[i])
  c2 <- paste0(cars$Name2[i])
  # boolean to know if a new cluster have to be created
  found <- F

  # Check if first car is referenced somewhere in a cluster
  if (c1 %in% unlist(clusters)) {
    # It is, loop over the cluster list to find in wich
    for (j in 1:length(clusters)) {
      cl <- clusters[[j]] # Same shortcut var for the cluster
      if (c1 %in% cl) { # Find which cluster c1 is part of
        others <- cl[cl != c1] # Now get the other cluster members
        # Now check if the partner exists in df with relation to 1 of the others 
        if ( any( (cars$Name1 %in% others & cars$Name2 == c2)  
                | (cars$Name2 %in% others & cars$Name1 == c2) 

             )
           )
        {
          if (!c2 %in% cl) {
            clusters[[j]] <- append(cl,c2) # Update the cluster with partner car if not already present

          }
          found <- T # Set the boolean
          break # We can stop looping in the cluster list now
        }

      }
    }
  } else if (c2 %in% unlist(clusters)) { # Same as previous block with c1 and c2 swapped
    for (j in 1:length(clusters)) {
      cl <- clusters[[j]]
      if (c2 %in% cl) {
        others <- cl[cl != c2]
        if ( any( (cars$Name1 %in% others & cars$Name2 == c1) 
                | (cars$Name2 %in% others & cars$Name1 == c1) 
             )
           )
        {
          if (!c1 %in% cl) {
            clusters[[j]] <- append(cl,c1)
          }
          found <- T
          break
        }
      }
    }
  }
  # If the pair could be related to a cluster, update the df and got to next row
  if (found == T) { 
    cars$cluster[i] <- j
    next
  }
  # We didn't found a match, just create a new cluster with the pair and udpate the df
  clusters[[length(clusters)+1]] <- c(c1,c2)
  cars$cluster[i] <- length(clusters)
}

输出:

> cars
           Name1         Name2 cluster
1      Mazda RX4 Mazda RX4 Wag       1
2      Mazda RX4      Merc 230       1
3       KIA Ceed     KIA Sport       2
4  Mazda RX4 Wag      Merc 230       1
5       KIA Ceed   KIA Classic       2
6        Valiant      Merc 230       3
7    KIA Classic     KIA Sport       2
8      Mazda RX4     Merc 240D       4
9          Dacia Mazda RX4 Wag       5
10      Merc 280    Merc 450SE       6
11    Duster 360       Valiant       3
12      Merc 230    Duster 360       3

或者在群集列表视图中:

> clusters
[[1]]
[1] "Mazda RX4"     "Mazda RX4 Wag" "Merc 230"     

[[2]]
[1] "KIA Ceed"    "KIA Sport"   "KIA Classic"

[[3]]
[1] "Valiant"    "Merc 230"   "Duster 360"

[[4]]
[1] "Mazda RX4" "Merc 240D"

[[5]]
[1] "Dacia"         "Mazda RX4 Wag"

[[6]]
[1] "Merc 280"   "Merc 450SE"