Combining cells that have similar values

Time: 2017-10-08 23:59:00

Tags: r algorithm

I have a data frame like the one below.

   New_ment1_1   New_ment1_2     New_ment1_3            New_ment1_4
 1 application     android           ios                     NA
 2 donald trump    agreement      climate               united states
 3 donald trump    agreement       paris                united states
 4 donald trump    agreement    united states                NA
 5 donald trump     climate      emission               united states
 6 donald trump   entertainer      host                  president
 7 hen             chicken       mustard                    wimp
 8 husband          pamela      private lives                NA
 9 pan             chicken         hen                      wimp
10 sex            associate        pamela                   partner
11 united kingdom  chicken         hen                      wimp
12 united states  agreement       paris                     NA

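I want the result as a data frame whose rows are built as follows.

For example, row 1 should stay as it is, because no other row is similar to it.

Rows 2, 3, 4, 5 and 12, on the other hand, should be combined into a single row such as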

united states  donald trump  paris  climate  agreement  emission

Rows 7, 9 and 11 should be merged into

united kingdom  chicken  hen  wimp  mustard

The values within a row can be in any order.

1 answer:

Answer 0: (score: 0)

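Assume the data frame DF shown reproducibly in the Note at the end.

Convert it to a character matrix m. Let us assume that two rows are similar if they have more than one element in common, and define is_similar to take two row indices and return TRUE or FALSE accordingly. Then apply it to every pair of rows using outer. Interpret the result as the adjacency matrix of a graph and compute its connected components. Split DF into a list spl, each element of which is a data frame holding the rows of DF that make up one connected component, and reduce each element to its unique non-NA values to get the list L. Finally, rework L into a character matrix.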
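To make the similarity rule concrete before applying it to DF, here is a minimal sketch on a three-row toy matrix. The matrix toy and the names is_similar_toy and smat_toy are made up for illustration and are not part of the original data or answer; the test itself is the same "more than one element in common" rule.

```r
# Toy character matrix (hypothetical values); NA marks an empty cell
toy <- rbind(
  c("a", "b", "c", NA),
  c("b", "c", "d", "e"),
  c("a", "x", "y", NA)
)

# Two rows are similar if they share more than one non-NA value
is_similar_toy <- function(i, j) {
  length(intersect(na.omit(toy[i, ]), na.omit(toy[j, ]))) > 1
}

# outer() hands whole index vectors to its function, so the
# scalar test has to be wrapped in Vectorize()
n_toy <- nrow(toy)
smat_toy <- outer(seq_len(n_toy), seq_len(n_toy), Vectorize(is_similar_toy))
print(smat_toy)
```

Rows 1 and 2 share "b" and "c" and so count as similar, while row 3 shares only "a" with row 1 and stays on its own.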

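library(igraph)

# Character matrix of the data; each row is one record
m <- as.matrix(DF)
n <- nrow(m)

# Two rows are similar if they share more than one non-NA value
is_similar <- function(i, j) length(intersect(na.omit(m[i, ]), na.omit(m[j, ]))) > 1
smat <- outer(1:n, 1:n, Vectorize(is_similar))

# Treat smat as an adjacency matrix and label the connected components
# (graph_from_adjacency_matrix() is the current name of graph.adjacency())
adj <- graph_from_adjacency_matrix(smat)
cl <- components(adj)$membership

str(split(1:n, cl))
## List of 6
##  $ 1: int 1
##  $ 2: int [1:5] 2 3 4 5 12
##  $ 3: int 6
##  $ 4: int [1:3] 7 9 11
##  $ 5: int 8
##  $ 6: int 10

# Collect the unique non-NA values of each component
spl <- split(DF, cl)
L <- lapply(spl, function(x) na.omit(unique(unlist(x))))
# cbind() on ts objects pads the shorter vectors with NA
t(do.call("cbind", lapply(L, ts)))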

giving:

  [,1]           [,2]            [,3]             [,4]        [,5]      [,6]      
1 "application"  "android"       "ios"            NA          NA        NA        
2 "donald_trump" "united_states" "agreement"      "climate"   "paris"   "emission"
3 "donald_trump" "entertainer"   "host"           "president" NA        NA        
4 "hen"          "pan"           "united_kingdom" "chicken"   "mustard" "wimp"    
5 "husband"      "pamela"        "private_lives"  NA          NA        NA        
6 "sex"          "associate"     "pamela"         "partner"   NA        NA      
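The ts call in the last line of the code is a padding trick: cbind on time series of unequal length fills the shorter ones with NA. A possibly more transparent sketch of that padding step in base R is shown below. It is an alternative, not what the answer uses, and the list L_demo is a made-up stand-in for L.

```r
# Hypothetical stand-in for the list L of unique values per component
L_demo <- list(
  `1` = c("application", "android", "ios"),
  `2` = c("hen", "pan", "united_kingdom", "chicken", "mustard", "wimp")
)

# Indexing past the end of a vector yields NA, which pads the short rows
width <- max(lengths(L_demo))
padded <- t(sapply(L_demo, function(x) x[seq_len(width)]))
print(padded)
```

Each list element becomes one row of the result, and the row names carry over from the names of the list.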

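Note: The input in reproducible form is given below; multi-word values are joined with underscores so that read.table can parse them.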

Lines <- "
 New_ment1_1   New_ment1_2     New_ment1_3            New_ment1_4
 1 application     android           ios                     NA
 2 donald_trump    agreement      climate               united_states
 3 donald_trump    agreement       paris                united_states
 4 donald_trump    agreement    united_states                NA
 5 donald_trump     climate      emission               united_states
 6 donald_trump   entertainer      host                  president
 7 hen             chicken       mustard                    wimp
 8 husband          pamela      private_lives                NA
 9 pan             chicken         hen                      wimp
10 sex            associate        pamela                   partner
11 united_kingdom  chicken         hen                      wimp
12 united_states  agreement       paris                     NA"

DF <- read.table(text = Lines, header = TRUE, as.is = TRUE)

Update: Fixed the definition of similarity.
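For readers who would rather not depend on igraph, the connected-components step could be sketched in base R as below. This swaps in a different technique (simple label propagation) from the one the answer uses. The names component_labels and adj_demo are hypothetical, and the small matrix is made up for illustration.

```r
# Hypothetical symmetric similarity matrix for five rows:
# rows 1-2-3 are chained together, rows 4 and 5 stand alone
adj_demo <- matrix(FALSE, 5, 5)
adj_demo[1, 2] <- adj_demo[2, 1] <- TRUE
adj_demo[2, 3] <- adj_demo[3, 2] <- TRUE
diag(adj_demo) <- TRUE

# Give every vertex the smallest index reachable from it: repeatedly
# take the minimum label among a vertex and its neighbours until
# no label changes any more
component_labels <- function(adj) {
  idx <- seq_len(nrow(adj))
  labels <- idx
  repeat {
    new_labels <- vapply(
      idx,
      function(i) min(labels[adj[i, ] | idx == i]),
      integer(1)
    )
    if (identical(new_labels, labels)) break
    labels <- new_labels
  }
  # Renumber the labels as 1, 2, ... in order of first appearance
  match(labels, unique(labels))
}

cl_demo <- component_labels(adj_demo)
print(cl_demo)
```

Under these assumptions, component_labels(smat) could stand in for components(adj)$membership in the answer. The loop is quadratic per pass, so it is only meant for small inputs like this one.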