I have a data frame like the following.
New_ment1_1 New_ment1_2 New_ment1_3 New_ment1_4
1 application android ios NA
2 donald trump agreement climate united states
3 donald trump agreement paris united states
4 donald trump agreement united states NA
5 donald trump climate emission united states
6 donald trump entertainer host president
7 hen chicken mustard wimp
8 husband pamela private lives NA
9 pan chicken hen wimp
10 sex associate pamela partner
11 united kingdom chicken hen wimp
12 united states agreement paris NA
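I would like the result as a data frame whose rows are as follows.
For example, row 1 should stay as it is, because no other row is similar to it.
Rows 2, 3, 4, 5 and 12 should be combined into a single row, such as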
united states donald trump paris climate agreement emission
Rows 7, 9 and 11 should be merged into
united kingdom chicken hen wimp mustard
The elements can be in any order.
Answer 0 (score: 0)
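Assume the data frame DF shown reproducibly in the Note at the end. Convert it to a character matrix m. Let us say that two rows are similar if they have more than one element in common, and define is_similar to take two row indices and return TRUE or FALSE accordingly. Then use outer to apply it to every pair of rows. Interpret the result as the adjacency matrix of a graph and compute its connected components. Split DF by component into a list spl, in which each element is a data frame of the rows of DF that make up one connected component, and reduce each element to its unique non-NA values to get the list L. Finally, rewrite L as a character matrix.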
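library(igraph)
# Character matrix of the data frame, one row per record
m <- as.matrix(DF)
n <- nrow(m)
# Two rows are similar if they share more than one non-NA element
is_similar <- function(i, j) length(intersect(na.omit(m[i, ]), na.omit(m[j, ]))) > 1
# Logical n x n matrix of pairwise similarity
smat <- outer(1:n, 1:n, Vectorize(is_similar))
# Treat it as an adjacency matrix (graph.adjacency is the older name of this function)
adj <- graph_from_adjacency_matrix(smat * 1)
# Connected-component label of each row
cl <- components(adj)$membership
str(split(1:n, cl))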
## List of 6
## $ 1: int 1
## $ 2: int [1:5] 2 3 4 5 12
## $ 3: int 6
## $ 4: int [1:3] 7 9 11
## $ 5: int 8
## $ 6: int 10
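# Split the rows of DF by connected component
spl <- split(DF, cl)
# Unique non-NA values within each component
L <- lapply(spl, function(x) na.omit(unique(unlist(x))))
# ts lets cbind pad the shorter vectors with NA; transpose to one row per component
t(do.call("cbind", lapply(L, ts)))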
giving:
[,1] [,2] [,3] [,4] [,5] [,6]
1 "application" "android" "ios" NA NA NA
2 "donald_trump" "united_states" "agreement" "climate" "paris" "emission"
3 "donald_trump" "entertainer" "host" "president" NA NA
4 "hen" "pan" "united_kingdom" "chicken" "mustard" "wimp"
5 "husband" "pamela" "private_lives" NA NA NA
6 "sex" "associate" "pamela" "partner" NA NA
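If igraph is not available, the connected-component step could be approximated in base R. The following is only a sketch under stated assumptions: it uses a five-row subset of the input with NA already dropped, and label_components is a hypothetical helper (not part of any package) that repeatedly gives each row the smallest label found among its similar rows until nothing changes.

```r
# A subset of the input rows as character vectors (NA already dropped).
rows <- list(
  c("donald_trump", "agreement", "climate", "united_states"),
  c("donald_trump", "agreement", "paris", "united_states"),
  c("hen", "chicken", "mustard", "wimp"),
  c("pan", "chicken", "hen", "wimp"),
  c("united_states", "agreement", "paris")
)
n <- length(rows)

# Same similarity rule as in the answer: more than one shared element.
similar <- function(i, j) length(intersect(rows[[i]], rows[[j]])) > 1
adj <- outer(seq_len(n), seq_len(n), Vectorize(similar))

# Hypothetical helper: propagate the smallest label through each
# row's neighbourhood (the row itself included) until labels are stable.
label_components <- function(adj) {
  idx <- seq_len(nrow(adj))
  lab <- idx
  repeat {
    new_lab <- vapply(idx, function(i) min(lab[adj[i, ] | (idx == i)]), integer(1))
    if (all(new_lab == lab)) break
    lab <- new_lab
  }
  # Renumber the labels as 1, 2, ... in order of first appearance.
  match(lab, unique(lab))
}

cl <- label_components(adj)

# Merge the rows of each component into their unique elements.
merged <- lapply(split(rows, cl), function(g) unique(unlist(g)))
print(merged)
```

The loop is quadratic per pass, so for large inputs a dedicated graph library is likely the better choice; the sketch is only meant to make the grouping step visible.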
Note: The input in reproducible form is:
Lines <- "
New_ment1_1 New_ment1_2 New_ment1_3 New_ment1_4
1 application android ios NA
2 donald_trump agreement climate united_states
3 donald_trump agreement paris united_states
4 donald_trump agreement united_states NA
5 donald_trump climate emission united_states
6 donald_trump entertainer host president
7 hen chicken mustard wimp
8 husband pamela private_lives NA
9 pan chicken hen wimp
10 sex associate pamela partner
11 united_kingdom chicken hen wimp
12 united_states agreement paris NA"
DF <- read.table(text = Lines, header = TRUE, as.is = TRUE)
Update: Fixed the similarity definition.
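To make the similarity rule concrete (more than one shared non-NA element), here is a small sketch using four rows copied from the input. The names r3, r8, r10, r12 and the helper shared are introduced only for this illustration.

```r
# Rows 8 and 10 share only "pamela"; rows 3 and 12 share three elements.
r8  <- c("husband", "pamela", "private_lives", NA)
r10 <- c("sex", "associate", "pamela", "partner")
r3  <- c("donald_trump", "agreement", "paris", "united_states")
r12 <- c("united_states", "agreement", "paris", NA)

# Hypothetical helper: elements common to both rows, ignoring NA.
shared <- function(a, b) intersect(na.omit(a), na.omit(b))

sim_8_10 <- length(shared(r8, r10)) > 1   # FALSE: one element in common
sim_3_12 <- length(shared(r3, r12)) > 1   # TRUE: three elements in common

# Without na.omit, NA itself would count as a shared element,
# because intersect treats NA as matching NA.
na_counted <- length(intersect(r8, r12))
print(c(sim_8_10, sim_3_12))
```

This is why the rows containing pamela stay in separate groups in the output: a single common element is not enough under this definition.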