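I have a dataset in which different rows have different combinations of elements, and I want to pull out the groups of rows that share the same combination. For this example dataset: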
id <- c("A", "B", "C", "D")
X1 <- c(NA,NA,NA,"X1")
X2 <- c(NA,NA,"X2","X2")
X3 <- c("X3","X3","X3","X3")
X4 <- c("X4", "X4", "X4", "X4")
df <- data.frame(id,X1,X2,X3,X4)
> df
id X1 X2 X3 X4
1 A <NA> <NA> X3 X4
2 B <NA> <NA> X3 X4
3 C <NA> X2 X3 X4
4 D X1 X2 X3 X4
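I would like to be able to pull out the groups of rows that share a combination; here, A and B both contain only X3 and X4.
I tried splitting the data frame into a list and dropping the empty cells, so that each id gets its own data.frame in the list: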
df.list <- split(df, seq(nrow(df)))
dfComplete.list <- lapply(df.list, function(remNA) remNA[,colSums(is.na(remNA)) < nrow(remNA)])
which leaves me with
> dfComplete.list
$`1`
id X3 X4
1 1 X3 X4
$`2`
id X3 X4
2 2 X3 X4
$`3`
id X2 X3 X4
3 3 X2 X3 X4
$`4`
id X1 X2 X3 X4
4 4 X1 X2 X3 X4
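I'm at a loss as to where to go from here. Is there a way to group the data frames in the list by the elements/columns they have in common?
The real dataset I'm working with actually has elements/columns X7 through X17, with 1 to 4 elements per id, so the ideal solution would identify every combination of elements that occurs in my data.
Finally, before I reshaped my data into the format above, it was originally in the following format, in case a solution is easier to reach from the original layout: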
id <- c("A", "A", "B", "B", "C", "C", "C", "D", "D", "D", "D")
elements <- c("X3", "X4", "X3", "X4", "X2", "X3", "X4", "X1", "X2", "X3", "X4")
dataLong <- data.frame(id, elements)
> dataLong
id elements
1 A X3
2 A X4
3 B X3
4 B X4
5 C X2
6 C X3
7 C X4
8 D X1
9 D X2
10 D X3
11 D X4
Thanks in advance for your help!
Answer 0 (score: 0)
The reshape2::dcast function can convert the data from the long format into the wide format the OP expects.
#Data
id <- c("A", "A", "B", "B", "C", "C", "C", "D", "D", "D", "D")
elements <- c("X3", "X4", "X3", "X4", "X2", "X3", "X4", "X1", "X2", "X3", "X4")
dataLong <- data.frame(id, elements, stringsAsFactors = FALSE)
library(reshape2)
#Use dcast to get the result; value.var names the column that fills the cells
dcast(dataLong, id ~ elements, value.var = "elements")
# id X1 X2 X3 X4
# 1 A <NA> <NA> X3 X4
# 2 B <NA> <NA> X3 X4
# 3 C <NA> X2 X3 X4
# 4 D X1 X2 X3 X4
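The wide table still leaves the grouping step open. As a minimal base R sketch (not part of the original answer), one option is to build a key per id from the long data and split the ids by that key. The names `combo` and `groups` and the "+" separator are illustrative assumptions.

```r
# Example long-format data from the question
id <- c("A", "A", "B", "B", "C", "C", "C", "D", "D", "D", "D")
elements <- c("X3", "X4", "X3", "X4", "X2", "X3", "X4", "X1", "X2", "X3", "X4")
dataLong <- data.frame(id, elements, stringsAsFactors = FALSE)

# One key per id: its elements, de-duplicated, sorted, and pasted together
# (the "+" separator is an arbitrary choice for this sketch)
combo <- sapply(split(dataLong$elements, dataLong$id),
                function(x) paste(sort(unique(x)), collapse = "+"))

# Group the ids by key; each list entry holds the ids sharing one combination
groups <- split(names(combo), combo)
groups
```

Sorting the elements before pasting makes the key independent of row order, so two ids with the same elements always land in the same group.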
Answer 1 (score: 0)
As I understand it, you want to count the unique combinations. This is how I would do it:
library(dplyr)
library(tidyr)
dataLong %>% mutate(value=1) %>%
spread(elements, value) %>%
select(-id) %>%
group_by_all() %>%
summarise(count=n()) %>% ungroup()
#> # A tibble: 3 x 5
#> X1 X2 X3 X4 count
#> <dbl> <dbl> <dbl> <dbl> <int>
#> 1 1 1 1 1 1
#> 2 NA 1 1 1 1
#> 3 NA NA 1 1 2
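The code above drops id, so the counts no longer say which ids fall in each combination. A hedged variant that stays in the long format and keeps the ids might look like the following; the column names `combo` and `ids` and the "+" separator are assumptions made for this sketch.

```r
library(dplyr)

# Example long-format data from the question
id <- c("A", "A", "B", "B", "C", "C", "C", "D", "D", "D", "D")
elements <- c("X3", "X4", "X3", "X4", "X2", "X3", "X4", "X1", "X2", "X3", "X4")
dataLong <- data.frame(id, elements, stringsAsFactors = FALSE)

combos <- dataLong %>%
  group_by(id) %>%
  # collapse each id's elements into a single, order-independent key
  summarise(combo = paste(sort(elements), collapse = "+")) %>%
  group_by(combo) %>%
  # count the ids per key and record which ids they are
  summarise(count = n(), ids = paste(id, collapse = ","))
combos
```

Because the key is built from whatever elements are present, this does not need the element columns to be listed by name.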
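Answer 2 (score: 0)
You can use the tidyverse! Using arrange() is a bit redundant, but I wanted to show you the option because it orders your data frame to reflect the groupings you are interested in (you can think of it as a nested sort). That may be all you need.
If you want actual counts, along with a column that tells you which ids correspond to which combinations, just run the full code below. Note that you will have to add all of your variables (X7:X17) to the full code. When declaring the data frame you will also want to use stringsAsFactors = FALSE, which is good practice in general.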
# Your example dataframe. Make sure to set stringsAsFactors = FALSE
id <- c("A", "B", "C", "D")
X1 <- c(NA,NA,NA,"X1")
X2 <- c(NA,NA,"X2","X2")
X3 <- c("X3","X3","X3","X3")
X4 <- c("X4", "X4", "X4", "X4")
df <- data.frame(id,X1,X2,X3,X4, stringsAsFactors = FALSE)
# We group rows by all unique combinations and then collapse those rows,
# while recording which ids belong to which grouping, and how many there are
# in each.
library(tidyverse)
ndf <- arrange(df, X1,X2,X3,X4) %>%
group_by(X1,X2,X3,X4) %>%
summarise(num = n(), id = paste(id, collapse=","))
# Output:
# A tibble: 3 x 6
# Groups: X1, X2, X3 [?]
#   X1    X2    X3    X4      num id
#   <chr> <chr> <chr> <chr> <int> <chr>
# 1 X1    X2    X3    X4        1 D
# 2 <NA>  X2    X3    X4        1 C
# 3 <NA>  <NA>  X3    X4        2 A,B
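With X7:X17 in the real data, typing every column into group_by() gets tedious. A possible sketch, assuming dplyr 1.0 or later and assuming every column other than id is an element column, selects the grouping columns programmatically; `element_cols` is a name introduced here for illustration.

```r
library(dplyr)

# Example data frame from the question
df <- data.frame(
  id = c("A", "B", "C", "D"),
  X1 = c(NA, NA, NA, "X1"),
  X2 = c(NA, NA, "X2", "X2"),
  X3 = c("X3", "X3", "X3", "X3"),
  X4 = c("X4", "X4", "X4", "X4"),
  stringsAsFactors = FALSE
)

# Assumption: every column except id is an element column
element_cols <- setdiff(names(df), "id")

ndf <- df %>%
  # group by all element columns without naming them one by one
  group_by(across(all_of(element_cols))) %>%
  summarise(num = n(), id = paste(id, collapse = ","), .groups = "drop")
ndf
```

The same code should then apply unchanged to a data frame whose element columns are X7 through X17, provided id is the only non-element column.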