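I have a data frame containing one set of 1,200 cases, each in duplicate, for 2,400 rows in total, e.g. A1.1234567_10 and A1.1234567_20. I would like to compare multiple columns and find out, for each duplicate pair, whether the result in each column is the same or discordant. The columns contain factors; how can I get a sensible comparison for factors? I would like to select each case by the ID (e.g. A1.1234567) shared by its _10 and _20 rows:
Example (one pair of rows from the data frame)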
A1.1234567_10 NORMAL NORMAL NORMAL NORMAL NORMAL NORMAL NORMAL NORMAL
A1.1234567_20 NORMAL NORMAL NORMAL NORMAL NORMAL NORMAL ABNORMAL NORMAL
I would like the output to look like this (a new data frame)
A1.1234567 TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE
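This would be obtained by comparing the _10 and _20 duplicates of each unique ID number, for every sample, across all columns.
Answer 0 (score: 3)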
Here's a tidyverse option:
library(tidyverse)
df <- structure(list(ID = c("A1.1234567_10", "A1.1234567_20"),
                     var1 = c("NORMAL", "NORMAL"),
                     var2 = c("NORMAL", "NORMAL"),
                     var3 = c("NORMAL", "NORMAL"),
                     var4 = c("NORMAL", "NORMAL"),
                     var5 = c("NORMAL", "NORMAL"),
                     var6 = c("NORMAL", "NORMAL"),
                     var7 = c("NORMAL", "ABNORMAL"),
                     var8 = c("NORMAL", "NORMAL")),
                .Names = c("ID", "var1", "var2", "var3", "var4", "var5", "var6", "var7", "var8"),
                class = "data.frame", row.names = c(NA, -2L))
# separate group variable from observation label
df_tidy <- df %>% separate(ID, c('ID', 'obs'), sep = '_')
df_tidy
#> ID obs var1 var2 var3 var4 var5 var6 var7 var8
#> 1 A1.1234567 10 NORMAL NORMAL NORMAL NORMAL NORMAL NORMAL NORMAL NORMAL
#> 2 A1.1234567 20 NORMAL NORMAL NORMAL NORMAL NORMAL NORMAL ABNORMAL NORMAL
# drop the observation label, then test each column for equality within each ID
df_tidy %>%
  select(-obs) %>%
  group_by(ID) %>%
  summarise_all(lift(`==`))
#> # A tibble: 1 x 9
#> ID var1 var2 var3 var4 var5 var6 var7 var8
#> <chr> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
#> 1 A1.1234567 TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE
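In recent releases, purrr's lift() is deprecated and dplyr's summarise_all() is superseded by across(), so the same idea might be written as in the sketch below. It is only an illustration under stated assumptions: the second ID (B2.7654321) and the cut-down set of two factor columns are made up here to show more than one pair, and a pair counts as concordant when its group holds exactly one distinct value.

```r
library(dplyr)
library(tidyr)

# Hypothetical data: two duplicate pairs, factor columns as in the question
df2 <- data.frame(
  ID   = c("A1.1234567_10", "A1.1234567_20", "B2.7654321_10", "B2.7654321_20"),
  var1 = factor(c("NORMAL", "NORMAL", "ABNORMAL", "NORMAL")),
  var2 = factor(c("NORMAL", "ABNORMAL", "NORMAL", "NORMAL"))
)

concordance <- df2 %>%
  separate(ID, c("ID", "obs"), sep = "_") %>%   # split "A1.1234567_10" into ID and obs
  select(-obs) %>%
  group_by(ID) %>%
  # TRUE when every row of the pair carries the same value in that column
  summarise(across(everything(), ~ n_distinct(.x) == 1))

concordance
```

Because n_distinct() counts the values actually present, this works the same way for character and factor columns, and it does not assume that each ID has exactly two rows.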
Answer 1 (score: 0)
Another approach using tidyverse (with @alistaire's dput):
library(tidyverse)
library(stringr)
df %>%
  group_by(ID = str_extract(ID, ".+(?=_)")) %>%
  summarize_all(funs(dim(table(.)) == 1))
Result:
# A tibble: 1 x 9
ID var1 var2 var3 var4 var5 var6 var7 var8
<chr> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
1 A1.1234567 TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE
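One caveat for the factor columns mentioned in the question: table() tabulates every level of a factor, including levels that do not occur in the group, so dim(table(.)) == 1 can report FALSE for a pair that actually agrees. The example data above uses character columns, where this does not arise. A minimal base R sketch of the difference, using a made-up two-level factor:

```r
# A concordant pair stored as a factor with two levels
x <- factor(c("NORMAL", "NORMAL"), levels = c("NORMAL", "ABNORMAL"))

uses_table   <- dim(table(x)) == 1              # table() also counts the unused level
uses_dropped <- dim(table(droplevels(x))) == 1  # unused levels removed first
uses_unique  <- length(unique(x)) == 1          # counts only values actually present

c(uses_table, uses_dropped, uses_unique)
#> [1] FALSE  TRUE  TRUE
```

If the columns are factors, dropping unused levels first or counting unique values avoids the false mismatch.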