Comparing duplicate samples

Date: 2017-10-09 17:44:42

Tags: r dataframe

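I have a data frame with a set of 1200 cases, each run in duplicate, for 2400 rows in total, e.g. A1.1234567_10 and A1.1234567_20. I want to compare multiple columns and find out, for each duplicate pair, whether the results in each column are identical or discordant. The columns contain factors, so how can I make the comparison work sensibly for factors? I want to select each case by the ID shared by the _10 and _20 rows (i.e. A1.1234567):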

Example (one duplicate pair from the data frame):

A1.1234567_10 NORMAL NORMAL NORMAL NORMAL NORMAL NORMAL NORMAL NORMAL

A1.1234567_20 NORMAL NORMAL NORMAL NORMAL NORMAL NORMAL ABNORMAL NORMAL 

I would like the output to look like this (a new data frame):

A1.1234567 TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE

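This would be repeated for every sample in the data frame, matching rows on the unique ID number and comparing the _10 row against the _20 row in each column.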

2 answers:

Answer 0: (score: 3)

Here is a tidyverse option:

library(tidyverse)

df <- structure(list(ID = c("A1.1234567_10", "A1.1234567_20"), 
                     var1 = c("NORMAL", "NORMAL"), 
                     var2 = c("NORMAL", "NORMAL"), 
                     var3 = c("NORMAL", "NORMAL"), 
                     var4 = c("NORMAL", "NORMAL"), 
                     var5 = c("NORMAL", "NORMAL"), 
                     var6 = c("NORMAL", "NORMAL"), 
                     var7 = c("NORMAL", "ABNORMAL"), 
                     var8 = c("NORMAL", "NORMAL")), 
                .Names = c("ID", "var1", "var2", "var3", "var4", "var5", "var6", "var7", "var8"), 
                class = "data.frame", row.names = c(NA, -2L))

# separate group variable from observation label
df_tidy <- df %>% separate(ID, c('ID', 'obs'), sep = '_')

df_tidy
#>           ID obs   var1   var2   var3   var4   var5   var6     var7   var8
#> 1 A1.1234567  10 NORMAL NORMAL NORMAL NORMAL NORMAL NORMAL   NORMAL NORMAL
#> 2 A1.1234567  20 NORMAL NORMAL NORMAL NORMAL NORMAL NORMAL ABNORMAL NORMAL

df_tidy %>% 
    select(-obs) %>% 
    group_by(ID) %>% 
    summarise_all(lift(`==`))
#> # A tibble: 1 x 9
#>           ID  var1  var2  var3  var4  var5  var6  var7  var8
#>        <chr> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
#> 1 A1.1234567  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE
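
For readers on a newer setup: `lift()` is deprecated as of purrr 1.0.0 and `summarise_all()` has been superseded by `across()`, so the same idea can be sketched as below. This assumes dplyr 1.0.0 or later and uses a small made-up data frame (`df_fct`, not the asker's data) whose columns are factors, as in the question. Each column is converted to character before counting distinct values, which sidesteps any factor-level quirks.

```r
library(dplyr)
library(tidyr)

# Hypothetical data: two duplicate pairs, result columns stored as factors
df_fct <- data.frame(
  ID   = c("A1.1234567_10", "A1.1234567_20", "B2.7654321_10", "B2.7654321_20"),
  var1 = factor(c("NORMAL", "NORMAL", "NORMAL", "NORMAL")),
  var2 = factor(c("NORMAL", "ABNORMAL", "ABNORMAL", "ABNORMAL")),
  stringsAsFactors = FALSE
)

concordance <- df_fct %>%
  # split "A1.1234567_10" into the shared ID and the replicate label
  separate(ID, c("ID", "obs"), sep = "_") %>%
  select(-obs) %>%
  group_by(ID) %>%
  # TRUE when every replicate in the group has the same value in that column
  summarise(across(everything(), ~ n_distinct(as.character(.x)) == 1))

concordance
```

Note that `n_distinct()` counts `NA` as a value by default, so a pair with one missing result is reported as discordant and a pair with two missing results as concordant. Whether that is the desired behaviour depends on your data.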

Answer 1: (score: 0)

Another approach using the tidyverse (with @alistaire's dput):

library(tidyverse)
library(stringr)
df %>%
  group_by(ID = str_extract(ID, ".+(?=_)")) %>%
  summarize_all(funs(dim(table(.)) == 1))

Result:

# A tibble: 1 x 9
          ID  var1  var2  var3  var4  var5  var6  var7  var8
       <chr> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
1 A1.1234567  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE
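
One caveat with a "number of distinct values equals 1" test: an ID that appears only once, with no duplicate, would also come out as all TRUE. With 2400 rows it may be worth confirming the pairing first. The following is a minimal base R sketch of that check plus the same comparison without any packages. The data frame `df_pairs` and the helper names `key`, `value_cols` and `unpaired` are made up for illustration.

```r
# Hypothetical data: two duplicate pairs
df_pairs <- data.frame(
  ID   = c("A1.1234567_10", "A1.1234567_20", "B2.7654321_10", "B2.7654321_20"),
  var1 = c("NORMAL", "NORMAL", "NORMAL", "NORMAL"),
  var2 = c("NORMAL", "ABNORMAL", "ABNORMAL", "ABNORMAL"),
  stringsAsFactors = FALSE
)

# Strip the trailing "_10" / "_20" to get the shared case ID
key <- sub("_[^_]*$", "", df_pairs$ID)

# Sanity check: IDs that do not occur exactly twice
unpaired <- names(which(table(key) != 2))

# Per ID and per column: TRUE when all replicates agree
value_cols <- setdiff(names(df_pairs), "ID")
result <- aggregate(
  df_pairs[value_cols],
  by  = list(ID = key),
  FUN = function(x) length(unique(x)) == 1
)

result
```

If `unpaired` is not empty, those IDs could be inspected or dropped before the TRUE/FALSE table is interpreted.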