Missing values in grouped data in R

Time: 2018-05-09 15:43:37

Tags: r dplyr

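Apologies if this is a simple question. I have data in tidy (long) format. For each sample in `Sample Name`, I would like to see the set difference of the values in `Factor Name`, that is, which factor names are missing for that sample.

# Groups: Sample Name
   `Sample Name` `Factor Name`   mean
   <fct>         <fct>          <dbl>
 1 S1            ABCD           -5.15
 2 S1            EFGH            7.74
 3 S1            IJKL           -7.43
 4 S2            ABCD            4.35
 5 S2            EFGH           -2.15
 6 S2            IJKL            2.33
 7 S3            ABCD            5.53
 8 S3            EFGH            2.84
 9 S3            IJKL            1.61
10 S3            MNOP          NaN

I believe this can be done with the group_by function.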

aggregate(`Factor Name` ~ `Sample Name`, df, FUN = function(x) setdiff(unique(df$`Factor Name`), x))

I also tried aggregate, as shown above. It produces output, but I would prefer a group_by or pipe-based approach.

If possible, I would also like to add the missing `Factor Name` rows for each `Sample Name`, as shown below:

# Groups: Sample Name
   `Sample Name` `Factor Name`   mean
   <fct>         <fct>          <dbl>
 1 S1            ABCD           -5.15
 2 S1            EFGH            7.74
 3 S1            IJKL           -7.43
 4 S1            MNOP          NaN
 5 S2            ABCD            4.35
 6 S2            EFGH           -2.15
 7 S2            IJKL            2.33
 8 S2            MNOP          NaN
 9 S3            ABCD            5.53
10 S3            EFGH            2.84
11 S3            IJKL            1.61
12 S3            MNOP          NaN


1 answer:

Answer 0: (score: 1)

The tidyr::expand and tidyr::complete functions can help you achieve this.

Load the packages:

library(dplyr)
library(tidyr)

Create a dummy dataset:

# tibble() replaces the deprecated data_frame(); the mean values are random
df <- tibble(sample_name = factor(c(rep(c('S1', 'S2', 'S3'), each = 3), 'S3')),
             factor_name = factor(c(rep(c('ABCD', 'EFGH', 'IJKL'), 3), 'MNOP')),
             mean = rnorm(n = 10, sd = 10))

Question 1

Get the set difference of the values in factor_name for each sample in sample_name:

# Return ONLY those levels of sample_name that are missing a level of factor_name
df %>% 
    # Expand to all unique combinations
    expand(sample_name, factor_name) %>% 
    # Extract the difference
    setdiff(., select(df, -mean)) 

#> # A tibble: 2 x 2
#>   sample_name factor_name
#>   <fct>       <fct>      
#> 1 S1          MNOP       
#> 2 S2          MNOP

# Return ALL levels of sample_name, along with any missing levels of factor_name
df %>% 
    # Expand to all unique combinations
    expand(sample_name, factor_name) %>% 
    # Extract the difference
    setdiff(., select(df, -mean)) %>% 
    # Expand to show all levels of sample_name
    complete(sample_name)

#> # A tibble: 3 x 2
#>   sample_name factor_name
#>   <fct>       <fct>      
#> 1 S1          MNOP       
#> 2 S2          MNOP       
#> 3 S3          <NA>
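
Since the question asked for a group_by-based approach, the same set difference can also be sketched with group_by and summarise. This is only a minimal illustration, not part of the original answer: the fixed mean values and the names all_factors and missing_by_sample are assumptions made for the example.

```r
library(dplyr)

# Toy data shaped like the dummy dataset above (mean values are arbitrary)
df <- tibble(
  sample_name = factor(c(rep(c("S1", "S2", "S3"), each = 3), "S3")),
  factor_name = factor(c(rep(c("ABCD", "EFGH", "IJKL"), 3), "MNOP")),
  mean = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
)

# Every factor name that occurs anywhere in the data
all_factors <- levels(df$factor_name)

# For each sample, store the factor names it lacks in a list column
missing_by_sample <- df %>%
  group_by(sample_name) %>%
  summarise(missing = list(setdiff(all_factors, as.character(factor_name))))
```

Here missing_by_sample has one row per sample, and samples with nothing missing get an empty character vector in the missing column.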

Question 2

Add the missing factor_name rows for each sample_name:

# Expand to include ALL levels of factor_name within sample_name
df %>% 
    complete(sample_name, factor_name) 

#> # A tibble: 12 x 3
#>    sample_name factor_name     mean
#>    <fct>       <fct>          <dbl>
#>  1 S1          ABCD         16.6   
#>  2 S1          EFGH         -0.0803
#>  3 S1          IJKL          4.80  
#>  4 S1          MNOP         NA     
#>  5 S2          ABCD          3.80  
#>  6 S2          EFGH         -1.24  
#>  7 S2          IJKL          1.50  
#>  8 S2          MNOP         NA     
#>  9 S3          ABCD         -5.94  
#> 10 S3          EFGH         10.4   
#> 11 S3          IJKL        -14.3   
#> 12 S3          MNOP         -6.87
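
The desired output in the question shows NaN rather than NA for the added rows, and the question's data is printed as grouped by sample name. A possible variant, sketched under those assumptions, ungroups first as a cautious step and uses the fill argument of complete to insert NaN. The fixed mean values and the name completed are illustrative only.

```r
library(dplyr)
library(tidyr)

# Toy data, grouped by sample_name as in the question's printout
# (mean values are arbitrary)
df <- tibble(
  sample_name = factor(c(rep(c("S1", "S2", "S3"), each = 3), "S3")),
  factor_name = factor(c(rep(c("ABCD", "EFGH", "IJKL"), 3), "MNOP")),
  mean = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
) %>%
  group_by(sample_name)

completed <- df %>%
  # Drop the grouping so all combinations are built across the whole table
  ungroup() %>%
  # Add the absent sample/factor pairs and fill their mean with NaN
  complete(sample_name, factor_name, fill = list(mean = NaN))
```

Note that fill also replaces missing values already present in the mean column, which makes no difference when the replacement is itself NaN.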

Created on 2018-05-10 by the reprex package (v0.2.0).