Conditional calculation based on group in R

Time: 2019-07-12 04:40:37

Tags: r dplyr conditional-statements grouping data-manipulation

The data look like this:

df1=data.frame(Date=as.Date(c('8/27/2001','8/27/2001','8/27/2001','11/13/2001','11/13/2001','11/13/2001','8/3/2012','8/3/2012'),format="%m/%d/%Y"),
    Name=c('Joe', 'Joe', 'Joe', 'Billy', 'Billy', 'Billy','Emma','Emma'),
    Sample=c('Pre','Post','Discard','Pre','Post','Discard','Bone','Pre'),
    Cells=c(15,7,3,12,5,2,14,NA))
    Date        Name    Sample Cells
1   2001-08-27  Joe     Pre     15
2   2001-08-27  Joe     Post    7
3   2001-08-27  Joe     Discard 3
4   2001-11-13  Billy   Pre     12
5   2001-11-13  Billy   Post    5
6   2001-11-13  Billy   Discard 2
7   2012-08-03  Emma    Bone    14
8   2012-08-03  Emma    Pre     NA

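I want to add a calculated column named "Yield", computed within each unique combination of Date and Name (for example, rows 1-3, 4-6, and 7-8 are three separate groups). The real data may be incomplete (see rows 7-8).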

The "Yield" column should be:

Cells where Sample="Post" divided by Cells where Sample="Pre"

Desired output:

    Date        Name    Sample Cells Yield
1   2001-08-27  Joe     Pre     15   NA
2   2001-08-27  Joe     Post    7    0.46
3   2001-08-27  Joe     Discard 3    NA
4   2001-11-13  Billy   Pre     12   NA
5   2001-11-13  Billy   Post    5    0.41
6   2001-11-13  Billy   Discard 2    NA
7   2012-08-03  Emma    Bone    14   NA
8   2012-08-03  Emma    Pre     NA   NA

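I'm new to R and would like to use it efficiently (for example, with dplyr). The above could be done with a loop, but I'm looking for a more elegant solution. I have consulted the following threads for guidance, but so far I haven't found a solution: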

Assign value to group based on condition in column

R create column from another column, depending on row

Conditional calculation in R based on Row values and categories

2 answers:

Answer 0: (score: 1)

You can do it like this:

library(dplyr)

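df1 %>%
  group_by(Date, Name) %>%
  # Within each Date/Name group, divide the Post count by the Pre count
  # and keep the result only on the Post row; every other row gets NA.
  mutate(Yield = ifelse(Sample == "Post", Cells[Sample == "Post"]/Cells[Sample == "Pre"], NA))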

# A tibble: 8 x 5
# Groups:   Date, Name [3]
  Date       Name  Sample  Cells  Yield
  <date>     <fct> <fct>   <dbl>  <dbl>
1 2001-08-27 Joe   Pre        15 NA    
2 2001-08-27 Joe   Post        7  0.467
3 2001-08-27 Joe   Discard     3 NA    
4 2001-11-13 Billy Pre        12 NA    
5 2001-11-13 Billy Post        5  0.417
6 2001-11-13 Billy Discard     2 NA    
7 2012-08-03 Emma  Bone       14 NA    
8 2012-08-03 Emma  Pre        NA NA    
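
The expression above relies on each Date/Name group having at most one "Pre" row and one "Post" row. As a minimal sketch of a more defensive variant, the version below takes only the first match of each with `[1]`, which yields NA when a group has no such row, and uses dplyr's stricter `if_else()`. The choice of "first match wins" and the name `df1_yield` are assumptions made for this illustration, not part of the answer.

```r
library(dplyr)

# Same data as in the question
df1 <- data.frame(
  Date = as.Date(c("8/27/2001", "8/27/2001", "8/27/2001",
                   "11/13/2001", "11/13/2001", "11/13/2001",
                   "8/3/2012", "8/3/2012"), format = "%m/%d/%Y"),
  Name = c("Joe", "Joe", "Joe", "Billy", "Billy", "Billy", "Emma", "Emma"),
  Sample = c("Pre", "Post", "Discard", "Pre", "Post", "Discard", "Bone", "Pre"),
  Cells = c(15, 7, 3, 12, 5, 2, 14, NA)
)

df1_yield <- df1 %>%
  group_by(Date, Name) %>%
  mutate(
    # x[cond][1] is NA when no row in the group matches, so a group
    # without a Pre or Post row ends up with an NA ratio.
    Yield = if_else(
      Sample == "Post",
      Cells[Sample == "Post"][1] / Cells[Sample == "Pre"][1],
      NA_real_
    )
  ) %>%
  ungroup()  # drop the grouping so later steps are not grouped by accident

print(df1_yield)
```

If a group could legitimately contain several Pre or Post rows, a different rule (for example summing them) may be more appropriate than taking the first one.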

Answer 1: (score: 1)

If you are not too attached to that specific table format, you can do the following:

library(dplyr)
library(tidyr)

df1 %>% 
    spread(Sample, Cells) %>% 
    mutate(Pre_Post_Yield = Post/Pre)

This returns a table that is easier to read:

        Date  Name Bone Discard Post Pre Pre_Post_Yield
1 2001-08-27   Joe   NA       3    7  15      0.4666667
2 2001-11-13 Billy   NA       2    5  12      0.4166667
3 2012-08-03  Emma   14      NA   NA  NA             NA

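To get back to long format, you can add gather(Sample, Cells, Bone:Pre). Note that the result will not look exactly like your sample output, because R fills in the Sample/group combinations that did not exist before. It may look a bit odd at first glance, but it is actually quite useful, for example because it makes your missing data explicit: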

         Date  Name Pre_Post_Yield  Sample Cells
1  2001-08-27   Joe      0.4666667    Bone    NA
2  2001-11-13 Billy      0.4166667    Bone    NA
3  2012-08-03  Emma             NA    Bone    14
4  2001-08-27   Joe      0.4666667 Discard     3
5  2001-11-13 Billy      0.4166667 Discard     2
6  2012-08-03  Emma             NA Discard    NA
7  2001-08-27   Joe      0.4666667    Post     7
8  2001-11-13 Billy      0.4166667    Post     5
9  2012-08-03  Emma             NA    Post    NA
10 2001-08-27   Joe      0.4666667     Pre    15
11 2001-11-13 Billy      0.4166667     Pre    12
12 2012-08-03  Emma             NA     Pre    NA
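
In tidyr 1.0.0 and later, `spread()` and `gather()` are superseded by `pivot_wider()` and `pivot_longer()`. A rough equivalent of the steps above, assuming that newer tidyr is installed, might look like the following; the object names `wide` and `long` are chosen here for illustration only.

```r
library(dplyr)
library(tidyr)

# Same data as in the question
df1 <- data.frame(
  Date = as.Date(c("8/27/2001", "8/27/2001", "8/27/2001",
                   "11/13/2001", "11/13/2001", "11/13/2001",
                   "8/3/2012", "8/3/2012"), format = "%m/%d/%Y"),
  Name = c("Joe", "Joe", "Joe", "Billy", "Billy", "Billy", "Emma", "Emma"),
  Sample = c("Pre", "Post", "Discard", "Pre", "Post", "Discard", "Bone", "Pre"),
  Cells = c(15, 7, 3, 12, 5, 2, 14, NA)
)

# One row per Date/Name, one column per Sample type, plus the ratio
wide <- df1 %>%
  pivot_wider(names_from = Sample, values_from = Cells) %>%
  mutate(Pre_Post_Yield = Post / Pre)

# Back to long format; combinations that were absent now appear as NA
long <- wide %>%
  pivot_longer(cols = c(Pre, Post, Discard, Bone),
               names_to = "Sample", values_to = "Cells")

print(wide)
print(long)
```

The column and row order will differ from the `spread()`/`gather()` output shown above, but the values are the same: 3 rows in the wide table and 12 in the long one.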