Creating a column with dplyr that computes a percentage based on two columns

Asked: 2018-07-23 15:44:31

Tags: r dplyr

I have a CSV file that looks like this:

Year, Answer, Total
2017, Yes, 100
2017, No, 10
2017, Yes, 100
2018, No, 40
2018, Yes, 200

I'm trying to create a column that computes the ratio of "No" to "Yes" within a given year, so that the result looks like this:

Year, Answer, Total, Ratio
2017, Yes, 100, 1
2017, No, 10, 0.05
2017, Yes, 100, 1
2018, No, 40, 0.2 
2018, Yes, 200, 1

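I'm using R and dplyr. I think I have to create a column holding the "Yes" total for the given year (the value will repeat across rows), and then use an ifelse statement to create another column in which the "Yes" rows are 1 and the "No" rows are the "No" total divided by the "Yes" total. Is there a more efficient way to do this? Thanks.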
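For reference, the two-step approach described above might be sketched as follows. This is only an illustration under stated assumptions: the object name `survey` and the helper columns `yes_total` and `no_total` are made up here and do not come from the question.

```r
library(dplyr)

# Sample data from the question (the name `survey` is an assumption)
survey <- data.frame(
  Year = c(2017, 2017, 2017, 2018, 2018),
  Answer = c("Yes", "No", "Yes", "No", "Yes"),
  Total = c(100, 10, 100, 40, 200),
  stringsAsFactors = FALSE
)

result <- survey %>%
  group_by(Year) %>%
  mutate(
    # Step 1: per-year totals, repeated on every row of that year
    yes_total = sum(Total[Answer == "Yes"]),
    no_total = sum(Total[Answer == "No"]),
    # Step 2: "Yes" rows get 1, "No" rows get the No/Yes ratio
    Ratio = if_else(Answer == "No", no_total / yes_total, 1)
  ) %>%
  ungroup()

result
```

Because `mutate()` is used rather than `summarise()`, all five input rows are kept, which matches the shape of the desired output.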

3 answers:

Answer 0 (score: 2)

How about this?

library(dplyr)

xdf <- data.frame(
  stringsAsFactors = FALSE,
  Year = c(2017, 2017, 2017, 2018, 2018),
  Answer = c("Yes", "No", "Yes", "No", "Yes"),
  Total = c(100, 10, 100, 40, 200)
)

xdf %>% 
  group_by(Year, Answer) %>% 
  summarise(Total = sum(Total)) %>% 
  mutate(share = if_else(Answer == "No", Total/lead(Total), 1))
#> # A tibble: 4 x 4
#> # Groups:   Year [2]
#>    Year Answer Total share
#>   <dbl> <chr>  <dbl> <dbl>
#> 1  2017 No        10  0.05
#> 2  2017 Yes      200  1   
#> 3  2018 No        40  0.2 
#> 4  2018 Yes      200  1

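Answer 1 (score: 0)

Here is an approach using a custom function. Note that it reads a global data frame named df with lowercase columns year, answer, and total: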

# function calculating the ratios
f1 <- function(k){
   ind.yes <- intersect(which(df$year == df$year[k]),
                        which(df$answer == "yes")
               )
   ind.no <- intersect(which(df$year == df$year[k]),
                       which(df$answer == "no")
             )
   total.yes <- sum(df$total[ind.yes])
   total.no <- sum(df$total[ind.no])

   ratio.no.yes <- total.no/total.yes
   return(ratio.no.yes)
}

# vapplying function f1
ratios <- vapply(1:nrow(df), f1, numeric(1))

# binding the data
df$ratios <- ratios

Here is the result (using a dummy data frame):

df <- data.frame(
                 year = sample(2015:2018, 10, replace = TRUE),
                 answer = sample(c("yes", "no"), 10, replace = TRUE),
                 total = sample(10:200, 10, replace = TRUE),
                 stringsAsFactors = FALSE)
ratios <- vapply(seq_len(nrow(df)), f1, numeric(1))
df$ratios <- ratios

# printing
> df
  year answer total     ratios
1  2015    yes    76 0.08294931
2  2017    yes    43 2.55263158
3  2018    yes    63 0.00000000
4  2016    yes    61 0.83606557
5  2015     no    18 0.08294931
6  2017     no   142 2.55263158
7  2017    yes    33 2.55263158
8  2015    yes   141 0.08294931
9  2016     no    51 0.83606557
10 2017     no    52 2.55263158

Answer 2 (score: 0)

I don't think efficiency matters much here. You can make it a one-liner, though it is hard to read:

DF %>% group_by(Year) %>% mutate(v = 
  (Total / sum(Total[Answer == "Yes"]))^(Answer == "No")
)

The x^cond trick assigns the desired value of 1 whenever Answer != "No", because x^FALSE = x^0 = 1.
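The exponent trick can be checked in isolation with base R. The sketch below uses made-up values (`x` and `is_no` are hypothetical names, not from the answer) and compares the trick against an explicit `ifelse()`:

```r
# Hypothetical values, chosen only to illustrate the exponent trick
x <- c(0.5, 0.05, 0.5)
is_no <- c(FALSE, TRUE, FALSE)

# A logical exponent is coerced to 0 or 1, so x^FALSE is 1 and x^TRUE is x
trick <- x^is_no

# The same result written out explicitly
explicit <- ifelse(is_no, x, 1)

all.equal(trick, explicit)
```

The `ifelse()` form is longer but states the intent directly, which may be preferable when readability matters more than brevity.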