Divide columns by a specific value using dplyr

Time: 2018-03-21 10:55:19

Tags: r dplyr

I have a data frame like this:

 Setting   q02_id c_school c_home c_work c_transport c_leisure Country
   Rural 11900006        0      5      3           1         1 Vietnam
   Rural 11900031       10      5      0           0         0 China
   Rural 11900033        0      3      0           0         3 Vietnam
   Rural 11900053        0      7      2           0         0 Vietnam
   Rural 11900114        3      6      0           0         0 Malaysia
   Rural 11900446        0      6      0           0         0 Vietnam

I want to divide columns 2, 3, 4, 5, 6 by the total for that specific country.

Doing this in base R is a bit clunky:

df[df$Country=="Vietnam",][c(3, 4, 5, 6)] = df[df$Country=="Vietnam",][c(3, 4, 5, 6)] / sum(df[df$Country=="Vietnam",][c(3, 4, 5, 6)])

(which I think works).
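For comparison, the base R idea can be sketched so that it handles every country at once instead of one hard-coded country. This is only a sketch under assumptions: it uses columns 3:7 (all five `c_*` columns, whereas the code above uses 3:6), and the names `cols`, `row_totals` and `country_totals` are made up for illustration.

```r
# Sample data from the question, rebuilt by hand
df <- data.frame(
  Setting = "Rural",
  q02_id = c(11900006, 11900031, 11900033, 11900053, 11900114, 11900446),
  c_school = c(0, 10, 0, 0, 3, 0),
  c_home = c(5, 5, 3, 7, 6, 6),
  c_work = c(3, 0, 0, 2, 0, 0),
  c_transport = c(1, 0, 0, 0, 0, 0),
  c_leisure = c(1, 0, 3, 0, 0, 0),
  Country = c("Vietnam", "China", "Vietnam", "Vietnam", "Malaysia", "Vietnam")
)

cols <- 3:7                          # assumption: all five c_* columns
row_totals <- rowSums(df[cols])      # total per row
# ave() returns, for each row, the sum of row_totals within that row's country
country_totals <- ave(row_totals, df$Country, FUN = sum)
# dividing a data frame by a vector of length nrow(df) works row by row
df[cols] <- df[cols] / country_totals
```

After this, the values in columns 3:7 should add up to 1 within each country.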

I'm trying to convert as much of my code as possible to tidyverse functions. Is there a way to do the same thing more efficiently with dplyr?

Thanks.

2 Answers:

Answer 0: (score: 0)

I believe this is what you are after:

Divide each column by the sum of that column, grouped by Country:
library(tidyverse)
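df1 %>%
  group_by(Country) %>%
  # divide each column by its own sum within the Country group
  # (note: funs() has been deprecated since dplyr 0.8.0)
  mutate_at(vars(c_school:c_leisure), funs(. / sum(.)))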
#output
  Setting   q02_id c_school c_home  c_work c_transport c_leisure Country 
  <fct>      <int>    <dbl>  <dbl>   <dbl>       <dbl>     <dbl> <fct>   
1 Rural   11900006   NaN     0.238   0.600        1.00     0.250 Vietnam 
2 Rural   11900031     1.00  1.00  NaN          NaN      NaN     China   
3 Rural   11900033   NaN     0.143   0            0        0.750 Vietnam 
4 Rural   11900053   NaN     0.333   0.400        0        0     Vietnam 
5 Rural   11900114     1.00  1.00  NaN          NaN      NaN     Malaysia
6 Rural   11900446   NaN     0.286   0            0        0     Vietnam 
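The `NaN` entries come from 0/0, where a column sums to zero within a country. On newer dplyr versions (1.0.0 or later, where `across()` is available) the same per-column division could be written roughly as follows. This is a sketch rather than part of the original answer; the data is rebuilt by hand and `per_column` is an illustrative name.

```r
library(dplyr)

# Sample data from the question, rebuilt by hand
df1 <- data.frame(
  Setting = "Rural",
  q02_id = c(11900006, 11900031, 11900033, 11900053, 11900114, 11900446),
  c_school = c(0, 10, 0, 0, 3, 0),
  c_home = c(5, 5, 3, 7, 6, 6),
  c_work = c(3, 0, 0, 2, 0, 0),
  c_transport = c(1, 0, 0, 0, 0, 0),
  c_leisure = c(1, 0, 3, 0, 0, 0),
  Country = c("Vietnam", "China", "Vietnam", "Vietnam", "Malaysia", "Vietnam")
)

per_column <- df1 %>%
  group_by(Country) %>%
  # divide each selected column by its own within-country sum
  mutate(across(c_school:c_leisure, ~ .x / sum(.x))) %>%
  ungroup()
```

Calling `ungroup()` at the end is a design choice so that later operations on `per_column` are not silently grouped.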

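Or divide each column by the overall total for each country (as in your example). The only difference is that I used columns 3:7, since I believe that is what you intended.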

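df1 %>%
  # row total across the five c_* columns
  mutate(sum = rowSums(.[,3:7])) %>%
  group_by(Country) %>%
  # sum(sum) is the country total: the sum of the row totals within the group
  mutate_at(vars(c_school:c_leisure), funs(. / sum(sum))) %>%
  select(-sum)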
#output
  Setting   q02_id c_school c_home c_work c_transport c_leisure Country 
  <fct>      <int>    <dbl>  <dbl>  <dbl>       <dbl>     <dbl> <fct>   
1 Rural   11900006    0     0.161  0.0968      0.0323    0.0323 Vietnam 
2 Rural   11900031    0.667 0.333  0           0         0      China   
3 Rural   11900033    0     0.0968 0           0         0.0968 Vietnam 
4 Rural   11900053    0     0.226  0.0645      0         0      Vietnam 
5 Rural   11900114    0.333 0.667  0           0         0      Malaysia
6 Rural   11900446    0     0.194  0           0         0      Vietnam 
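A possible `across()` version of this second approach (assuming dplyr 1.0.0 or later), followed by a quick check that the shares add up to 1 within each country. The names `row_total`, `country_share` and `totals` are illustrative and not from the original answer.

```r
library(dplyr)

# Sample data from the question, rebuilt by hand
df1 <- data.frame(
  Setting = "Rural",
  q02_id = c(11900006, 11900031, 11900033, 11900053, 11900114, 11900446),
  c_school = c(0, 10, 0, 0, 3, 0),
  c_home = c(5, 5, 3, 7, 6, 6),
  c_work = c(3, 0, 0, 2, 0, 0),
  c_transport = c(1, 0, 0, 0, 0, 0),
  c_leisure = c(1, 0, 3, 0, 0, 0),
  Country = c("Vietnam", "China", "Vietnam", "Vietnam", "Malaysia", "Vietnam")
)

country_share <- df1 %>%
  # helper column: total of the five c_* columns in each row
  mutate(row_total = rowSums(across(c_school:c_leisure))) %>%
  group_by(Country) %>%
  # sum(row_total) is the grand total for the current country
  mutate(across(c_school:c_leisure, ~ .x / sum(row_total))) %>%
  ungroup() %>%
  select(-row_total)

# Sanity check: the shares within each country should sum to 1
totals <- tapply(rowSums(country_share[3:7]), country_share$Country, sum)
```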

Data:

df1 = read.table(text ="Setting   q02_id c_school c_home c_work c_transport c_leisure Country
  Rural 11900006        0      5      3           1         1 Vietnam
  Rural 11900031       10      5      0           0         0 China
  Rural 11900033        0      3      0           0         3 Vietnam
  Rural 11900053        0      7      2           0         0 Vietnam
  Rural 11900114        3      6      0           0         0 Malaysia
  Rural 11900446        0      6      0           0         0 Vietnam", header = T)

Answer 1: (score: 0)

I know you asked for tidyverse functions, but this is also a task where the data.table package shines:

library(data.table)
setDT(df)
df[, lapply(.SD, function(x) x / sum(x)), by = Country, .SDcols = 3:7]

    Country c_school    c_home c_work c_transport c_leisure
1:  Vietnam      NaN 0.2380952    0.6           1      0.25
2:  Vietnam      NaN 0.1428571    0.0           0      0.75
3:  Vietnam      NaN 0.3333333    0.4           0      0.00
4:  Vietnam      NaN 0.2857143    0.0           0      0.00
5:    China        1 1.0000000    NaN         NaN       NaN
6: Malaysia        1 1.0000000    NaN         NaN       NaN
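Note that the result above keeps only `Country` and the `.SDcols` columns, and the rows come back ordered by group. If `Setting` and `q02_id` should be kept as well, one option is to update the columns by reference with `:=`. This is a sketch, not part of the original answer; `dt` and `cols` are illustrative names.

```r
library(data.table)

# Sample data from the question, rebuilt by hand
dt <- data.table(
  Setting = "Rural",
  q02_id = c(11900006, 11900031, 11900033, 11900053, 11900114, 11900446),
  c_school = c(0, 10, 0, 0, 3, 0),
  c_home = c(5, 5, 3, 7, 6, 6),
  c_work = c(3, 0, 0, 2, 0, 0),
  c_transport = c(1, 0, 0, 0, 0, 0),
  c_leisure = c(1, 0, 3, 0, 0, 0),
  Country = c("Vietnam", "China", "Vietnam", "Vietnam", "Malaysia", "Vietnam")
)

cols <- c("c_school", "c_home", "c_work", "c_transport", "c_leisure")

# Make sure the columns are double first: a grouped := into an integer column
# (which is what read.table() would produce here) would coerce the fractions
# back to integer
dt[, (cols) := lapply(.SD, as.numeric), .SDcols = cols]

# Divide each column by its within-country sum, modifying dt in place
dt[, (cols) := lapply(.SD, function(x) x / sum(x)), by = Country, .SDcols = cols]
```

With `:=` the original row order and all other columns are preserved.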