Question

I have some data that gives me the percentage of people in each group at various levels of education:

library(tibble)

# data_frame() is deprecated; tibble() is its replacement
df <- tibble(group = c("A", "B"),
             no.highschool = c(20, 10),
             high.school = c(70, 40),
             college = c(10, 40),
             graduate = c(0, 10))

df
# A tibble: 2 x 5
  group no.highschool high.school college graduate
  <chr>         <dbl>       <dbl>   <dbl>    <dbl>
1 A               20.         70.     10.       0.
2 B               10.         40.     40.      10.

For example, in group A, 70% of people have a high school education.

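I would like to generate 4 variables that give, for each group, the proportion of people whose education is below each of the 4 education levels (e.g., lessthan_no.highschool, lessthan_high.school, etc.).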

The desired df would be:

desired.df <- data.frame(group = c("A", "B"),
                     no.highschool = c(20, 10),
                     high.school = c(70,40),
                     college = c(10, 40),
                     graduate = c(0,10),
                     lessthan_no.highschool = c(0,0),
                     lessthan_high.school = c(20, 10),
                     lessthan_college = c(90, 50),
                     lessthan_graduate = c(100, 90))

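In my actual data I have many groups and many more levels of education. Of course I could do this one variable at a time, but how could I do it programmatically (and elegantly) using tidyverse tools?

I would start with something like a map() inside a mutate_at(), but where I get tripped up is that the list of variables to sum is different for each new variable. You could pass the new variables and their corresponding variables to sum as two lists to pmap(), but it is not obvious how to generate that second list concisely. I wonder whether there is some kind of nesting solution...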

Answer 1

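Here is a base R solution. Although the question asks for a tidyverse one, I decided to post it given the discussion in the comments on the question.
It uses apply and cumsum to do the hard work. After that, only some cosmetic steps remain before cbind produces the final result.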

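# For each row: cumulative sum, shifted right by one level, as a percentage of the row total
tmp <- apply(df[-1], 1, function(x){
    s <- cumsum(x)
    100*c(0, s[-length(s)])/sum(x)
})
# apply() returns one column per input row, so name the rows and transpose before binding
rownames(tmp) <- paste("lessthan", names(df)[-1], sep = "_")
desired.df <- cbind(df, t(tmp))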

desired.df
#  group no.highschool high.school college graduate lessthan_no.highschool
#1     A            20          70      10        0                      0
#2     B            10          40      40       10                      0
#  lessthan_high.school lessthan_college lessthan_graduate
#1                   20               90               100
#2                   10               50                90
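
To make the shift inside the anonymous function concrete, here is a minimal sketch that applies the same steps to a single row by hand. The vector x simply copies group A's values from the question, and the names shifted and lessthan are only illustrative.

```r
# Group A's percentages, copied from the question's data
x <- c(no.highschool = 20, high.school = 70, college = 10, graduate = 0)

# Running total across the education levels
s <- cumsum(x)

# Drop the last total and prepend 0, so each position holds the sum of all lower levels
shifted <- c(0, s[-length(s)])
names(shifted) <- paste("lessthan", names(x), sep = "_")

# Express as a percentage of the row total
lessthan <- 100 * shifted / sum(x)
lessthan
```

The result matches the lessthan_* values shown for group A in the output above (0, 20, 90, 100).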

Answer 2

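how could I do this programmatically (and elegantly) using tidyverse tools?

The absolute first step is to tidy your data. Encoding information (such as the education level) in column names is not tidy. When you convert education to a factor, make sure the levels are in the right order. I used the order in which they appear in the original data's column names.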

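library(dplyr)  # mutate(), group_by(), lag(), arrange()
library(tidyr)  # gather()

tidy_result = df %>%
  # reshape to long format: one row per group and education level
  gather(key = "education", value = "n", -group) %>%
  mutate(education = factor(education, levels = names(df)[-1])) %>%
  group_by(group) %>%
  # share of each group below the current level (relies on rows being in level order)
  mutate(lessthan_x = lag(cumsum(n), default = 0) / sum(n) * 100) %>%
  arrange(group, education)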
tidy_result
# # A tibble: 8 x 4
# # Groups:   group [2]
#   group education         n lessthan_x
#   <chr> <fct>         <dbl>      <dbl>
# 1 A     no.highschool    20          0
# 2 A     high.school      70         20
# 3 A     college          10         90
# 4 A     graduate          0        100
# 5 B     no.highschool    10          0
# 6 B     high.school      40         10
# 7 B     college          40         50
# 8 B     graduate         10         90

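This gives us a nice, tidy result. If you want to spread / cast this data into your un-tidy desired.df format, I would suggest using data.table::dcast, because (to my knowledge) the tidyverse does not offer a nice way to spread multiple columns. See "Spreading multiple columns with tidyr" or "How can I spread repeated measures of multiple variables into wide format?" for data.table solutions or less elegant tidyr / dplyr versions. Before spreading, you can create the key less_than_x_key = paste("lessthan", education, sep = "_").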
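
As one possible route to the wide layout that stays within the tidyverse, here is a sketch that assumes tidyr 1.0.0 or later, where pivot_longer() and pivot_wider() are available. It is not the approach recommended above: rather than spreading several columns at once, it spreads only lessthan_x and joins the result back onto the original data. The names lessthan_wide and desired are illustrative, and df is rebuilt here so the block runs on its own.

```r
library(dplyr)
library(tidyr)

# Example data copied from the question
df <- tibble(group = c("A", "B"),
             no.highschool = c(20, 10),
             high.school = c(70, 40),
             college = c(10, 40),
             graduate = c(0, 10))

lessthan_wide <- df %>%
  # long format; rows within each group keep the original column order
  pivot_longer(cols = -group, names_to = "education", values_to = "n") %>%
  group_by(group) %>%
  # share of the group below the current education level
  mutate(lessthan_x = lag(cumsum(n), default = 0) / sum(n) * 100) %>%
  ungroup() %>%
  # build the key suggested above, then spread only the lessthan_x values
  mutate(key = paste("lessthan", education, sep = "_")) %>%
  select(group, key, lessthan_x) %>%
  pivot_wider(names_from = key, values_from = lessthan_x)

# Attach the new columns to the original wide data
desired <- left_join(df, lessthan_wide, by = "group")
desired
```

Because only one value column is spread, the multi-column problem does not arise, and the new columns come out in the same order as the education levels.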

Programmatically create new variables that are sums of nested series of other variables
