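I have some Python code that uses .groupby and .agg to turn a dataframe into a summary table, and I am having trouble translating it to R. The output I want is the summary table referred to below as Figure 1.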
Imagine the starting dataframe df looks like this:
Y1 Y2 Y3 Sex X1 X2 X3 X4 X5 X6
1 0 1 Male 52 2 7.25 11.40 0.50 2
0 0 0 Female 42 1 2.00 27.00 1.00 2
1 0 1 Male 46 4 0.08 16.20 0.17 3
0 0 0 Female 60 3 5.65 2.00 1.68 1
1 0 1 Male 81 1 1.37 9.20 0.80 0
0 0 0 Female 44 2 0.87 15.40 1.00 0
1 0 1 Male 61 4 0.87 19.40 0.25 2
0 0 0 Female 46 1 2.00 7.20 1.00 1
1 0 1 Male 56 1 7.25 1.40 0.45 2
0 0 0 Female 54 2 2.00 25.20 1.00 3
I would like to be able to turn df into Figure 1 in R. So far, I have worked out that I can use the dplyr package to group and summarize the data:
df %>%
  group_by(Sex) %>%
  summarize(
    m = mean(X1, na.rm=TRUE),
    sd = sd(X1)
  )
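However, this only gives me a summary of the variable X1. I need the data grouped by Y1, Y2, and Y3 as well, with the same statistics for the remaining X variables.
So how can I code this so that the result looks like Figure 1?
FWIW, this is more or less the code I used in Python, but I need it in R.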
import pandas as pd

# Mean and standard deviation of every remaining column, per (Y, Sex) group
Y1_ = df.groupby(['Y1','Sex']).agg(['mean','std']).round(2)
Y2_ = df.groupby(['Y2','Sex']).agg(['mean','std']).round(2)
Y3_ = df.groupby(['Y3','Sex']).agg(['mean','std']).round(2)
frames = [Y1_, Y2_, Y3_]
table1 = pd.concat(frames, keys=['Y1','Y2','Y3'], ignore_index=False)
Answer 0 (score: 2)
Here is your data.
db <- structure(list(Y1 = c(1, 0, 1, 0, 1, 0, 1, 0, 1, 0), Y2 = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), Y3 = c(1, 0, 1, 0, 1, 0, 1, 0, 1,
0), Sex = c("Male", "Female", "Male", "Female", "Male", "Female",
"Male", "Female", "Male", "Female"), X1 = c(52, 42, 46, 60, 81,
44, 61, 46, 56, 54), X2 = c(2, 1, 4, 3, 1, 2, 4, 1, 1, 2), X3 = c(7.25,
2, 0.08, 5.65, 1.37, 0.87, 0.87, 2, 7.25, 2), X4 = c(11.4, 27,
16.2, 2, 9.2, 15.4, 19.4, 7.2, 1.4, 25.2), X5 = c(0.5, 1, 0.17,
1.68, 0.8, 1, 0.25, 1, 0.45, 1), X6 = c(2, 2, 3, 1, 0, 0, 2,
1, 2, 3)), row.names = c(NA, -10L), class = c("tbl_df", "tbl",
"data.frame"))
Here is the code. I think Y1, Y2, and Y3 should be changed from wide format to long format, which is why I use the gather function first.
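library(dplyr)
library(tidyr)

# funs() is deprecated since dplyr 0.8.0; a named list gives the same column names
db_pro1 <- db %>%
  gather(y, value, starts_with("Y")) %>%
  mutate(y_value = paste0(y, "-", value)) %>%
  group_by(y_value, Sex) %>%
  summarise_at(vars(starts_with("X")), list(mean = mean, sd = sd))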
# A tibble: 6 x 14
# Groups: y_value [5]
y_value Sex X1_mean X2_mean X3_mean X4_mean X5_mean X6_mean X1_sd X2_sd X3_sd
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Y1-0 Fema~ 49.2 1.8 2.50 15.4 1.14 1.4 7.56 0.837 1.83
2 Y1-1 Male 59.2 2.4 3.36 11.5 0.434 1.8 13.4 1.52 3.58
3 Y2-0 Fema~ 49.2 1.8 2.50 15.4 1.14 1.4 7.56 0.837 1.83
4 Y2-0 Male 59.2 2.4 3.36 11.5 0.434 1.8 13.4 1.52 3.58
5 Y3-0 Fema~ 49.2 1.8 2.50 15.4 1.14 1.4 7.56 0.837 1.83
6 Y3-1 Male 59.2 2.4 3.36 11.5 0.434 1.8 13.4 1.52 3.58
# ... with 3 more variables: X4_sd <dbl>, X5_sd <dbl>, X6_sd <dbl>
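The reshape-then-group idea above can also be cross-checked without R or pandas. The following is a minimal sketch in plain Python, assuming the ten rows from the question; the names COLUMNS, ROWS, and summarise are hypothetical helpers introduced only for this illustration and are not part of either answer.

```python
from collections import defaultdict
from statistics import mean, stdev

COLUMNS = ["Y1", "Y2", "Y3", "Sex", "X1", "X2", "X3", "X4", "X5", "X6"]
ROWS = [
    (1, 0, 1, "Male", 52, 2, 7.25, 11.40, 0.50, 2),
    (0, 0, 0, "Female", 42, 1, 2.00, 27.00, 1.00, 2),
    (1, 0, 1, "Male", 46, 4, 0.08, 16.20, 0.17, 3),
    (0, 0, 0, "Female", 60, 3, 5.65, 2.00, 1.68, 1),
    (1, 0, 1, "Male", 81, 1, 1.37, 9.20, 0.80, 0),
    (0, 0, 0, "Female", 44, 2, 0.87, 15.40, 1.00, 0),
    (1, 0, 1, "Male", 61, 4, 0.87, 19.40, 0.25, 2),
    (0, 0, 0, "Female", 46, 1, 2.00, 7.20, 1.00, 1),
    (1, 0, 1, "Male", 56, 1, 7.25, 1.40, 0.45, 2),
    (0, 0, 0, "Female", 54, 2, 2.00, 25.20, 1.00, 3),
]


def summarise(rows, columns=COLUMNS):
    """Return {(y_name, y_value, sex): {x_name: (mean, sd)}}."""
    records = [dict(zip(columns, row)) for row in rows]
    y_cols = [c for c in columns if c.startswith("Y")]
    x_cols = [c for c in columns if c.startswith("X")]

    # Wide to long: each record is filed once per Y column,
    # under the key (Y name, Y value, Sex).
    groups = defaultdict(list)
    for rec in records:
        for y in y_cols:
            groups[(y, rec[y], rec["Sex"])].append(rec)

    # statistics.stdev is the sample standard deviation (n - 1 denominator),
    # the same convention as R's sd() and pandas' std().
    return {
        key: {
            x: (mean([r[x] for r in members]), stdev([r[x] for r in members]))
            for x in x_cols
        }
        for key, members in groups.items()
    }


table = summarise(ROWS)
for key in sorted(table):
    x1_mean, x1_sd = table[key]["X1"]
    print(key, round(x1_mean, 2), round(x1_sd, 2))
```

This produces six groups, matching the six rows of the tibble, and the X1 figures agree with the 49.2 / 7.56 (Female) and 59.2 / 13.4 (Male) shown there.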
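Answer 1 (score: 1)
My answer is similar to @juhyeon's, but (1) I keep the Y variable name and its value (Use) in separate columns instead of pasting them together, and (2) I do some renaming so that the output looks more like your example.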
df %>%
  gather(DepVar, Use, 1:3) %>%
  mutate(Use = ifelse(Use == 0, "No", "Yes")) %>%
  group_by(DepVar, Use, Sex) %>%
  summarise_at(vars(starts_with("X")), list(mean = mean, sd = sd)) %>%
  select(DepVar, Use, Sex,
         X1_mean, X1_sd,
         X2_mean, X2_sd,
         X3_mean, X3_sd,
         X4_mean, X4_sd,
         X5_mean, X5_sd,
         X6_mean, X6_sd)
Result:
# A tibble: 6 x 15
# Groups: DepVar, Use [5]
DepVar Use Sex X1_mean X1_sd X2_mean X2_sd X3_mean X3_sd X4_mean X4_sd X5_mean X5_sd X6_mean X6_sd
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Y1 No Female 49.2 7.56 1.8 0.837 2.50 1.83 15.4 10.9 1.14 0.304 1.4 1.14
2 Y1 Yes Male 59.2 13.4 2.4 1.52 3.36 3.58 11.5 6.92 0.434 0.246 1.8 1.10
3 Y2 No Female 49.2 7.56 1.8 0.837 2.50 1.83 15.4 10.9 1.14 0.304 1.4 1.14
4 Y2 No Male 59.2 13.4 2.4 1.52 3.36 3.58 11.5 6.92 0.434 0.246 1.8 1.10
5 Y3 No Female 49.2 7.56 1.8 0.837 2.50 1.83 15.4 10.9 1.14 0.304 1.4 1.14
6 Y3 Yes Male 59.2 13.4 2.4 1.52 3.36 3.58 11.5 6.92 0.434 0.246 1.8 1.10