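I know there are many questions that sound similar in one way or another, but I have not been able to find an answer to this exact problem.
Let's say we have a toy dataset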
library(tidyverse)

df <- tibble(
  Gender = c("m", "f", "f", "m", "m",
             "f", "f", "f", "m", "f"),
  IQ     = rnorm(10, 100, 15),
  Other  = runif(10),
  Test   = rnorm(10),
  group2 = c("A", "A", "A", "A", "A",
             "B", "B", "B", "B", "B")
)
from which we want to compute the `mean`, `min`, and `max` of the numeric variables by `Gender` and `group2`.

For only one group, I can easily write

df %>%
  group_by(Gender) %>%
  select_if(is.numeric) %>%
  gather(Variable, Value, -Gender) %>%
  group_by(Variable, Gender) %>%
  summarise(mean = mean(Value),
            min = min(Value),
            max = max(Value)) %>%
  ungroup()

to get

  Variable Gender    mean     min    max
  <chr>    <chr>    <dbl>   <dbl>  <dbl>
1 IQ       f         99.2    81.9   121.
2 IQ       m         89.0    62.5   106.
3 Other    f        0.301   0.187  0.479
4 Other    m        0.395  0.0483  0.757
5 Test     f      -0.0770   -1.18  0.545
6 Test     m        0.163  -0.632  0.828

But I don't know how to do the same for multiple groups. I know I could use `summarise_*()` like this

df %>%
  group_by(Gender) %>%
  summarise_if(is.numeric, list(mean = mean,
                                min = min,
                                max = max))

but it returns the result in wide format (one column per variable and statistic, e.g. `IQ_mean`), which is almost useless once you have more than 10 variables.

What am I missing here?
Answer 0 (score: 2)
You can do this by adding `gather`, `separate`, and `spread` to your own code:
df %>%
  group_by(Gender, group2) %>%
  summarise_if(is.numeric, list(mean = mean,
                                min = min,
                                max = max)) %>%
  gather(vars, vals, -Gender, -group2) %>%
  separate(vars, c("Variable", "stat")) %>%
  spread(stat, vals)
#### OUTPUT ####
# A tibble: 12 x 6
# Groups:   Gender [2]
   Gender group2 Variable     max    mean      min
   <chr>  <chr>  <chr>      <dbl>   <dbl>    <dbl>
 1 f      A      IQ          110.    103.     95.0
 2 f      A      Other      0.934   0.469  0.00439
 3 f      A      Test        1.39   0.472   -0.446
 4 f      B      IQ          121.    92.0     75.6
 5 f      B      Other      0.730   0.461    0.261
 6 f      B      Test       0.589   0.276   -0.524
 7 m      A      IQ          112.    104.     94.3
 8 m      A      Other      0.827   0.613    0.308
 9 m      A      Test       0.724   0.136   -0.264
10 m      B      IQ          115.    115.     115.
11 m      B      Other      0.970   0.970    0.970
12 m      B      Test       -1.05   -1.05    -1.05
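On newer package versions, the same summarise-then-reshape idea can be sketched with `across()`, `pivot_longer()`, and `pivot_wider()`, which supersede `summarise_if()`, `gather()`, and `spread()`. This is only a sketch under assumptions: it assumes dplyr >= 1.0.0 and tidyr >= 1.0.0, and the `set.seed()` call and the name `long_stats` are illustrative additions that do not come from the thread.

```r
library(dplyr)
library(tidyr)

set.seed(123)  # assumption: fixed seed, only so the sketch is reproducible
df <- tibble(
  Gender = c("m", "f", "f", "m", "m", "f", "f", "f", "m", "f"),
  IQ     = rnorm(10, 100, 15),
  Other  = runif(10),
  Test   = rnorm(10),
  group2 = rep(c("A", "B"), each = 5)
)

long_stats <- df %>%
  group_by(Gender, group2) %>%
  # across() names the new columns "<variable>_<statistic>", e.g. IQ_mean
  summarise(
    across(where(is.numeric), list(mean = mean, min = min, max = max)),
    .groups = "drop"
  ) %>%
  # split "IQ_mean" into Variable = "IQ" and stat = "mean"
  pivot_longer(
    cols = -c(Gender, group2),
    names_to = c("Variable", "stat"),
    names_sep = "_"
  ) %>%
  # one column per statistic, one row per group and variable
  pivot_wider(names_from = stat, values_from = value)

long_stats
```

Splitting on `_` works here only because none of the variable names contains an underscore; otherwise a different separator would have to be set through the `.names` argument of `across()`.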
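Answer 1 (score: 1)
You can first gather `IQ`, `Other`, and `Test` into a single variable column, i.e. convert `df` to long format, and then compute the summary statistics for each group (including the `group2` variable).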
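A sketch of the gather-first approach described in this answer, assuming it corresponds to the `tidyverse2` expression benchmarked in the next answer. The `set.seed()` call and the name `long_stats` are illustrative additions.

```r
library(dplyr)
library(tidyr)

set.seed(123)  # assumption: fixed seed, only so the sketch is reproducible
df <- tibble(
  Gender = c("m", "f", "f", "m", "m", "f", "f", "f", "m", "f"),
  IQ     = rnorm(10, 100, 15),
  Other  = runif(10),
  Test   = rnorm(10),
  group2 = rep(c("A", "B"), each = 5)
)

long_stats <- df %>%
  # long format first: one row per observation and variable
  gather(key = "variable", value = "value", -c(Gender, group2)) %>%
  # then summarise within every Gender / group2 / variable combination
  group_by(Gender, group2, variable) %>%
  summarize_at("value", list(mean = mean, min = min, max = max)) %>%
  ungroup()

long_stats
```

Because the data are already long when the statistics are computed, no `separate()` and `spread()` step is needed afterwards.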
Answer 2 (score: 0)
Here is a `data.table` approach:
library(data.table)

melt(setDT(df), id.vars = c("Gender", "group2"))[
  , .(max  = max(value, na.rm = TRUE),
      min  = min(value, na.rm = TRUE),
      mean = mean(value, na.rm = TRUE)),
  by = .(Gender, group2, variable)][]
# Gender group2 variable max min mean
# 1: m A IQ 120.739562935 83.46037366 96.99412720
# 2: f A IQ 98.657598754 98.43677811 98.54718843
# 3: f B IQ 111.973534436 71.38605822 94.04719457
# 4: m B IQ 102.913093964 102.91309396 102.91309396
# 5: m A Other 0.861929066 0.51651983 0.66098944
# 6: f A Other 0.752484881 0.07648229 0.41448359
# 7: f B Other 0.463524836 0.18308752 0.33301693
# 8: m B Other 0.099740011 0.09974001 0.09974001
# 9: m A Test 1.159379020 -0.83569116 0.04268551
# 10: f A Test -0.009017293 -0.77245300 -0.39073515
# 11: f B Test 1.591132150 -0.99248570 -0.24997246
# 12: m B Test 1.654489766 1.65448977 1.65448977
# Benchmark of the three approaches
microbenchmark::microbenchmark(
  data.table = {
    DT <- copy(df)  # copy so that setDT() leaves df untouched
    melt(setDT(DT), id.vars = c("Gender", "group2"))[
      , .(max  = max(value, na.rm = TRUE),
          min  = min(value, na.rm = TRUE),
          mean = mean(value, na.rm = TRUE)),
      by = .(Gender, group2, variable)][]
  },
  tidyverse1 = {
    DT <- copy(df)  # DT is not used below
    df %>%
      group_by(Gender, group2) %>%
      summarise_if(is.numeric, list(mean = mean,
                                    min = min,
                                    max = max)) %>%
      gather(vars, vals, -Gender, -group2) %>%
      separate(vars, c("Variable", "stat")) %>%
      spread(stat, vals)
  },
  tidyverse2 = {
    df %>%
      gather(key = "variable", value = "value", -c(Gender, group2)) %>%
      group_by(Gender, group2, variable) %>%
      summarize_at("value", list(mean = mean, min = min, max = max)) %>%
      ungroup()
  },
  times = 10
)
# Unit: milliseconds
#        expr       min        lq      mean    median        uq       max neval
#  data.table  1.498788  1.819936  1.997320  1.980358  2.218809  2.413124    10
#  tidyverse1 11.263956 11.887270 12.421442 11.963340 12.484075 15.401816    10
#  tidyverse2  4.952477  5.185053  6.303103  6.001478  6.902558  9.663341    10
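One detail worth keeping in mind: `setDT()` converts its argument to a data.table by reference, which is why the benchmark works on `copy(df)`. A minimal sketch of a variant that avoids the side effect by using `as.data.table()`, which returns a copy. The plain `data.frame` input, the `set.seed()` call, and the names `long` and `res` are illustrative assumptions.

```r
library(data.table)

set.seed(123)  # assumption: fixed seed, only so the sketch is reproducible
df <- data.frame(
  Gender = c("m", "f", "f", "m", "m", "f", "f", "f", "m", "f"),
  IQ     = rnorm(10, 100, 15),
  Other  = runif(10),
  Test   = rnorm(10),
  group2 = rep(c("A", "B"), each = 5),
  stringsAsFactors = FALSE
)

# as.data.table() copies, so df itself remains a plain data.frame
long <- melt(as.data.table(df), id.vars = c("Gender", "group2"))

# one row per Gender / group2 / variable combination
res <- long[, .(max  = max(value, na.rm = TRUE),
                min  = min(value, na.rm = TRUE),
                mean = mean(value, na.rm = TRUE)),
            by = .(Gender, group2, variable)]

class(df)  # still "data.frame"
res
```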