I have the following data frame with 3 variables and a number of observations:
data <- read.table(text="
YEAR SECTOR VALUE
2016 A 2
2016 A 5
2016 A 10
2016 A 20
2016 A 50
2016 A 100
2016 A 200
2016 A 300
2016 B 20
2016 B 50
2016 B 100
2016 B 200
2016 B 500
2016 B 1000
2016 B 2000
2016 B 3000
2017 A 21
2017 A 51
2017 A 101
2017 A 201
2017 A 501
2017 A 1001
2017 A 2001
2017 A 3001
2017 B 201
2017 B 501
2017 B 1001
2017 B 2001
2017 B 5001
2016 B 10001
2017 B 20001
2017 B 30001",
header=TRUE)
I want to compute the first quartile, the median, and the third quartile within each combination of YEAR and SECTOR.
To be clear, the first quartile for SECTOR A and YEAR 2016 should come out as 5, based on the values (2, 5, 10, 20, 50, 100, 200, 300).
Answer 0 (score: 0)
One option is to group by YEAR and SECTOR, store the relevant subset of the fivenum output in a tibble, unnest it, and then spread it into "wide" format.
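library(dplyr)
library(tidyr)

df1 %>%
  group_by(YEAR, SECTOR) %>%
  # group_modify() is the current name for what group_map() did in dplyr 0.8.0:
  # it applies a function to each group and returns a grouped tibble
  group_modify(~ .x %>%
    summarise(val = list(tibble(categ = c('1st quart', 'median', '3rd quart'),
                                val = fivenum(VALUE)[2:4])))) %>%
  unnest(val) %>%      # expand the nested tibble into categ / val columns
  spread(categ, val)   # one column per statistic ("wide" format)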
# A tibble: 4 x 5
# Groups: YEAR, SECTOR [4]
# YEAR SECTOR `1st quart` `3rd quart` median
# <int> <chr> <dbl> <dbl> <dbl>
#1 2016 A 7.5 150 35
#2 2016 B 100 2000 500
#3 2017 A 76 1501 351
#4 2017 B 751 12501 2001
df1 <- structure(list(YEAR = c(2016L, 2016L, 2016L, 2016L, 2016L, 2016L,
2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L,
2016L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L,
2017L, 2017L, 2017L, 2017L, 2017L, 2016L, 2017L, 2017L), SECTOR = c("A",
"A", "A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B",
"B", "B", "A", "A", "A", "A", "A", "A", "A", "A", "B", "B", "B",
"B", "B", "B", "B", "B"), VALUE = c(2L, 5L, 10L, 20L, 50L, 100L,
200L, 300L, 20L, 50L, 100L, 200L, 500L, 1000L, 2000L, 3000L,
21L, 51L, 101L, 201L, 501L, 1001L, 2001L, 3001L, 201L, 501L,
1001L, 2001L, 5001L, 10001L, 20001L, 30001L)), class = "data.frame",
row.names = c(NA,
-32L))
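A note on the numbers in the output above: fivenum() returns Tukey's hinges, which are not always equal to the quartiles that quantile() computes by default. The following minimal sketch uses only the YEAR 2016 / SECTOR A values from the question to show the difference; the variable names `x`, `hinges`, and `quartiles` are illustrative.

```r
# VALUE entries for YEAR 2016, SECTOR A (copied from the question)
x <- c(2, 5, 10, 20, 50, 100, 200, 300)

# Tukey's hinges: for these 8 values, the medians of the lower and upper halves
hinges <- fivenum(x)[2:4]

# Default sample quantiles (type = 7): linear interpolation between order statistics
quartiles <- unname(quantile(x, probs = c(0.25, 0.5, 0.75)))

hinges     # lower hinge 7.5, median 35, upper hinge 150
quartiles  # Q1 8.75, median 35, Q3 125
```

The medians agree, but the first and third values differ, which is why this answer reports 7.5 and 150 for that group while the quantile()-based answers report 8.75 and 125.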
Answer 1 (score: 0)
How about this?
library(dplyr)

data %>%
  group_by(SECTOR, YEAR) %>%
  # summary() returns Min., 1st Qu., Median, Mean, 3rd Qu., Max. in that order
  summarise(median = summary(VALUE)[3],
            q1 = summary(VALUE)[2],
            q3 = summary(VALUE)[5])
Note, however, that according to summary(), the first quartile for the example you gave is 8.75, not 5.
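The gap between 5 and 8.75 comes from the quantile definition in use. As a minimal sketch, assuming the asker wants a quartile that is always one of the observed values, the `type` argument of quantile() can be switched from the default (type 7, linear interpolation) to type 1 (inverse of the empirical distribution function). The variable names here are illustrative.

```r
# VALUE entries for YEAR 2016, SECTOR A (copied from the question)
x <- c(2, 5, 10, 20, 50, 100, 200, 300)

# type = 7 is the default for quantile(); summary() reports the same value
q1_default <- quantile(x, probs = 0.25, type = 7, names = FALSE)

# type = 1 picks an observed value, with no interpolation between data points
q1_type1 <- quantile(x, probs = 0.25, type = 1, names = FALSE)

q1_default  # 8.75
q1_type1    # 5
```

If type 1 is indeed the intended definition, the same `type = 1` argument could be passed to quantile() inside any of the grouped computations in the other answers.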
Answer 2 (score: 0)
probs = c(0.25, 0.5, 0.75)
# Build one aggregate() table per probability, then merge them on the grouping columns
ans = Reduce(function(x1, x2) merge(x1, x2, by = c("YEAR", "SECTOR")),
             lapply(probs, function(p)
                 aggregate(x = setNames(list(df1$VALUE), paste0("Q_", p)),
                           by = df1[c("YEAR", "SECTOR")],
                           FUN = function(x) quantile(x, probs = p))))
ans
# YEAR SECTOR Q_0.25 Q_0.5 Q_0.75
#1 2016 A 8.75 35 125
#2 2016 B 100.00 500 2000
#3 2017 A 88.50 351 1251
#4 2017 B 751.00 2001 12501
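The same table can probably be obtained with a single aggregate() call instead of one call per probability followed by merge(). This is a sketch of a swapped-in variant, not the answerer's code: it uses the formula interface, lets quantile() return all three values at once, and then flattens the resulting matrix column. The data frame is rebuilt here so the block runs on its own, and the name `res` is illustrative.

```r
# Rebuild the example data (same 32 rows as df1 above)
df1 <- data.frame(
  YEAR   = c(rep(2016L, 16), rep(2017L, 13), 2016L, 2017L, 2017L),
  SECTOR = rep(c("A", "B", "A", "B"), each = 8),
  VALUE  = c(2, 5, 10, 20, 50, 100, 200, 300,
             20, 50, 100, 200, 500, 1000, 2000, 3000,
             21, 51, 101, 201, 501, 1001, 2001, 3001,
             201, 501, 1001, 2001, 5001, 10001, 20001, 30001)
)

# One aggregate() call: FUN returns three values per group,
# which aggregate() stores as a single matrix column named VALUE
res <- aggregate(VALUE ~ YEAR + SECTOR, data = df1,
                 FUN = quantile, probs = c(0.25, 0.5, 0.75))

# Flatten the matrix column into three ordinary columns and rename them
res <- do.call(data.frame, res)
names(res)[3:5] <- c("Q_0.25", "Q_0.5", "Q_0.75")
res
```

Row order may differ from the merged result, since aggregate() varies the first grouping variable fastest, but the values per group should be the same.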
Answer 3 (score: 0)
Another approach is to use the quantile() function together with dplyr:
library(dplyr)

data %>%
  group_by(SECTOR, YEAR) %>%
  # Request each probability explicitly; quantile(VALUE)[1] would be the minimum (0%)
  summarize(q1 = quantile(VALUE, 0.25),
            median = quantile(VALUE, 0.5),
            q3 = quantile(VALUE, 0.75))
## SECTOR YEAR q1 median q3
## <fct> <int> <dbl> <dbl> <dbl>
## 1 A 2016 8.75 35 125
## 2 A 2017 88.5 351 1251
## 3 B 2016 100 500 2000
## 4 B 2017 751 2001 12501