I have a dataset like this:
library(tibble)

df <- tribble(
~id, ~price, ~number_of_book,
"1", 10, 3,
"1", 5, 1,
"2", 7, 4,
"2", 6, 2,
"2", 3, 4,
"3", 4, 1,
"4", 5, 1,
"4", 6, 1,
"5", 1, 2,
"5", 9, 3,
)
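As the dataset shows, for id "1" there are 3 books priced at $10 each and 1 book priced at $5. Essentially, I want to see, for each id, what percentage of its books falls into each price band. This is the output I'm after:
desired <- tribble(
~id, ~less_than_three, ~`three-five`, ~`five-six`, ~more_than_six,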
"1", "0%", "25%", "0%", "75%",
"2", "0%", "40%", "20%", "40%",
"3", "0%", "100%", "0%", "0%",
"4", "0%", "50%", "50%", "0%",
"5", "40%", "0%", "0%", "60%",
)
First, I bin the prices. To do that, I run the following code:
out <- cut(df$price, breaks = c(0, 3, 5, 6, 10),
labels = c("<3","3-5","5-6", ">6"))
out = table(out) / sum(table(out))
Unfortunately, with my limited coding knowledge I can't get any further. Could you help me get the data I need?
Answer 0 (score: 3)
We can use cut to get the intervals, then tidyr to reshape the data into wide format, and finally janitor to add the percentages.
library(dplyr)
library(tidyr)
library(janitor)
df %>%
mutate(interval = cut(price, c(0,3,5,6,Inf))) %>%
select(-price) %>%
pivot_wider(names_from = interval, values_from = number_of_book) %>%
adorn_percentages()
#> id (6,Inf] (3,5] (5,6] (0,3]
#> 1 0.75 0.25 NA NA
#> 2 0.40 NA 0.2 0.4
#> 3 NA 1.00 NA NA
#> 4 NA 0.50 0.5 NA
#> 5 0.60 NA NA 0.4
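The output above still shows NA for empty price bands and decimals rather than percentages. A minimal sketch of one way to close that gap, assuming a tidyr version that accepts a scalar values_fill and a function for values_fn (1.1.0 or later), and using janitor's adorn_pct_formatting; the name result is only illustrative:

```r
library(dplyr)
library(tidyr)
library(janitor)

# Same data as in the question
df <- tibble::tribble(
  ~id, ~price, ~number_of_book,
  "1", 10, 3,
  "1", 5, 1,
  "2", 7, 4,
  "2", 6, 2,
  "2", 3, 4,
  "3", 4, 1,
  "4", 5, 1,
  "4", 6, 1,
  "5", 1, 2,
  "5", 9, 3
)

result <- df %>%
  mutate(interval = cut(price, c(0, 3, 5, 6, Inf))) %>%
  select(-price) %>%
  pivot_wider(
    names_from = interval,
    values_from = number_of_book,
    values_fn = sum,   # add up books if an id has several rows in one band
    values_fill = 0    # empty bands become 0 instead of NA
  ) %>%
  adorn_percentages("row") %>%       # share of each band within the id
  adorn_pct_formatting(digits = 0)   # format as whole-number percentages

result
```

values_fn = sum is not needed for this particular dataset, since every (id, interval) pair occurs only once, but without it duplicated pairs would produce list-columns.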
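Answer 1 (score: 1)
With dplyr, you can add a column cols that will serve as the column names. You then sum the number of books for each cols value within each id, compute the percentages by dividing those sums by the id's total, and apply scales::percent to format the result as a percentage rather than a decimal. After that, pivot_wider only needs the variables to take the names and values from, and the columns are reordered to match the original label order. (This is a bit more involved than the other answer because it allows for more than one row per (id, cols/interval) pair, and janitor simplifies things in the other answer.)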
labels = c("less_than_three","three_to_five","five_to_six", "more_than_six")
df %>%
group_by(id, cols = cut(price, breaks = c(0, 3, 5, 6, 10), labels = labels)) %>%
summarise(n = sum(number_of_book)) %>%
group_by(id) %>%
mutate(pct = scales::percent(n/sum(n), 1)) %>%
pivot_wider(id_cols = id, names_from = cols, values_from = pct) %>%
select_at(c('id', labels)) %>%
ungroup
# # A tibble: 5 x 5
# id less_than_three three_to_five five_to_six more_than_six
# <chr> <chr> <chr> <chr> <chr>
# 1 1 NA 25% NA 75%
# 2 2 40% NA 20% 40%
# 3 3 NA 100% NA NA
# 4 4 NA 50% 50% NA
# 5 5 40% NA NA 60%
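If you want to replace NA with 0% (which I think makes sense here and matches the output shown in the question), you can use the approach mentioned in the comments below, passing values_fill to pivot_wider.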
df %>%
group_by(id, cols = cut(price, breaks = c(0, 3, 5, 6, 10), labels = labels)) %>%
summarise(n = sum(number_of_book)) %>%
group_by(id) %>%
mutate(pct = scales::percent(n/sum(n), 1)) %>%
pivot_wider(id_cols = id, names_from = cols, values_from = pct,
values_fill = list(pct = '0%')) %>%
select_at(c('id', labels)) %>%
ungroup
# # A tibble: 5 x 5
# id less_than_three three_to_five five_to_six more_than_six
# <chr> <chr> <chr> <chr> <chr>
# 1 1 0% 25% 0% 75%
# 2 2 40% 0% 20% 40%
# 3 3 0% 100% 0% 0%
# 4 4 0% 50% 50% 0%
# 5 5 40% 0% 0% 60%
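As a cross-check that needs no extra packages, the same table can be sketched in base R with xtabs and prop.table. This is a swapped-in alternative, not the method of either answer; the names counts, shares and pct are illustrative:

```r
# Same data as in the question, as a plain data.frame
df <- data.frame(
  id = c("1", "1", "2", "2", "2", "3", "4", "4", "5", "5"),
  price = c(10, 5, 7, 6, 3, 4, 5, 6, 1, 9),
  number_of_book = c(3, 1, 4, 2, 4, 1, 1, 1, 2, 3)
)

labels <- c("less_than_three", "three_to_five", "five_to_six", "more_than_six")

# cut() closes intervals on the right by default, so a price of exactly 3
# is counted in the first band, (0, 3]
df$interval <- cut(df$price, breaks = c(0, 3, 5, 6, Inf), labels = labels)

# Sum number_of_book per (id, interval); unused bands are filled with 0
counts <- xtabs(number_of_book ~ id + interval, data = df)

# Row-wise proportions: each id's row sums to 1
shares <- prop.table(counts, margin = 1)

# Format as whole-number percentage strings, keeping row and column names
pct <- matrix(
  paste0(round(100 * shares), "%"),
  nrow = nrow(shares),
  dimnames = dimnames(shares)
)

pct
```

Because interval is a factor with all four levels, xtabs keeps every band as a column even when an id has no books in it, so no separate fill step is required.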