I have a dataset that looks like this:
df <- tribble(
~id, ~price, ~type, ~number_of_book,
"1", 10, "X", 3,
"1", 2, "X", 1,
"1", 5, "Y", 1,
"2", 7, "X", 4,
"2", 6, "X", 1,
"2", 6, "Y", 2,
"3", 2, "X", 4,
"3", 8, "X", 2,
"3", 1, "Y", 4,
"3", 9, "Y", 5,
)
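Now I want to answer this question: for each id and each chosen price group, what percentage of the books are type X and what percentage are type Y? In other words, what is the distribution of book types for each id and price group?
To do this, I first need to picture the aggregated dataset in my mind: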
agg_df <- tribble(
~type, ~id, ~less_than_two, ~`two-five`, ~`five-six`, ~more_than_six,
"X", "1", 1, 0, 0, 3,
"Y", "1", 0, 1, 0, 0,
"X", "2", 0, 0, 1, 4,
"Y", "2", 0, 0, 2, 2,
"X", "3", 4, 0, 0, 2,
"Y", "3", 4, 0, 0, 5,
)
And then, this is the dataset I want:
desired_df <- tribble(
~type, ~id, ~less_than_two, ~`three-five`, ~`five-six`, ~more_than_six,
"X", "1", "100%", "0%", "0%", "100%",
"Y", "1", "0%", "100%", "0%", "0%",
"X", "2", "0%", "0%", "33.3%", "66.6%",
"Y", "2", "0%", "0%", "66.6%", "33.3%",
"X", "3", "50%", "0%", "0%", "28.5%",
"Y", "3", "50%", "0%", "0%", "71.4%",
)
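This desired dataset shows me that when id is "3" and the price bin is over $6, there are two books of type X but five books of type Y. So the distribution is X (28.5%) and Y (71.4%).
Note: I asked a similar question here, but this one is more complex and I could not work it out: How to manipulate (aggregate) the data in R?
I would appreciate it if you could help me. Thanks in advance.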
Answer 0: (score: 2)
We can use cut on the 'price' column to create a bin group, group by 'id' and 'grp', create the percentage by dividing 'number_of_book' by the sum of 'number_of_book', and then reshape to 'wide' format.
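library(dplyr)
library(tidyr)
df %>%
  # Bin price into right-closed intervals: (-Inf,2], (2,5], (5,6], (6,Inf).
  # df is ungrouped here, so the deprecated add = TRUE argument is not needed.
  group_by(id, grp = cut(price, breaks = c(-Inf, 2, 5, 6, Inf),
      labels = c('less_than_two', 'three-five', 'five-six', 'more_than_six'))) %>%
  # Share of books within each id and price bin
  mutate(Perc = 100 * number_of_book / sum(number_of_book)) %>%
  select(-price, -number_of_book) %>%
  # Row index within each group, so every row keeps its own key when pivoting
  mutate(rn = row_number()) %>%
  pivot_wider(names_from = grp, values_from = Perc, values_fill = list(Perc = 0)) %>%
  select(-rn)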
# A tibble: 6 x 6
# Groups: id [3]
# id type more_than_six less_than_two `three-five` `five-six`
# <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#1 1 X 100 100 0 0
#2 1 Y 0 0 100 0
#3 2 X 100 0 0 33.3
#4 2 Y 0 0 0 66.7
#5 3 X 28.6 50 0 0
#6 3 Y 71.4 50 0 0
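The result above holds numeric percentages, while desired_df in the question uses strings such as "28.5%". The following is a minimal sketch of one way to convert them, assuming dplyr >= 1.0 for across(). The name wide and its hand-typed values (copied from the printed output above) are illustrative only, not part of the original answer. Note that round() yields 28.6%, not the truncated 28.5% shown in the question.

```r
library(dplyr)
library(tibble)

# Illustrative stand-in for the pivoted result, values copied from the printed output
wide <- tribble(
  ~id, ~type, ~more_than_six, ~less_than_two, ~`three-five`, ~`five-six`,
  "1", "X", 100,  100, 0,   0,
  "1", "Y", 0,    0,   100, 0,
  "2", "X", 100,  0,   0,   33.3,
  "2", "Y", 0,    0,   0,   66.7,
  "3", "X", 28.6, 50,  0,   0,
  "3", "Y", 71.4, 50,  0,   0
)

# Turn every numeric column into a percent string, rounded to one decimal place
formatted <- wide %>%
  mutate(across(where(is.numeric), ~ paste0(round(.x, 1), "%")))

print(formatted)
```

After this step the bin columns are character, so it is best applied last, once all arithmetic is done.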
Answer 1: (score: 2)
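We can use findInterval to divide price into different groups, sum number_of_book for each id, type and price_group, and then calculate the ratio for each id and price_group. Finally, we reshape the data to a wider format.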
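The code for this answer did not survive in the text, so what follows is only a sketch of the steps described, not the answerer's original code. The bin edges and labels are assumptions chosen to match the cut() breaks used in the first answer, and the names breaks, labels, books, ratio and result are illustrative. It assumes dplyr >= 1.0 (for .groups) and R >= 3.3 (for left.open).

```r
library(dplyr)
library(tidyr)

df <- tribble(
  ~id, ~price, ~type, ~number_of_book,
  "1", 10, "X", 3,
  "1", 2,  "X", 1,
  "1", 5,  "Y", 1,
  "2", 7,  "X", 4,
  "2", 6,  "X", 1,
  "2", 6,  "Y", 2,
  "3", 2,  "X", 4,
  "3", 8,  "X", 2,
  "3", 1,  "Y", 4,
  "3", 9,  "Y", 5
)

# Assumed bin edges and labels (mirroring the cut() call in the first answer)
breaks <- c(2, 5, 6)
labels <- c("less_than_two", "three-five", "five-six", "more_than_six")

result <- df %>%
  # left.open = TRUE gives right-closed bins: (-Inf,2], (2,5], (5,6], (6,Inf)
  # findInterval() returns 0..3 here, so add 1 to index into labels
  mutate(price_group = labels[findInterval(price, breaks, left.open = TRUE) + 1]) %>%
  # Total books per id, type and price bin
  group_by(id, type, price_group) %>%
  summarise(books = sum(number_of_book), .groups = "drop") %>%
  # Percentage of each type within an id and price bin
  group_by(id, price_group) %>%
  mutate(ratio = 100 * books / sum(books)) %>%
  ungroup() %>%
  select(-books) %>%
  pivot_wider(names_from = price_group, values_from = ratio,
              values_fill = list(ratio = 0))

print(result)
```

Summing before computing the ratio means that several rows with the same id, type and bin collapse into one, so the result has exactly one row per id and type.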
Answer 2: (score: 1)
Perhaps not a perfect solution, but another approach is to use case_when to define the different categories.
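The answerer's code is missing from the text, so here is a minimal sketch of how a case_when approach might look. The thresholds and labels are assumptions taken from the bins used in the first answer, the column names price_group, books and perc and the result name result_cw are illustrative, and dplyr >= 1.0 is assumed for .groups.

```r
library(dplyr)
library(tidyr)

df <- tribble(
  ~id, ~price, ~type, ~number_of_book,
  "1", 10, "X", 3,
  "1", 2,  "X", 1,
  "1", 5,  "Y", 1,
  "2", 7,  "X", 4,
  "2", 6,  "X", 1,
  "2", 6,  "Y", 2,
  "3", 2,  "X", 4,
  "3", 8,  "X", 2,
  "3", 1,  "Y", 4,
  "3", 9,  "Y", 5
)

result_cw <- df %>%
  # Conditions are checked in order, so each row gets the first matching label
  mutate(price_group = case_when(
    price <= 2 ~ "less_than_two",
    price <= 5 ~ "three-five",
    price <= 6 ~ "five-six",
    TRUE       ~ "more_than_six"
  )) %>%
  # Total books per id, type and price bin
  group_by(id, type, price_group) %>%
  summarise(books = sum(number_of_book), .groups = "drop") %>%
  # Percentage of each type within an id and price bin
  group_by(id, price_group) %>%
  mutate(perc = 100 * books / sum(books)) %>%
  ungroup() %>%
  select(-books) %>%
  pivot_wider(names_from = price_group, values_from = perc,
              values_fill = list(perc = 0))

print(result_cw)
```

Compared with cut or findInterval, case_when spells out each boundary explicitly, which is more verbose but makes it easy to see which side of each threshold is inclusive.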