Question

I have a dataset that looks like this:

df <- tribble(
  ~id,  ~price, ~type, ~number_of_book,        
  "1",    10,     "X",        3,    
  "1",     2,     "X",        1, 
  "1",     5,     "Y",        1,         
  "2",     7,     "X",        4,
  "2",     6,     "X",        1,
  "2",     6,     "Y",        2, 
  "3",     2,     "X",        4,
  "3",     8,     "X",        2,
  "3",     1,     "Y",        4,
  "3",     9,     "Y",        5,
)

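Now I want to answer this question: for each id and each chosen price group, what percentage of the books are type X and what percentage are type Y? In other words, what is the distribution of book types for each id and price group?

To do this, I first need to picture this intermediate dataset: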

agg_df <- tribble(
  ~type,     ~id,       ~less_than_two, ~`three-five`, ~`five-six`, ~more_than_six,
    "X",      "1",              1,               0,           0,            3,
    "Y",      "1",              0,               1,           0,            0,
    "X",      "2",              0,               0,           1,            4,
    "Y",      "2",              0,               0,           2,            2,
    "X",      "3",              4,               0,           0,            2,
    "Y",      "3",              4,               0,           0,            5,
)

Then, this is the dataset I want:

desired_df <- tribble(
  ~type,     ~id,       ~less_than_two, ~`three-five`, ~`five-six`, ~more_than_six,
  "X",      "1",            "100%",           "0%",          "0%",       "100%",
  "Y",      "1",              "0%",         "100%",          "0%",         "0%",
  "X",      "2",              "0%",           "0%",       "33.3%",      "66.6%",
  "Y",      "2",              "0%",           "0%",       "66.6%",       "33.3%",
  "X",      "3",             "50%",           "0%",          "0%",      "28.5%",
  "Y",      "3",             "50%",           "0%",          "0%",       "71.4%",
)

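This desired dataset shows me that when id is "3" and the price bin is over $6, there are two type X books but five type Y books. So the distribution is X (28.5%) and Y (71.4%).

Note: I asked a similar question here, but this one is more complex and I could not manage to solve it: How to manipulate (aggregate) the data in R?

I would really appreciate your help. Thanks in advance.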
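The arithmetic behind that cell can be checked directly in base R. This is only an illustrative sketch; the names `books` and `share` are not part of the question.

```r
# Books in the "more than six" bin for id "3" (the rows priced 8 and 9)
books <- c(X = 2, Y = 5)

# Share of each type within the bin, in percent
share <- 100 * books / sum(books)
round(share, 1)
```

The exact values are 2/7 and 5/7, so rounding gives 28.6 and 71.4; the 28.5% quoted above is the same figure truncated rather than rounded.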

Answer 1

We can use cut on the 'price' column to create bin groups, group by 'id' and 'grp', create the percentage by dividing 'number_of_book' by the sum of 'number_of_book', and then reshape to 'wide' format.

library(dplyr)
library(tidyr)
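df %>% 
  # bin price into right-closed intervals: (-Inf, 2], (2, 5], (5, 6], (6, Inf]
  group_by(id, grp = cut(price, breaks = c(-Inf, 2, 5, 6, Inf), 
    labels = c('less_than_two', 'three-five', 'five-six', 'more_than_six'))) %>%
  # share of each row within its id/bin group, in percent
  mutate(Perc = 100 * number_of_book / sum(number_of_book)) %>%
  select(-price, -number_of_book) %>%
  # temporary row index so pivot_wider can tell rows apart
  mutate(rn = row_number()) %>%
  pivot_wider(names_from = grp, values_from = Perc, values_fill = list(Perc = 0)) %>%
  select(-rn)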
# A tibble: 6 x 6
# Groups:   id [3]
#  id    type  more_than_six less_than_two `three-five` `five-six`
#  <chr> <chr>         <dbl>         <dbl>        <dbl>      <dbl>
#1 1     X             100             100            0        0  
#2 1     Y               0               0          100        0  
#3 2     X             100               0            0       33.3
#4 2     Y               0               0            0       66.7
#5 3     X              28.6            50            0        0  
#6 3     Y              71.4            50            0        0
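To see which bin each price falls into, cut can be run on its own. The sketch below uses the distinct prices from the question; the names `prices` and `bins` are just for illustration. Intervals are right-closed by default, so a price of exactly 2 lands in less_than_two and a price of exactly 6 in five-six.

```r
# Distinct prices that appear in the question's data
prices <- c(10, 2, 5, 7, 6, 1, 8, 9)

# Same breaks and labels as in the answer above
bins <- cut(prices,
            breaks = c(-Inf, 2, 5, 6, Inf),
            labels = c("less_than_two", "three-five", "five-six", "more_than_six"))

data.frame(price = prices, bin = bins)
```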

Answer 2

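We can use findInterval to divide price into different groups, sum number_of_book within each group, and then calculate the ratio for each id and type. Finally, we spread price_group into columns to get the data in a wider format.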
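The code for this answer is not preserved here. What follows is a minimal sketch of the approach it describes, not the answerer's own code. It assumes breakpoints of 2, 5 and 6 with `left.open = TRUE` so that the bins match the right-closed intervals used by cut in Answer 1; the column name `perc` and the `group_` prefix are illustrative choices.

```r
library(dplyr)
library(tidyr)

df <- tribble(
  ~id, ~price, ~type, ~number_of_book,
  "1",     10,   "X",               3,
  "1",      2,   "X",               1,
  "1",      5,   "Y",               1,
  "2",      7,   "X",               4,
  "2",      6,   "X",               1,
  "2",      6,   "Y",               2,
  "3",      2,   "X",               4,
  "3",      8,   "X",               2,
  "3",      1,   "Y",               4,
  "3",      9,   "Y",               5
)

result <- df %>%
  # price_group is 0 for price <= 2, 1 for (2, 5], 2 for (5, 6], 3 for > 6
  mutate(price_group = findInterval(price, c(2, 5, 6), left.open = TRUE)) %>%
  # total books per id, type and price group
  group_by(id, type, price_group) %>%
  summarise(n = sum(number_of_book), .groups = "drop") %>%
  # share of each type within an id and price group, in percent
  group_by(id, price_group) %>%
  mutate(perc = 100 * n / sum(n)) %>%
  ungroup() %>%
  select(-n) %>%
  pivot_wider(names_from = price_group, values_from = perc,
              values_fill = list(perc = 0), names_prefix = "group_")

result
```

Because the books are summed per id, type and price group before the percentages are taken, this variant does not need a helper row index when reshaping.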

Answer 3

Maybe not the perfect solution, but another approach is to use case_when to define the different categories.
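The code for this answer is also missing. As a hedged sketch of what defining the categories with case_when could look like (the cut-offs follow the bins from Answer 1, and the names `binned` and `price_group` are assumptions):

```r
library(dplyr)

df <- tribble(
  ~id, ~price, ~type, ~number_of_book,
  "1",     10,   "X",               3,
  "1",      2,   "X",               1,
  "1",      5,   "Y",               1,
  "2",      7,   "X",               4,
  "2",      6,   "X",               1,
  "2",      6,   "Y",               2,
  "3",      2,   "X",               4,
  "3",      8,   "X",               2,
  "3",      1,   "Y",               4,
  "3",      9,   "Y",               5
)

binned <- df %>%
  # conditions are checked in order, so each price gets the first bin it fits
  mutate(price_group = case_when(
    price <= 2 ~ "less_than_two",
    price <= 5 ~ "three-five",
    price <= 6 ~ "five-six",
    TRUE       ~ "more_than_six"
  ))

# books per id, type and price group, ready for the percentage step
binned %>% count(id, type, price_group, wt = number_of_book)
```

From here the percentages and the wide reshape can be computed the same way as in the other answers.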
