Merging two datasets of unequal length to calculate proportions

Date: 2019-11-14 13:25:39

Tags: r

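I know there are similar questions about calculating proportions per group, but those all work within a single dataset. I have two datasets. The first contains the user ID, the date, and the total duration of mobile app use per day. The second contains the same IDs and dates, but with the duration per app category per day (so if you sum it by day for each user, you get the values in the first dataset).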

dput output of dataset 1:

daily_duration <- structure(list(user_id = c(10161L, 10161L, 10161L, 10161L, 10161L, 
10161L, 10161L, 10161L, 10161L, 10161L, 10161L, 10161L, 10161L, 
10161L, 10161L, 10161L, 10161L, 10161L, 10161L, 10161L), date = structure(c(17948, 
17949, 17950, 17951, 17952, 17953, 17954, 17955, 17956, 17957, 
17958, 17959, 17960, 17961, 17962, 17963, 17964, 17965, 17966, 
17967), class = "Date"), duration = structure(c(5212.71700000763, 
20655.6629965305, 14162.9649987221, 18286.7030012608, 15315.1349999905, 
17845.9039983749, 15864.4930007458, 14331.2430002689, 16331.9680001736, 
18098.3090002537, 20003.6570017338, 15547.8630020618, 18242.8340024948, 
24890.6929991245, 24226.1790001392, 26849.5739989281, 21208.1910011768, 
20396.9730014801, 24253.9579980373, 20673.4809997082), class = "difftime", units = "secs")), row.names = c(NA, 
-20L), class = c("grouped_df", "tbl_df", "tbl", "data.frame"), vars = "user_id", drop = TRUE, indices = list(
    0:19), group_sizes = 20L, biggest_group_size = 20L, labels = structure(list(
    user_id = 10161L), row.names = c(NA, -1L), class = "data.frame", vars = "user_id", drop = TRUE))

dput output of dataset 2:

category_duration <- structure(list(user_id = c(10161L, 10161L, 10161L, 10161L, 10161L, 
10161L, 10161L, 10161L, 10161L, 10161L, 10161L, 10161L, 10161L, 
10161L, 10161L, 10161L, 10161L, 10161L, 10161L, 10161L), date = structure(c(17948, 
17948, 17948, 17948, 17949, 17949, 17949, 17949, 17949, 17950, 
17950, 17950, 17950, 17951, 17951, 17951, 17951, 17952, 17952, 
17952), class = "Date"), categories = structure(c(1L, 2L, 3L, 
6L, 1L, 2L, 3L, 5L, 6L, 1L, 2L, 3L, 6L, 1L, 2L, 3L, 6L, 1L, 2L, 
3L), .Label = c("communication", "games & entertainment", "lifestyle", 
"news & information outlet", "social network", "utility & tools"
), class = "factor"), cat_duration = structure(c(1770.70500040054, 
1855.2380001545, 38.9109997749329, 1547.86299967766, 7010.0589993, 
10680.9569990635, 71.5590000152588, 741.676999807358, 2151.41099834442, 
5154.79599928856, 5501.70999979973, 116.311000108719, 3390.14799952507, 
12149.4220018387, 5009.53099989891, 371.340999603271, 756.408999919891, 
5633.53999876976, 8119.65800046921, 347.116999864578), class = "difftime", units = "secs")), row.names = c(NA, 
-20L), class = c("grouped_df", "tbl_df", "tbl", "data.frame"), vars = c("user_id", 
"date"), drop = TRUE, indices = list(0:3, 4:8, 9:12, 13:16, 17:19), group_sizes = c(4L, 
5L, 4L, 4L, 3L), biggest_group_size = 5L, labels = structure(list(
    user_id = c(10161L, 10161L, 10161L, 10161L, 10161L), date = structure(c(17948, 
    17949, 17950, 17951, 17952), class = "Date")), row.names = c(NA, 
-5L), class = "data.frame", vars = c("user_id", "date"), drop = TRUE))

I want to add a new column to the second dataset that shows each category's duration as a proportion of that day's total duration, like this:

     user_id date       categories            cat_duration     proportion 
     <int> <date>     <fct>                 <time>        
 1   10161 2019-02-21 communication          1770.705 secs       20%
 2   10161 2019-02-21 games & entertainment  1855.238 secs       21%
 3   10161 2019-02-21 lifestyle                38.911 secs       0.2%
 4   10161 2019-02-21 utility & tools        1547.863 secs       2%
 5   10161 2019-02-22 communication          7010.059 secs       14%
 6   10161 2019-02-22 games & entertainment 10680.957 secs       22%

However, I tried the following, which I assumed would not work because the two columns have different lengths:

category_duration$proportion <- (category_duration$cat_duration / daily_duration$duration)

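On top of that, the second argument is a problem in itself because it is a time object. The error is: second argument of / cannot be a "difftime" object. Thanks in advance for your help!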

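2 answers:

Answer 0 (score: 3)

I would approach it as follows. This joins the daily durations onto the category durations, converts the difftime objects to numeric, and then divides one by the other.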

library(dplyr)

# Attach each day's total to the per-category rows, then divide as plain numbers
category_duration %>%
  left_join(daily_duration, by = c("user_id", "date")) %>% 
  mutate(cat_duration_proportion = as.numeric(cat_duration, units = "secs") / as.numeric(duration, units = "secs"))
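
For anyone who wants to try the approach without loading the full dput output, here is a minimal self-contained sketch. It assumes dplyr is installed. The table names `daily` and `by_category` and the column name `proportion` are illustrative only. The values are the first two days from the question, rounded to three decimals.

```r
library(dplyr)

# Stand-in for dataset 1: one row per user and day (values rounded from the question)
daily <- data.frame(
  user_id  = c(10161L, 10161L),
  date     = as.Date(c("2019-02-21", "2019-02-22")),
  duration = as.difftime(c(5212.717, 20655.663), units = "secs")
)

# Stand-in for dataset 2: several category rows per user and day
by_category <- data.frame(
  user_id      = 10161L,
  date         = as.Date(rep(c("2019-02-21", "2019-02-22"), times = c(4, 5))),
  categories   = c("communication", "games & entertainment", "lifestyle",
                   "utility & tools", "communication", "games & entertainment",
                   "lifestyle", "social network", "utility & tools"),
  cat_duration = as.difftime(c(1770.705, 1855.238, 38.911, 1547.863, 7010.059,
                               10680.957, 71.559, 741.677, 2151.411),
                             units = "secs")
)

# The join repeats each day's total on every category row of that day,
# so both columns have the same length when they are divided.
result <- by_category %>%
  left_join(daily, by = c("user_id", "date")) %>%
  mutate(proportion = as.numeric(cat_duration, units = "secs") /
                      as.numeric(duration, units = "secs"))

print(result)
```

With these values the first row comes out at about 0.34 (1770.705 / 5212.717), and the proportions within each day add up to 1, which is a quick sanity check that the join matched the right totals.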

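Answer 1 (score: 1)

Your columns cat_duration and duration are not plain numbers; they are of type difftime. This is a data type for time differences that consists of a number together with a unit.

Does this answer help you? Divide two difftime objects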
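
To illustrate the point about units, here is a small base R sketch (no packages needed). The variable names `x`, `y`, and `ratio` are made up for the example. It shows that a difftime carries a unit, and that converting both values to the same unit with `as.numeric()` gives a plain number that can be divided safely.

```r
# A difftime stores a number plus a unit
x <- as.difftime(90, units = "mins")
y <- as.difftime(3, units = "hours")

units(x)                        # "mins"
as.numeric(x)                   # 90, expressed in the stored unit
as.numeric(x, units = "hours")  # 1.5, converted to hours

# Convert both sides to the same unit before dividing,
# so the result is a unitless ratio rather than a difftime
ratio <- as.numeric(x, units = "secs") / as.numeric(y, units = "secs")
ratio                           # 0.5
```

Passing `units` explicitly on both sides matters when the two columns might be stored in different units, for example seconds in one and minutes in the other.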