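I know there are similar questions about calculating proportions per group, but those work within a single dataset. I have two datasets: one contains the user ID, the date, and the total duration of mobile app usage per day. The other contains the same IDs and dates, but with the duration per app category per day (which means that if you aggregate it per user per day, the totals equal the first dataset).
dput output for dataset 1: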
dat_1 <- structure(list(user_id = c(10161L, 10161L, 10161L, 10161L, 10161L,
10161L, 10161L, 10161L, 10161L, 10161L, 10161L, 10161L, 10161L,
10161L, 10161L, 10161L, 10161L, 10161L, 10161L, 10161L), date = structure(c(17948,
17949, 17950, 17951, 17952, 17953, 17954, 17955, 17956, 17957,
17958, 17959, 17960, 17961, 17962, 17963, 17964, 17965, 17966,
17967), class = "Date"), duration = structure(c(5212.71700000763,
20655.6629965305, 14162.9649987221, 18286.7030012608, 15315.1349999905,
17845.9039983749, 15864.4930007458, 14331.2430002689, 16331.9680001736,
18098.3090002537, 20003.6570017338, 15547.8630020618, 18242.8340024948,
24890.6929991245, 24226.1790001392, 26849.5739989281, 21208.1910011768,
20396.9730014801, 24253.9579980373, 20673.4809997082), class = "difftime", units = "secs")), row.names = c(NA,
-20L), class = c("grouped_df", "tbl_df", "tbl", "data.frame"), vars = "user_id", drop = TRUE, indices = list(
0:19), group_sizes = 20L, biggest_group_size = 20L, labels = structure(list(
user_id = 10161L), row.names = c(NA, -1L), class = "data.frame", vars = "user_id", drop = TRUE))
dput output for dataset 2:
dat_2 <- structure(list(user_id = c(10161L, 10161L, 10161L, 10161L, 10161L,
10161L, 10161L, 10161L, 10161L, 10161L, 10161L, 10161L, 10161L,
10161L, 10161L, 10161L, 10161L, 10161L, 10161L, 10161L), date = structure(c(17948,
17948, 17948, 17948, 17949, 17949, 17949, 17949, 17949, 17950,
17950, 17950, 17950, 17951, 17951, 17951, 17951, 17952, 17952,
17952), class = "Date"), categories = structure(c(1L, 2L, 3L,
6L, 1L, 2L, 3L, 5L, 6L, 1L, 2L, 3L, 6L, 1L, 2L, 3L, 6L, 1L, 2L,
3L), .Label = c("communication", "games & entertainment", "lifestyle",
"news & information outlet", "social network", "utility & tools"
), class = "factor"), cat_duration = structure(c(1770.70500040054,
1855.2380001545, 38.9109997749329, 1547.86299967766, 7010.0589993,
10680.9569990635, 71.5590000152588, 741.676999807358, 2151.41099834442,
5154.79599928856, 5501.70999979973, 116.311000108719, 3390.14799952507,
12149.4220018387, 5009.53099989891, 371.340999603271, 756.408999919891,
5633.53999876976, 8119.65800046921, 347.116999864578), class = "difftime", units = "secs")), row.names = c(NA,
-20L), class = c("grouped_df", "tbl_df", "tbl", "data.frame"), vars = c("user_id",
"date"), drop = TRUE, indices = list(0:3, 4:8, 9:12, 13:16, 17:19), group_sizes = c(4L,
5L, 4L, 4L, 3L), biggest_group_size = 5L, labels = structure(list(
user_id = c(10161L, 10161L, 10161L, 10161L, 10161L), date = structure(c(17948,
17949, 17950, 17951, 17952), class = "Date")), row.names = c(NA,
-5L), class = "data.frame", vars = c("user_id", "date"), drop = TRUE))
I want to add a new column to the second dataset that shows each category's duration as a proportion of that day's total duration, like this:
user_id date categories cat_duration proportion
<int> <date> <fct> <time>
1 10161 2019-02-21 communication 1770.705 secs 20%
2 10161 2019-02-21 games & entertainment 1855.238 secs 21%
3 10161 2019-02-21 lifestyle 38.911 secs 0.2%
4 10161 2019-02-21 utility & tools 1547.863 secs 2%
5 10161 2019-02-22 communication 7010.059 secs 14%
6 10161 2019-02-22 games & entertainment 10680.957 secs 22%
However, I tried the following, which I assumed would not work because the two columns have different lengths:
category_duration$proportion <- (category_duration$cat_duration / daily_duration$duration)
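The second argument is also a problem in itself, because it is a difftime object. The error is: second argument of / cannot be a "difftime" object. Thanks in advance for your help!
Answer 0 (score: 3)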
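I would approach it as follows. This joins the daily durations onto the category durations by user and date, converts the difftime objects to numeric seconds, and then divides the two.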
library(dplyr)
# category_duration is dat_2 and daily_duration is dat_1 from the question
category_duration %>%
  left_join(daily_duration, by = c("user_id", "date")) %>%
  mutate(cat_duration_proportion = as.numeric(cat_duration, units = "secs") / as.numeric(duration, units = "secs"))
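For anyone without dplyr at hand, the same idea can be sketched in base R with merge(). This is only an illustration under stated assumptions: the toy values below are hypothetical and are not the question's data, and the second variant relies on the question's statement that the category durations for each user and day add up to the daily total.

```r
# Toy data (hypothetical values) shaped like the two datasets in the question
daily_duration <- data.frame(
  user_id  = c(1L, 1L),
  date     = as.Date(c("2019-02-21", "2019-02-22")),
  duration = as.difftime(c(100, 200), units = "secs")
)
category_duration <- data.frame(
  user_id      = rep(1L, 4),
  date         = as.Date(c("2019-02-21", "2019-02-21", "2019-02-22", "2019-02-22")),
  categories   = c("communication", "lifestyle", "communication", "lifestyle"),
  cat_duration = as.difftime(c(25, 75, 50, 150), units = "secs")
)

# Attach each day's total to every category row for that user and day
joined <- merge(category_duration, daily_duration,
                by = c("user_id", "date"), all.x = TRUE)

# Convert both difftime columns to plain seconds before dividing
joined$proportion <- as.numeric(joined$cat_duration, units = "secs") /
  as.numeric(joined$duration, units = "secs")

# Variant without the first dataset: divide by the per-user, per-day sum.
# Assumes the categories of a day add up to that day's total.
secs <- as.numeric(category_duration$cat_duration, units = "secs")
day_key <- format(category_duration$date)
alt_proportion <- secs / ave(secs, category_duration$user_id, day_key, FUN = sum)
```

Because the join duplicates each daily total across all category rows of that day, both columns have the same length by the time they are divided, which addresses the length mismatch from the question.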
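Answer 1 (score: 1)
Your columns cat_duration and duration are not plain numbers; they are of type difftime. This is a data type for time differences that consists not only of a number but also of a unit.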
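A small sketch of what that means in practice, using made-up values rather than the question's data: a difftime carries its unit along with the number, and as.numeric() with an explicit units argument strips the unit in a controlled way, so two durations stored in different units can still be divided safely.

```r
# A difftime stores a number together with a unit
d <- as.difftime(90, units = "mins")
unit_of_d <- units(d)                      # the unit stored with the value

# Converting to a plain number, optionally in a different unit
in_mins  <- as.numeric(d)                  # value in the stored unit
in_secs  <- as.numeric(d, units = "secs")  # same duration in seconds
in_hours <- as.numeric(d, units = "hours") # same duration in hours

# Dividing two durations: convert both to the same unit first
a <- as.difftime(30, units = "mins")
b <- as.difftime(1, units = "hours")
ratio <- as.numeric(a, units = "secs") / as.numeric(b, units = "secs")
```

Passing units explicitly to both conversions matters when the two columns might be stored in different units, since a bare as.numeric() returns the value in whatever unit the object happens to carry.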
Does this answer help you? Divide two difftime objects