Question

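I am working with a large electronic database that records user-driven events. Essentially, I want to get the proportion/percentage of new users of the service for each month of the year. Here is a mock example of the data: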

    UserId    Month   UserEventId

    Tyrhjj01   Jan     0998907
    Fghhey21   Jan     0989892
    Hyhkio52   Jan     7782901
    hejdoe78   Jan     3889201
    Tyrhjj01   Feb     7829930
    sjjwilsn   Feb     7728910
    Tyrhjj01   Feb     9203749
    nnkilo89   Feb     7728912
    Fghhey21   Feb     4463782

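...and so on. As you can see, some customers use the service regularly, while others appear only in February. I want the proportion of old (returning) customers versus customers who are new to the system. An illustration to help explain: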

Percentage of new versus old customers: [illustration]

I have tried several examples with dplyr and data.table, but to no avail. Any help would be much appreciated!

Answer 1

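If you create a new dataset holding the unique users for each month, you can use rowid from data.table to check whether each user already appeared in df in an earlier month.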

library(data.table)
setDT(df)

users <- df[, .(user = unique(UserId)), Month] 
users[, visit := rowid(user)] # create variable for number of months user has visited
users[, .(new_pct = mean(visit == 1)), Month] 

#    Month new_pct
# 1:   Jan     1.0
# 2:   Feb     0.5
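
To make the role of rowid in the code above more concrete, here is a small sketch on a made-up vector (the ids are hypothetical, not taken from the question). rowid returns, for each element, how many times that value has occurred so far, so a value of 1 marks a first appearance. This only identifies "new" users correctly if the rows are already in chronological order.

```r
library(data.table)

# Made-up user ids, listed in order of appearance (hypothetical values)
ids <- c("a", "b", "a", "c", "a")

# Running occurrence count of each id
occurrence <- rowid(ids)
print(occurrence)
# [1] 1 1 2 1 3

# TRUE only where an id shows up for the first time
first_seen <- occurrence == 1
print(first_seen)
# [1]  TRUE  TRUE FALSE  TRUE FALSE
```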

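Or with the tidyverse

Edit: if your Month column actually holds character month names, the solution below does not work. As the output shows, dplyr grouping reorders the data (unlike data.table): the groups are sorted, so for character month names Feb comes before Jan, and the method gives incorrect results. I will leave the code below, because it does work if Month is a date-class column.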

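library(dplyr)
library(tidyr)

df %>% 
  group_by(Month) %>% 
  do(user = unique(.$UserId)) %>%   # one list of unique users per month
  unnest(user) %>%                  # expand to one row per user per month
  group_by(user) %>% 
  mutate(visit = row_number()) %>%  # number of months the user has visited so far
  group_by(Month) %>% 
  summarise(new_pct = mean(visit == 1))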

# # A tibble: 2 x 2
#   Month new_pct
#   <chr>   <dbl>
# 1 Feb     1.00 
# 2 Jan     0.500
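
One possible way around the ordering problem described in the edit note is to turn Month into a factor whose levels follow the calendar, so that sorting and grouping put Jan before Feb. The sketch below is not part of the original answer. It assumes that Month contains English three-letter abbreviations matching R's built-in month.abb and that all rows fall within a single year. It also swaps the do()/unnest step for distinct(), which is a different technique from the one used above.

```r
library(dplyr)

# Same users and months as the mock data in the question
df <- data.frame(
  UserId = c("Tyrhjj01", "Fghhey21", "Hyhkio52", "hejdoe78",
             "Tyrhjj01", "sjjwilsn", "Tyrhjj01", "nnkilo89", "Fghhey21"),
  Month  = c(rep("Jan", 4), rep("Feb", 5)),
  stringsAsFactors = FALSE
)

result <- df %>%
  # Assumption: month names match month.abb ("Jan", "Feb", ...)
  mutate(Month = factor(Month, levels = month.abb)) %>%
  distinct(Month, UserId) %>%       # one row per user per month
  arrange(Month) %>%                # calendar order, thanks to the factor levels
  group_by(UserId) %>%
  mutate(visit = row_number()) %>%  # 1 = first month in which the user appears
  group_by(Month) %>%
  summarise(new_pct = mean(visit == 1))

print(result)
```

With the mock data this should give a new_pct of 1.0 for Jan and 0.5 for Feb, matching the data.table result.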

Data used:

df <- fread("
UserId    Month   UserEventId
Tyrhjj01   Jan     0998907
Fghhey21   Jan     0989892
Hyhkio52   Jan     7782901
hejdoe78   Jan     3889201
Tyrhjj01   Feb     7829930
sjjwilsn   Feb     7728910
Tyrhjj01   Feb     9203749
nnkilo89   Feb     7728912
Fghhey21   Feb     4463782
")
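
Since the question asks for old versus new customers, the returning share can be reported next to new_pct using the same rowid idea. This is a minimal sketch rather than part of the original answer: the column name returning_pct is made up for illustration, the table is rebuilt inline so the block runs on its own, and it assumes the rows are already in chronological order.

```r
library(data.table)

# Same users and months as the data above, rebuilt inline
df <- data.table(
  UserId = c("Tyrhjj01", "Fghhey21", "Hyhkio52", "hejdoe78",
             "Tyrhjj01", "sjjwilsn", "Tyrhjj01", "nnkilo89", "Fghhey21"),
  Month  = c(rep("Jan", 4), rep("Feb", 5))
)

# One row per user per month; unique() keeps the original row order
users <- unique(df[, .(Month, UserId)])
users[, visit := rowid(UserId)]     # 1 = first month in which the user appears

# returning_pct is a hypothetical name for the share of users seen before
shares <- users[, .(new_pct       = mean(visit == 1),
                    returning_pct = mean(visit > 1)),
                by = Month]
print(shares)
```

For each month the two columns sum to 1, so returning_pct is simply 1 - new_pct.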

Retrieve the proportion of unique customers by month in R
