Finding the number of new users

Time: 2018-07-17 14:19:13

Tags: r

I am using R and have a data table of people who play an online game.

userId,  login,      country
132,     2017-01-01, A
133,     2017-01-01, B
133,     2018-01-01, B
432,     2018-01-01, A

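I want to find the number of new users in 2018 for each country, where a new user is one who logged in during 2018 but not during 2017. For example, if the table above were the entire dataset, country A would have 1 new user in 2018 (user 432) and country B would have 0 (because user 133 also logged in during 2017).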
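To make the definition concrete, here is a minimal base R sketch. The object names (`logins`, `new_ids`, `users`, `new_per_country`) are made up for illustration, and it assumes each user belongs to a single country; it is not taken from any of the answers.

```r
# Sample data from the question, typed in directly
logins <- data.frame(
  userId  = c(132, 133, 133, 432),
  login   = as.Date(c("2017-01-01", "2017-01-01", "2018-01-01", "2018-01-01")),
  country = c("A", "B", "B", "A"),
  stringsAsFactors = FALSE
)

# Year of each login as an integer
logins$yr <- as.integer(format(logins$login, "%Y"))

# Users seen in 2018 but not in 2017
new_ids <- setdiff(logins$userId[logins$yr == 2018],
                   logins$userId[logins$yr == 2017])

# One row per (user, country) so repeated logins are not double counted
users <- unique(logins[, c("userId", "country")])

# Count new users per country; countries with no new users get 0
new_per_country <- tapply(users$userId %in% new_ids, users$country, sum)
new_per_country
```

On the sample data this gives 1 for country A and 0 for country B, matching the expected result described above.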

What is the fastest way to do this?

3 answers:

Answer 0: (score: 3)

If the dataset is large, data.table is probably the fastest option:

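library(data.table)
setDT(data)
data[, login := as.Date(login)]
# earliest login year per user (one row per user and country), then count users first seen in 2018
data[, .(year = min(year(login))), by = .(userId, country)
     ][, sum(year == 2018), by = country]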
   country V1
1:       A  1
2:       B  0

where data is:

data <- fread("userId,  login,      country
132,     2017-01-01, A
133,     2017-01-01, B
133,     2018-01-01, B
432,     2018-01-01, A")
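If the same count is needed for other years, the data.table logic can be wrapped in a small function. This is only a sketch: `count_new_users`, `first_seen`, and `target_year` are hypothetical names, and it treats a user as new in a year when that year is the earliest login year in the data, which matches the question's definition when only 2017 and 2018 are present.

```r
library(data.table)

# Same sample data as in the question
dt <- fread("userId,login,country
132,2017-01-01,A
133,2017-01-01,B
133,2018-01-01,B
432,2018-01-01,A")
dt[, login := as.Date(login)]

# Hypothetical helper: number of new users per country for a given year
count_new_users <- function(dt, target_year) {
  # Earliest login year per user and country
  first_seen <- dt[, .(first_year = min(year(login))), by = .(userId, country)]
  # Count users whose first login year equals target_year
  first_seen[, .(new_users = sum(first_year == target_year)), by = country]
}

res <- count_new_users(dt, 2018)
res
```

Grouping by both `userId` and `country` keeps one row per user, so a user with several logins in the target year is counted once.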

Edit: the same logic in dplyr (somewhat more verbose):

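library(dplyr)
library(lubridate) # for year(), unless data.table is already attached
data %>% 
  mutate(year = year(as.Date(login))) %>%
  group_by(userId) %>%
  # first(country) keeps exactly one row per user
  summarise(myear = min(year), country = first(country)) %>%
  group_by(country) %>%
  summarise(n_new_users = sum(myear == 2018))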

  country n_new_users
  <chr>         <int>
1 A                 1
2 B                 0

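Edit 2: the same logic in base R (perhaps not the best approach), with a few pipes to make it easier to follow (the %>% pipe still requires magrittr or dplyr):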

data$year <- as.integer(substr(data$login, 1, 4))
data %>% 
  aggregate(year ~ userId + country, ., min) %>%
  aggregate(year ~ country, ., function(x) sum(x == 2018))
  country year
1       A    1
2       B    0

Answer 1: (score: 1)

Here is my take:

library(dplyr)
library(lubridate)
data %>%
  mutate(years = year(as.Date(login))) %>%
  group_by(userId) %>%
  filter(min(years) == 2018) %>% # keep users whose earliest login is in 2018
  ungroup() %>%
  distinct(userId, country) %>% # one row per user, even with several 2018 logins
  count(country)

Answer 2: (score: 0)


library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

text <-
  "userId,  login,      country
  132,     2017-01-01, A
  133,     2017-01-01, B
  133,     2018-01-01, B
  432,     2018-01-01, A"

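# strip.white = TRUE removes the padding around each field (otherwise country is " A")
df <- read.csv(text = text, stringsAsFactors = FALSE, strip.white = TRUE) %>%
  mutate(yr = as.numeric(gsub("-.*", "", login)))

# users who logged in during 2017
svnt_peeps <- df %>% filter(yr == 2017)

df %>%
  filter(yr == 2018) %>%
  anti_join(svnt_peeps, "userId") %>%
  distinct(userId, country) %>% # count each user once
  group_by(country) %>%
  count()
#> # A tibble: 1 x 2
#> # Groups:   country [1]
#>   country     n
#>   <chr>   <int>
#> 1 A           1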
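Note that the anti-join result only lists countries that have at least one new user, so country B does not appear. If zero counts are wanted, as in the question's example, one possible sketch is to join the counts back onto the full list of countries. The names `old_users`, `new_users`, `result`, and `n_new_users` are illustrative only, and the data frame is typed in directly with the year already extracted.

```r
library(dplyr)

# Sample data with the login year already extracted
df <- data.frame(
  userId  = c(132, 133, 133, 432),
  yr      = c(2017, 2017, 2018, 2018),
  country = c("A", "B", "B", "A"),
  stringsAsFactors = FALSE
)

# Users with at least one 2017 login
old_users <- df %>% filter(yr == 2017) %>% distinct(userId)

# New users in 2018, one row per user and country, counted per country
new_users <- df %>%
  filter(yr == 2018) %>%
  anti_join(old_users, by = "userId") %>%
  distinct(userId, country) %>%
  count(country, name = "n_new_users")

# Start from every country in the data and fill missing counts with 0
result <- df %>%
  distinct(country) %>%
  left_join(new_users, by = "country") %>%
  mutate(n_new_users = coalesce(n_new_users, 0L)) %>%
  arrange(country)
result
```

The left join keeps every country that occurs anywhere in the data, and `coalesce()` turns the missing count for countries without new users into 0.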