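I have a data frame (df) with device ID and local date columns. I want to assign the same user ID to device IDs that always appear together, that is, on exactly the same set of local dates. I have provided an example below.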
device_id <- c("x1", "x1", "x1", "x2", "x2", "x3", "x3", "x3", "x4", "x4", "x5",
"x5", "x5", "x5", "x5", "x5", "x5", "x6", "x6", "x7", "x7", "x8",
"x8", "x9", "x9", "x9")
local_date <- c("2019-01-13", "2019-01-14", "2019-01-15", "2019-01-03", "2019-01-04",
"2019-01-10", "2019-01-11", "2019-01-12", "2019-01-11", "2019-01-12",
"2019-01-03", "2019-01-05", "2019-01-06", "2019-01-07", "2019-01-08",
"2019-01-13", "2019-01-23", "2019-01-03", "2019-01-04", "2019-10-23",
"2019-10-28", "2019-10-23", "2019-10-28", "2019-01-13", "2019-01-14",
"2019-01-15")
df <- data.frame(device_id, local_date)
df$local_date <- as.Date(df$local_date)
This is the data frame I want to create.
expected_df <- data.frame(device_id=c("x1", "x9", "x2", "x6", "x3", "x4", "x5", "x7", "x8"),
user_id=c(1, 1, 2, 2, 3, 4, 5, 6, 6))
expected_df
# device_id user_id
# 1 x1 1
# 2 x9 1
# 3 x2 2
# 4 x6 2
# 5 x3 3
# 6 x4 4
# 7 x5 5
# 8 x7 6
# 9 x8 6
Notice that x1 and x9 appear on exactly the same local dates, which is why they are assigned the same user ID. The same is true of x7 and x8.
How can I achieve this?
Answer 0 (score: 5)
How about the following:
library(tidyverse)
df %>% group_by(device_id) %>%
mutate(footprint=paste(sort(as.character(local_date)), collapse=";")) %>%
ungroup %>%
mutate(id=as.numeric(factor(footprint))) %>%
filter(!duplicated(device_id)) %>% arrange(id)
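Explanation:
We create a footprint string for each device: the sorted dates on which that device was seen, joined by ";". Next, we assign a numeric ID based on the footprint (with the help of factor), so devices with identical footprints receive the same ID.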
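To see how factor turns footprints into ids, here is a minimal sketch on three made-up footprint strings (the values in `footprint` are illustrative and not taken from the data above). factor sorts the distinct strings into levels, and as.numeric returns each element's level index.

```r
# Hypothetical footprints for three devices (illustrative values only)
footprint <- c(x1 = "2019-01-13;2019-01-14",
               x2 = "2019-01-03",
               x9 = "2019-01-13;2019-01-14")

# Levels are the sorted distinct strings; the numeric code is the level index
ids <- as.numeric(factor(footprint))
ids
# x1 and x9 share a footprint, so they share an id; x2 gets a different one
```

Because the levels are sorted, the ids follow the sorted order of the footprint strings rather than the order in which devices appear in the data.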
Base R:
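# One footprint per device: its sorted dates collapsed into a single string
d2id <- tapply(df$local_date, df$device_id, function(x) paste(sort(x), collapse=";"))
d2id <- data.frame(device_id=names(d2id), id=d2id)
# Identical footprints map to the same factor level, hence the same id
d2id$id <- as.numeric(factor(d2id$id))
d2id <- d2id[ order(d2id$id), ]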
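Ids derived from factor follow the sorted order of the footprints, so they will not necessarily match the numbering shown in the question. The following is a sketch, not part of the original answer, that numbers footprints in order of first appearance using match and unique; the names `fp` and `result` are my own. It assumes the desired numbering is "order of first appearance by sorted device id", which happens to reproduce expected_df for this sample.

```r
# Sample data from the question
device_id <- c("x1", "x1", "x1", "x2", "x2", "x3", "x3", "x3", "x4", "x4", "x5",
               "x5", "x5", "x5", "x5", "x5", "x5", "x6", "x6", "x7", "x7", "x8",
               "x8", "x9", "x9", "x9")
local_date <- as.Date(c("2019-01-13", "2019-01-14", "2019-01-15", "2019-01-03", "2019-01-04",
                        "2019-01-10", "2019-01-11", "2019-01-12", "2019-01-11", "2019-01-12",
                        "2019-01-03", "2019-01-05", "2019-01-06", "2019-01-07", "2019-01-08",
                        "2019-01-13", "2019-01-23", "2019-01-03", "2019-01-04", "2019-10-23",
                        "2019-10-28", "2019-10-23", "2019-10-28", "2019-01-13", "2019-01-14",
                        "2019-01-15"))
df <- data.frame(device_id, local_date, stringsAsFactors = FALSE)

# One footprint per device: sorted, de-duplicated dates joined by ";"
fp <- vapply(split(df$local_date, df$device_id),
             function(x) paste(format(sort(unique(x))), collapse = ";"),
             character(1))

# Number each distinct footprint by the position of its first occurrence
result <- data.frame(device_id = names(fp),
                     user_id = match(fp, unique(fp)),
                     stringsAsFactors = FALSE)

# Order by user id (ties keep their device order) and reset row names
result <- result[order(result$user_id), ]
rownames(result) <- NULL
result
```

Using unique(x) inside the footprint also guards against a device being logged more than once on the same date, which would otherwise produce a different string for the same set of dates.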
Answer 1 (score: 2)
Using the basic logic from @January, another tidyverse possibility could be:
df %>%
group_by(device_id) %>%
summarise(footprint = str_c(str_sort(local_date), collapse = ";")) %>%
ungroup() %>%
transmute(device_id,
user_id = group_indices(., footprint))
  device_id user_id
  <chr>       <int>
1 x1              5
2 x2              1
3 x3              3
4 x4              4
5 x5              2
6 x6              1
7 x7              6
8 x8              6
9 x9              5
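As far as I know, passing grouping columns through the `...` argument of group_indices() is deprecated from dplyr 1.0.0 onward. The following is a hedged sketch of the same idea using group_by() with cur_group_id() instead; it assumes dplyr 1.0.0 or later is installed, and the name `result` is my own.

```r
library(dplyr)

# Sample data from the question
device_id <- c("x1", "x1", "x1", "x2", "x2", "x3", "x3", "x3", "x4", "x4", "x5",
               "x5", "x5", "x5", "x5", "x5", "x5", "x6", "x6", "x7", "x7", "x8",
               "x8", "x9", "x9", "x9")
local_date <- as.Date(c("2019-01-13", "2019-01-14", "2019-01-15", "2019-01-03", "2019-01-04",
                        "2019-01-10", "2019-01-11", "2019-01-12", "2019-01-11", "2019-01-12",
                        "2019-01-03", "2019-01-05", "2019-01-06", "2019-01-07", "2019-01-08",
                        "2019-01-13", "2019-01-23", "2019-01-03", "2019-01-04", "2019-10-23",
                        "2019-10-28", "2019-10-23", "2019-10-28", "2019-01-13", "2019-01-14",
                        "2019-01-15"))
df <- data.frame(device_id, local_date, stringsAsFactors = FALSE)

result <- df %>%
  group_by(device_id) %>%
  # One footprint per device: sorted dates joined by ";"
  summarise(footprint = paste(sort(as.character(local_date)), collapse = ";")) %>%
  # Regroup by footprint; cur_group_id() numbers the groups 1, 2, ...
  group_by(footprint) %>%
  mutate(user_id = cur_group_id()) %>%
  ungroup() %>%
  select(device_id, user_id)

result
```

On the sample data this should give the same grouping as the output above: devices with identical footprints (x1 and x9, x2 and x6, x7 and x8) share a user_id.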