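I have two data tables: one containing customer orders (showing the customer ID and the date of each purchase) and one containing customer segmentation (showing which segment a customer was assigned to during a given period).
I want to add the segment from table 2 as a new variable in table 1, but of course only the segment the customer was in at the time of the order.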
Customer_Orders <- data.table(
customer_ID = c("A", "A"),
order_date = c("2017-06-30", "2019-07-30")
)
head(Customer_Orders)
customer_ID order_date
1: A 2017-06-30
2: A 2019-07-30
Customer_Segmentation <- data.table(
customer_ID = c("A", "A", "A"),
segment = c("1", "2", "3"),
valid_from = c("2017-01-01", "2018-01-01", "2019-01-01"),
valid_until = c("2017-12-31", "2018-12-31", "2019-12-31")
)
head(Customer_Segmentation)
customer_ID segment valid_from valid_until
1: A 1 2017-01-01 2017-12-31
2: A 2 2018-01-01 2018-12-31
3: A 3 2019-01-01 2019-12-31
This is the result I am looking for, built by hand:
Result <- data.table(
customer_ID = c("A", "A"),
order_date = c("2017-06-30", "2019-07-30"),
segment = c(1, 3)
)
head(Result)
customer_ID order_date segment
1: A 2017-06-30 1
2: A 2019-07-30 3
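Currently, my solution is to do a right join that essentially adds every possible segment to each row of the customer orders table, and then to exclude all rows where the order date does not fall within that segment's validity period. However, since my dataset is large, this is a very slow and cumbersome solution.
Answer 0: (score: 3)
The simplest way is probably to use the sqldf package: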
library(sqldf)
sqldf("select * from Customer_Orders
left join Customer_Segmentation
on order_date between valid_from and valid_until
and Customer_Orders.customer_ID = Customer_Segmentation.customer_ID")
# customer_ID order_date customer_ID..3 segment valid_from valid_until
# 1 A 2017-06-30 A 1 2017-01-01 2017-12-31
# 2 A 2019-07-30 A 3 2019-01-01 2019-12-31
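It simply joins the tables wherever the order date falls between valid_from and valid_until for the same customer.
But if you insist on using data.table, see below: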
setkey(Customer_Segmentation,customer_ID,valid_from)
setkey(Customer_Orders,customer_ID,order_date)
ans <- Customer_Segmentation[Customer_Orders,list(.valid_from=valid_from,
valid_until,order_date,segment),
by=.EACHI,roll=TRUE][,`:=`(.valid_from=NULL)]
ans
# customer_ID valid_from valid_until order_date segment
# 1: A 2017-06-30 2017-12-31 2017-06-30 1
# 2: A 2019-07-30 2019-12-31 2019-07-30 3
The extra columns are easy to drop if you do not need them.
Answer 1: (score: 0)
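As a rough illustration of dropping the extra columns after a rolling join, here is a minimal sketch. It is not the answer's original code: it assumes the dates are converted to Date first, and the table names `orders` and `segments` and the helper column `join_date` are hypothetical names introduced only for this example.

```r
library(data.table)

# Copies of the question's tables, with the dates stored as Date
orders <- data.table(
  customer_ID = c("A", "A"),
  order_date  = as.Date(c("2017-06-30", "2019-07-30"))
)
segments <- data.table(
  customer_ID = c("A", "A", "A"),
  segment     = c("1", "2", "3"),
  valid_from  = as.Date(c("2017-01-01", "2018-01-01", "2019-01-01")),
  valid_until = as.Date(c("2017-12-31", "2018-12-31", "2019-12-31"))
)

# Hypothetical helper column so both tables share a join column name
orders[, join_date := order_date]
segments[, join_date := valid_from]

# Rolling join: each order picks up the latest segment row that
# started on or before the order date
rolled <- segments[orders, on = .(customer_ID, join_date), roll = TRUE]

# Guard against gaps between periods, then keep only the needed columns
res <- rolled[order_date <= valid_until, .(customer_ID, order_date, segment)]
res
```

The `order_date <= valid_until` filter matters only if the validity periods can have gaps; a rolling join on its own checks just the start of each period.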
How about this?
Your data (fixed):
library(tidyverse)
library(lubridate)
Customer_Orders <- tibble(
  customer_ID = c("A", "A"),
  order_date = c("2017-06-30", "2019-07-30"))
Customer_Segmentation <- tibble(
  customer_ID = c("A", "A", "A"),
  segment = c("1", "2", "3"),
  valid_from = c("2017-01-01", "2018-01-01", "2019-01-01"),
  valid_until = c("2017-12-31", "2018-12-31", "2019-12-31"))
Code: the first two steps just use lubridate to create dates from the initial tables. The next joins everything.
# Parse the character columns into Date objects
Customer_Orders2 <- Customer_Orders %>%
  mutate(order_date = ymd(order_date))
Customer_Segmentation2 <- Customer_Segmentation %>%
  mutate(valid_from = ymd(valid_from)) %>%
  mutate(valid_until = ymd(valid_until))
# Pair every order with every segment period of the same customer
Customer_Orders_join <- full_join(Customer_Orders2, Customer_Segmentation2, by = "customer_ID")
This selects the segment based on the intervals.
Customer_Orders3 <- Customer_Orders_join %>%
  filter(order_date %within% interval(valid_from, valid_until))
This yields one row per order: segment 1 for the 2017-06-30 order and segment 3 for the 2019-07-30 order.
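Joining everything and then filtering is the same pattern the question describes as slow on large data. As a hedged alternative sketch, and assuming dplyr 1.1.0 or later is available, `join_by()` with `between()` can express the interval condition inside the join itself. The names `orders`, `segments`, and `result` are hypothetical and used only for this example.

```r
library(dplyr)

# Copies of the question's tables, with the dates stored as Date
orders <- tibble(
  customer_ID = c("A", "A"),
  order_date  = as.Date(c("2017-06-30", "2019-07-30"))
)
segments <- tibble(
  customer_ID = c("A", "A", "A"),
  segment     = c("1", "2", "3"),
  valid_from  = as.Date(c("2017-01-01", "2018-01-01", "2019-01-01")),
  valid_until = as.Date(c("2017-12-31", "2018-12-31", "2019-12-31"))
)

# Match on customer_ID and require order_date to lie within
# [valid_from, valid_until]; orders without a matching period keep NA
result <- orders %>%
  left_join(
    segments,
    by = join_by(customer_ID, between(order_date, valid_from, valid_until))
  ) %>%
  select(customer_ID, order_date, segment)

result
```

Whether this is faster than the full join plus filter on a given dataset would need to be measured; the sketch only shows the shape of the call.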
Answer 2: (score: 0)
Here is how I would approach the problem:
Make sure the dates are stored as actual dates (Date vectors):
Customer_Orders <- data.table(
customer_ID = c("A", "A"),
order_date = as.Date(c("2017-06-30", "2019-07-30"))
)
Customer_Segmentation <- data.table(
customer_ID = c("A", "A", "A"),
segment = c("1", "2", "3"),
valid_from = as.Date(c("2017-01-01", "2018-01-01", "2019-01-01")),
valid_until = as.Date(c("2017-12-31", "2018-12-31", "2019-12-31"))
)
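Adding a single column from the B table to the original A table is relatively simple when using the A[B] syntax supported by data.table, by using the i. prefix to reference columns in B. The rest is just the on statement, which can be defined as a list using the .() notation in data.table, with any number of conditions.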
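The description above can be sketched as a non-equi update join. This is a minimal sketch under the assumption that the date columns are Date vectors, as in the tables defined in this answer; it is one possible reading of the approach, not necessarily the exact code the answer had in mind.

```r
library(data.table)

# Same tables as defined above, repeated so the sketch runs on its own
Customer_Orders <- data.table(
  customer_ID = c("A", "A"),
  order_date  = as.Date(c("2017-06-30", "2019-07-30"))
)
Customer_Segmentation <- data.table(
  customer_ID = c("A", "A", "A"),
  segment     = c("1", "2", "3"),
  valid_from  = as.Date(c("2017-01-01", "2018-01-01", "2019-01-01")),
  valid_until = as.Date(c("2017-12-31", "2018-12-31", "2019-12-31"))
)

# A[B] with A = Customer_Orders and B = Customer_Segmentation.
# In on = .(), the left side of each condition is a column of A and
# the right side is a column of B; i.segment refers to B's segment.
Customer_Orders[
  Customer_Segmentation,
  segment := i.segment,
  on = .(customer_ID, order_date >= valid_from, order_date <= valid_until)
]

Customer_Orders
```

Because `:=` updates Customer_Orders by reference, no intermediate table with every order and segment combination is materialized, which is the part of the original approach the question found slow.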