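Lookup in R with data.table using an "IF" condition

Asked: 2019-01-07 22:57:55

Tags: r join data.table

I have two data tables: one with customer orders (showing the customer ID and the date each order was placed) and one with customer segmentation (showing which segment a customer was assigned to during a given time period).

I would like to add the segment from table 2) as a new variable in table 1), but of course only the segment the customer belonged to at the time of the order.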

library(data.table)

Customer_Orders <- data.table(
 customer_ID = c("A", "A"),
 order_date = c("2017-06-30", "2019-07-30")
)
head(Customer_Orders)
  customer_ID order_date
1:           A 2017-06-30
2:           A 2019-07-30


Customer_Segmentation <- data.table(
 customer_ID = c("A", "A", "A"),
 segment = c("1", "2", "3"),
 valid_from = c("2017-01-01", "2018-01-01", "2019-01-01"),
 valid_until = c("2017-12-31", "2018-12-31", "2019-12-31")
)
head(Customer_Segmentation)
   customer_ID segment valid_from valid_until
1:           A       1  2017-01-01 2017-12-31
2:           A       2  2018-01-01 2018-12-31
3:           A       3  2019-01-01 2019-12-31

This is the result I am looking for, built by hand:

Result <- data.table(
 customer_ID = c("A", "A"),
 order_date = c("2017-06-30", "2019-07-30"),
 segment = c(1, 3)
)
head(Result)
   customer_ID order_date segment
1:           A 2017-06-30       1
2:           A 2019-07-30       3

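My current solution is a right join that essentially attaches every possible segment to each row of the customer orders table, after which I drop all rows where the order date does not fall within the segment's validity period. Because my dataset is very large, this is a slow and cumbersome solution.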
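For reference, the join-then-filter approach described above can be sketched roughly as follows. This is only an illustration under assumptions: the names `all_pairs` and `result` are hypothetical and not from the original post, and the dates are kept as ISO-formatted strings, which compare correctly as text.

```r
library(data.table)

Customer_Orders <- data.table(
  customer_ID = c("A", "A"),
  order_date = c("2017-06-30", "2019-07-30")
)
Customer_Segmentation <- data.table(
  customer_ID = c("A", "A", "A"),
  segment = c("1", "2", "3"),
  valid_from = c("2017-01-01", "2018-01-01", "2019-01-01"),
  valid_until = c("2017-12-31", "2018-12-31", "2019-12-31")
)

# Step 1: pair every order with every segment row of the same customer.
# allow.cartesian = TRUE is needed because the result has more rows than the inputs.
all_pairs <- merge(Customer_Orders, Customer_Segmentation,
                   by = "customer_ID", allow.cartesian = TRUE)

# Step 2: keep only the pairs where the order date lies inside the validity period.
result <- all_pairs[order_date >= valid_from & order_date <= valid_until,
                    .(customer_ID, order_date, segment)]
setorder(result, order_date)
```

The intermediate table grows with the number of orders times the number of segment rows per customer, which is why this pattern becomes slow on large data.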

3 Answers:

Answer 0: (score: 3)

The easiest way is probably to use the sqldf package:

library(sqldf)
sqldf("select * from Customer_Orders
               left join Customer_Segmentation
               on order_date between valid_from and valid_until
               and Customer_Orders.customer_ID = Customer_Segmentation.customer_ID")


# customer_ID order_date customer_ID..3 segment valid_from valid_until
# 1           A 2017-06-30              A       1 2017-01-01  2017-12-31
# 2           A 2019-07-30              A       3 2019-01-01  2019-12-31

It simply joins the tables wherever the order date falls within the given validity period.

But if you insist on using data.table, see below:

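# Key both tables so the join matches on customer_ID and then rolls on the date
setkey(Customer_Segmentation, customer_ID, valid_from)
setkey(Customer_Orders, customer_ID, order_date)

# roll = TRUE matches each order to the latest valid_from at or before order_date
ans <- Customer_Segmentation[Customer_Orders,
                             list(.valid_from = valid_from, valid_until, order_date, segment),
                             by = .EACHI, roll = TRUE][, `:=`(.valid_from = NULL)]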

 ans


# customer_ID valid_from valid_until order_date segment
# 1:           A 2017-06-30  2017-12-31 2017-06-30       1
# 2:           A 2019-07-30  2019-12-31 2019-07-30       3

The extra columns are easy to remove if you do not need them.
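As a rough sketch of that cleanup, the variant below runs a rolling join with `on =` instead of keys and then drops and renames columns. It assumes the date columns are stored as `Date` rather than character, which differs from the tables in the question. Note that a rolling join only looks at `valid_from`; it does not check `valid_until`, so it relies on the validity periods being contiguous.

```r
library(data.table)

Customer_Orders <- data.table(
  customer_ID = c("A", "A"),
  order_date = as.Date(c("2017-06-30", "2019-07-30"))
)
Customer_Segmentation <- data.table(
  customer_ID = c("A", "A", "A"),
  segment = c("1", "2", "3"),
  valid_from = as.Date(c("2017-01-01", "2018-01-01", "2019-01-01")),
  valid_until = as.Date(c("2017-12-31", "2018-12-31", "2019-12-31"))
)

# Rolling join: for each order, pick the segment row with the latest
# valid_from that is not after order_date.
ans <- Customer_Segmentation[Customer_Orders,
                             on = .(customer_ID, valid_from = order_date),
                             roll = TRUE]

# In the result, the valid_from column holds the order dates, so rename it
# and drop the column that is no longer needed.
setnames(ans, "valid_from", "order_date")
ans[, valid_until := NULL]
```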

Answer 1: (score: 0)

How about this?

Your data (fixed):

library(tidyverse)
library(lubridate)

Customer_Orders <- tibble(
  customer_ID = c("A", "A"),
  order_date = c("2017-06-30", "2019-07-30"))

Customer_Segmentation <- tibble(
  customer_ID = c("A", "A", "A"),
  segment = c("1", "2", "3"),
  valid_from = c("2017-01-01", "2018-01-01", "2019-01-01"),
  valid_until = c("2017-12-31", "2018-12-31", "2019-12-31"))

Code: the first two tables only serve to create dates from the initial tables using lubridate. The next one joins everything.

Customer_Orders2 <- Customer_Orders %>% 
  mutate(order_date = ymd(order_date))

Customer_Segmentation2 <- Customer_Segmentation %>% 
  mutate(valid_from = ymd(valid_from)) %>% 
  mutate(valid_until = ymd(valid_until))

Customer_Orders_join <- full_join(Customer_Orders2, Customer_Segmentation2, by = "customer_ID")

This selects the segment based on the interval:

Customer_Orders3 <- Customer_Orders_join %>% 
  filter(order_date %within% interval(valid_from, valid_until))

This produces one row per order, each carrying the segment that was valid on the order date.

Answer 2: (score: 0)

Here is how I would approach the problem:

Data generation (with the dates defined as proper Date vectors)

Customer_Orders <- data.table(
  customer_ID = c("A", "A"),
  order_date = as.Date(c("2017-06-30", "2019-07-30"))
)


Customer_Segmentation <- data.table(
  customer_ID = c("A", "A", "A"),
  segment = c("1", "2", "3"),
  valid_from =  as.Date(c("2017-01-01", "2018-01-01", "2019-01-01")),
  valid_until =  as.Date(c("2017-12-31", "2018-12-31", "2019-12-31"))
)

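Non-equi update join to add the segment

When using the A[B] syntax supported by data.table, it is relatively simple to add a single column from table B to the original table A by using the i. prefix to reference columns in B. The rest is just the on statement, which can be defined as a list using the .() notation in data.table, with any number of conditions.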
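A minimal sketch of such a non-equi update join, using the two tables from this answer, might look like the following. Here `Customer_Orders` plays the role of A and `Customer_Segmentation` the role of B; the exact call is an illustration of the technique described above rather than a quote from the answer.

```r
library(data.table)

Customer_Orders <- data.table(
  customer_ID = c("A", "A"),
  order_date = as.Date(c("2017-06-30", "2019-07-30"))
)
Customer_Segmentation <- data.table(
  customer_ID = c("A", "A", "A"),
  segment = c("1", "2", "3"),
  valid_from = as.Date(c("2017-01-01", "2018-01-01", "2019-01-01")),
  valid_until = as.Date(c("2017-12-31", "2018-12-31", "2019-12-31"))
)

# Update join: add `segment` to Customer_Orders by reference.
# i.segment refers to the segment column of the table inside the brackets.
# The on clause combines an equality condition with two inequality conditions.
Customer_Orders[Customer_Segmentation,
                segment := i.segment,
                on = .(customer_ID,
                       order_date >= valid_from,
                       order_date <= valid_until)]
```

Because the column is assigned by reference with `:=`, no intermediate cross-joined table is built, which is what makes this attractive for large data. Orders that fall outside every validity period simply keep `NA` in `segment`.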
