在下面的示例中,我有一个用户表和一个事务表,其中一个用户可以拥有0,1或更多事务。我在users表上使用mult='first'
执行join + update,尝试插入一个列,指示每个用户首次发生事务的日期。
library(data.table) # v1.10.4
# Download data
users <- fread("https://raw.githubusercontent.com/ben519/DataWrangling/master/Data/users.csv")
transactions <- transactions <- fread("https://raw.githubusercontent.com/ben519/DataWrangling/master/Data/transactions.csv")
# Convert date columns to Date type
users[, `:=`(Registered = as.Date(Registered), Cancelled = as.Date(Cancelled))]
transactions[, TransactionDate := as.Date(TransactionDate)]
users
UserID User Gender Registered Cancelled FirstTransactionDate
1: 1 Charles male 2012-12-21 <NA> 2012-08-26
2: 2 Pedro male 2010-08-01 2010-08-08 2013-12-23
3: 3 Caroline female 2012-10-23 2016-06-07 2016-05-08
4: 4 Brielle female 2013-07-17 <NA> <NA>
5: 5 Benjamin male 2010-11-25 <NA> <NA>
transactions
TransactionID TransactionDate UserID ProductID Quantity
1: 1 2010-08-21 7 2 1
2: 2 2011-05-26 3 4 1
3: 3 2011-06-16 3 3 1
4: 4 2012-08-26 1 2 3
5: 5 2013-06-06 2 4 1
6: 6 2013-12-23 2 5 6
7: 7 2013-12-30 3 4 1
8: 8 2014-04-24 NA 2 3
9: 9 2015-04-24 7 4 3
10: 10 2016-05-08 3 4 4
##### For each user, insert the TransactionDate of the first matching row
users[transactions, FirstTransactionDate := i.TransactionDate, on="UserID", mult="first"]
# Unexpected result
users[UserID == 2]
UserID User Gender Registered Cancelled FirstTransactionDate
1: 2 Pedro male 2010-08-01 2010-08-08 2013-12-23 # <- shouldn't this be 2013-06-06?
为什么当事务表中的早期事务与该用户绑定时,为用户2设置了FirstTransactionDate 2013-12-23
?这是一个错误吗?
答案 0 :(得分:4)
更仔细地阅读data.table
mult
的文档,它说:
当我是一个列表(或data.frame或data.table)和x中的多行时 匹配i中的行,返回多个控件:&#34; all&#34; (默认),&#34;第一个&#34;或者&#34;最后&#34;。
因此,如果X中有多个行(users
)与i(transactions
)匹配,则mult
将返回X中的第一行。但是,在您的情况下, X中没有与i匹配的多行,而是i中有多行与X匹配。
正如@Arun建议的那样,最好的选择是改变您的身份,以便mult = "first"
相关:
users[, FirstTransactionDate := transactions[users, TransactionDate, on="UserID", mult = "first"]]
users
# UserID User Gender Registered Cancelled FirstTransactionDate
#1: 1 Charles male 2012-12-21 <NA> 2012-08-26
#2: 2 Pedro male 2010-08-01 2010-08-08 2013-06-06
#3: 3 Caroline female 2012-10-23 2016-06-07 2011-05-26
#4: 4 Brielle female 2013-07-17 <NA> <NA>
#5: 5 Benjamin male 2010-11-25 <NA> <NA>
另一种选择是略微改变你的合并:
users[transactions[,FirstTransactionDate := min(TransactionDate), by = UserID],
FirstTransactionDate := FirstTransactionDate, on="UserID"]
我只是在transactions
数据集中创建第一个交易日期。这会多次合并,但应该没问题,因为UserID
的值总是相同。