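data.table join + update with mult = 'first' gives an unexpected result

Asked: 2017-05-02 21:31:18

Tags: r data.table

In the example below I have a users table and a transactions table, where a user can have zero, one, or more transactions. I run a join + update on the users table with mult = 'first', trying to insert a column that gives the date of each user's first transaction.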

library(data.table)  # v1.10.4

# Download data
users <- fread("https://raw.githubusercontent.com/ben519/DataWrangling/master/Data/users.csv")
transactions <- fread("https://raw.githubusercontent.com/ben519/DataWrangling/master/Data/transactions.csv")

# Convert date columns to Date type
users[, `:=`(Registered = as.Date(Registered), Cancelled = as.Date(Cancelled))]
transactions[, TransactionDate := as.Date(TransactionDate)]

users
   UserID     User Gender Registered  Cancelled FirstTransactionDate
1:      1  Charles   male 2012-12-21       <NA>           2012-08-26
2:      2    Pedro   male 2010-08-01 2010-08-08           2013-12-23
3:      3 Caroline female 2012-10-23 2016-06-07           2016-05-08
4:      4  Brielle female 2013-07-17       <NA>                 <NA>
5:      5 Benjamin   male 2010-11-25       <NA>                 <NA>

transactions
    TransactionID TransactionDate UserID ProductID Quantity
 1:             1      2010-08-21      7         2        1
 2:             2      2011-05-26      3         4        1
 3:             3      2011-06-16      3         3        1
 4:             4      2012-08-26      1         2        3
 5:             5      2013-06-06      2         4        1
 6:             6      2013-12-23      2         5        6
 7:             7      2013-12-30      3         4        1
 8:             8      2014-04-24     NA         2        3
 9:             9      2015-04-24      7         4        3
10:            10      2016-05-08      3         4        4

##### For each user, insert the TransactionDate of the first matching row
users[transactions, FirstTransactionDate := i.TransactionDate, on="UserID", mult="first"]

# Unexpected result
users[UserID == 2]
   UserID  User Gender Registered  Cancelled FirstTransactionDate
1:      2 Pedro   male 2010-08-01 2010-08-08           2013-12-23  # <- shouldn't this be 2013-06-06?

Why is FirstTransactionDate set to 2013-12-23 for user 2 when the transactions table contains an earlier transaction tied to that user? Is this a bug?

1 answer:

Answer 0 (score: 4)

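Reading the data.table documentation for mult more carefully, it says:

  When i is a list (or data.frame or data.table) and multiple rows in x match to the row in i, mult controls which are returned: "all" (default), "first" or "last".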

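So if multiple rows in x (users) match a row in i (transactions), mult returns the first of those rows in x. In your case, however, there are not multiple rows in x matching a row in i; instead, multiple rows in i match the same row in x, so mult has nothing to choose between. With := in an update join, each matching row of i is then assigned in turn, so a later match overwrites an earlier one, which appears to be why user 2 ends up with 2013-12-23.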
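To make the direction of mult concrete, here is a minimal sketch on toy data. The tables u and tx and their columns are hypothetical stand-ins for users and transactions, not the question's data. It runs the same update join as in the question, then the reversed lookup, where several rows of the table in the x position match each lookup row and mult = "first" therefore has something to select from.

```r
library(data.table)

# Hypothetical toy tables: u plays the role of users, tx of transactions
u  <- data.table(id = c(1L, 2L), name = c("a", "b"))
tx <- data.table(id = c(2L, 2L, 1L), val = c(10, 20, 30))

# Update join as in the question: each row of tx matches exactly one row of u,
# so mult = "first" does not choose between the two tx rows with id == 2
u[tx, val_update := i.val, on = "id", mult = "first"]

# Reversed lookup: tx is now x, and two of its rows match id == 2,
# so mult = "first" keeps only the first of them
u[, val_lookup := tx[u, val, on = "id", mult = "first"]]

u[, .(id, val_update, val_lookup)]
```

Comparing val_update and val_lookup for id 2 shows the difference between the two forms: only the second one is guaranteed to return the value from the first matching row of tx.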

As @Arun suggested, the best option is to swap the roles of the two tables in the join so that mult = "first" becomes relevant:

users[, FirstTransactionDate := transactions[users, TransactionDate, on="UserID", mult = "first"]]

users
#   UserID     User Gender Registered  Cancelled FirstTransactionDate
#1:      1  Charles   male 2012-12-21       <NA>           2012-08-26
#2:      2    Pedro   male 2010-08-01 2010-08-08           2013-06-06
#3:      3 Caroline female 2012-10-23 2016-06-07           2011-05-26
#4:      4  Brielle female 2013-07-17       <NA>                 <NA>
#5:      5 Benjamin   male 2010-11-25       <NA>                 <NA>

Another option is to change your merge slightly:

# The i. prefix takes the value from transactions even if users already has this column
users[transactions[, FirstTransactionDate := min(TransactionDate), by = UserID],
      FirstTransactionDate := i.FirstTransactionDate, on = "UserID"]

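Here I simply create the first transaction date inside the transactions dataset. The merge then assigns to the same user several times, but that should be fine, because the value is always the same for a given UserID.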
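If you would rather not add a column to transactions by reference, a variation is to aggregate into a separate one-row-per-user table first and join that. The following is only a sketch on hypothetical toy data (users_demo, tx_demo, and first_tx are names made up for illustration), not the question's downloaded files.

```r
library(data.table)

# Hypothetical toy data shaped like the question's tables
users_demo <- data.table(UserID = 1:3, User = c("Charles", "Pedro", "Caroline"))
tx_demo <- data.table(
  UserID = c(3L, 3L, 1L, 2L, 2L, NA),
  TransactionDate = as.Date(c("2011-05-26", "2011-06-16", "2012-08-26",
                              "2013-06-06", "2013-12-23", "2014-04-24"))
)

# One row per user holding the earliest transaction date;
# transactions without a UserID are dropped before aggregating
first_tx <- tx_demo[!is.na(UserID),
                    .(FirstTransactionDate = min(TransactionDate)),
                    by = UserID]

# Each row of users_demo now has at most one match, so the update is unambiguous
users_demo[first_tx, FirstTransactionDate := i.FirstTransactionDate, on = "UserID"]

users_demo
```

Because min() is used rather than row position, this version does not depend on transactions being sorted by date.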