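Update a data frame from another data frame

Time: 2015-08-05 03:36:50

Tags: r join dataframe

I have two tables: "transactions" with more than 500M rows and "customers" with more than 3M rows.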

data <- data.frame(Trans = c(1,2,3,4,5), Cust01 = c("A","B","C","D","F"),
                   Cust02 = c("S","E","","TE","F"), Cust03 = c("F","","D","","F"))

cust_type <-data.frame(Cust = c("A","B","C","D"), Type = c("1","2","3","4"))

dataresult <- data.frame(Trans = c(1,2,3,4,5),
                         Cust01 = c("A","B","C","D","F"), 
                         Cust01Type = c("1","2","3","4","5"),
                         Cust02 = c("S","E","","TE","F"), 
                         Cust02Type = c("","","","",""),
                         Cust03 = c("F","","D","","F"),
                         Cust03Type = c("","","4","",""))

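I want to add the customer type to data in an efficient way. In SQL I would normally use several left joins. I tried dplyr, but it takes forever. I also tried using %in% to get a logical vector and then looping over the TRUE values. Does anyone know a better way?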

1 Answer:

Answer 0: (score: 1)

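When you need speed, the data.table package is hard to beat. Your transaction data is currently in wide format (one column per customer slot), so the first step is to convert it to long format (one row per transaction and customer slot). That makes it much easier to work with, because a single join then covers every customer column.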

library(data.table) # v1.9.5
trans_data <- melt(setDT(data), id.vars = "Trans",
                   variable.name = "Cust",               # name of the variable column
                   variable.factor = TRUE,               # keep it as a factor rather than a character column
                   value.name = "Cvalue")[Cvalue != ""]  # name of the value column; drop empty entries

Once that is done, you can join the two data tables:

# set the keys by which you are joining
setDT(trans_data, key = "Cvalue")
setDT(cust_type, key = "Cust")

# join the customer type into the transaction data
trans_data[cust_type, Ctype:=Type]

This gives:

> trans_data
    Trans   Cust Cvalue Ctype
 1:     1 Cust01      A     1
 2:     2 Cust01      B     2
 3:     3 Cust01      C     3
 4:     4 Cust01      D     4
 5:     3 Cust03      D     4
 6:     2 Cust02      E    NA
 7:     5 Cust01      F    NA
 8:     5 Cust02      F    NA
 9:     1 Cust03      F    NA
10:     5 Cust03      F    NA
11:     1 Cust02      S    NA
12:     4 Cust02     TE    NA
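
The join above is an update join: rows of the transaction table that match the lookup table get a new column by reference, and unmatched rows stay NA. As a minimal sketch of the same idea on toy data, the snippet below uses the `on =` argument instead of keys and the `i.` prefix to make clear which table `Type` comes from. This assumes a data.table release that supports `on =` (1.9.6 or later, as far as I know); the names `trans_long` and `cust_lookup` are made up for illustration.

```r
library(data.table)

# Toy long-format transactions with the same column names as above
# (trans_long is a hypothetical name used only in this sketch)
trans_long <- data.table(Trans  = c(1, 2, 3, 5),
                         Cvalue = c("A", "B", "D", "F"))

# Toy lookup table (cust_lookup is also a hypothetical name)
cust_lookup <- data.table(Cust = c("A", "B", "C", "D"),
                          Type = c("1", "2", "3", "4"))

# Update join: match trans_long$Cvalue against cust_lookup$Cust and
# add Ctype by reference; rows without a match keep NA
trans_long[cust_lookup, on = c(Cvalue = "Cust"), Ctype := i.Type]

print(trans_long)
```

Because no key is set here, the row order of `trans_long` is left as it was, which may or may not matter for your use case.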

If you want to change the row order of the resulting data.table, you can use, for example:

setorder(trans_data, Trans, Cust)

Or do the join and the ordering in one go:

trans_data <- trans_data[cust_type, Ctype:=Type][order(Trans,Cust)]

which gives:

> trans_data
    Trans   Cust Cvalue Ctype
 1:     1 Cust01      A     1
 2:     1 Cust02      S    NA
 3:     1 Cust03      F    NA
 4:     2 Cust01      B     2
 5:     2 Cust02      E    NA
 6:     3 Cust01      C     3
 7:     3 Cust03      D     4
 8:     4 Cust01      D     4
 9:     4 Cust02     TE    NA
10:     5 Cust01      F    NA
11:     5 Cust02      F    NA
12:     5 Cust03      F    NA

Note: I used the development version of data.table, for which you no longer need to load the reshape2 package to get the melt function.
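
If the wide layout from the question is needed in the end, one possible way is to cast the long result back with `dcast`. The following is only a sketch on toy data, assuming a data.table release whose `dcast` accepts several `value.var` columns at once (1.9.6 or later, as far as I know); `long_dt` and `wide_dt` are hypothetical names, and the resulting columns are named like `Cvalue_Cust01` and `Ctype_Cust01` rather than `Cust01` and `Cust01Type`.

```r
library(data.table)

# Toy long-format result with the same columns as trans_data above
# (long_dt is a hypothetical name used only in this sketch)
long_dt <- data.table(
  Trans  = c(1, 1, 2, 3, 3),
  Cust   = factor(c("Cust01", "Cust02", "Cust01", "Cust01", "Cust03")),
  Cvalue = c("A", "S", "B", "C", "D"),
  Ctype  = c("1", NA, "2", "3", "4")
)

# Cast back to one row per transaction; each customer slot gets a
# value column and a type column, and missing combinations become NA
wide_dt <- dcast(long_dt, Trans ~ Cust, value.var = c("Cvalue", "Ctype"))

print(wide_dt)
```

On a table with hundreds of millions of rows this step needs enough memory for the wide copy, so it may be worth staying in long format if later processing allows it.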