I have 2 tables: "transactions" with more than 500M rows and "customers" with more than 3M rows.
data <- data.frame(Trans = c(1,2,3,4,5), Cust01 = c("A","B","C","D","F"),
Cust02 = c("S","E","","TE","F"), Cust03 = c("F","","D","","F"))
cust_type <-data.frame(Cust = c("A","B","C","D"), Type = c("1","2","3","4"))
# desired result
dataresult <- data.frame(Trans = c(1,2,3,4,5),
Cust01 = c("A","B","C","D","F"),
Cust01Type = c("1","2","3","4","5"),
Cust02 = c("S","E","","TE","F"),
Cust02Type = c("","","","",""),
Cust03 = c("F","","D","","F"),
Cust03Type = c("","","4","",""))
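I want to add the customer type to the data in an efficient way. In SQL I would normally use multiple left joins. I tried dplyr, but it takes forever. I also tried using %in% to get a logical vector and then looping over the TRUE values.
Does anyone know a better way?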
Answer 0: (score: 1)
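When you want fast performance, it is hard to beat the data.table package. Because your transaction data is currently in wide format, the first step is to convert it to long format, which makes it easier to work with.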
library(data.table) #v1.9.5
trans_data <- melt(setDT(data), id.vars = "Trans",
variable.name = "Cust", # set name variable column
variable.factor = TRUE, # set as a factor variable instead of a character variable
value.name = "Cvalue")[!Cvalue==""] # set name value column & remove empty cases
Once that is done, you can join the two data tables:
# set the keys by which you are joining
setDT(trans_data, key = "Cvalue")
setDT(cust_type, key = "Cust")
# join the customer type into the transaction data
trans_data[cust_type, Ctype:=Type]
This gives:
> trans_data
Trans Cust Cvalue Ctype
1: 1 Cust01 A 1
2: 2 Cust01 B 2
3: 3 Cust01 C 3
4: 4 Cust01 D 4
5: 3 Cust03 D 4
6: 2 Cust02 E NA
7: 5 Cust01 F NA
8: 5 Cust02 F NA
9: 1 Cust03 F NA
10: 5 Cust03 F NA
11: 1 Cust02 S NA
12: 4 Cust02 TE NA
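The same update join can also be written without setting keys first. The following is a minimal sketch, assuming data.table 1.9.6 or later (where the `on =` argument is available); the names `trans_long` and `types` are hypothetical stand-ins for the melted transactions and the customer lookup table, built here on toy data.

```r
library(data.table)

# Toy long-format data mirroring the melted transactions (hypothetical names)
trans_long <- data.table(Trans  = c(1, 2, 3, 3, 5),
                         Cust   = c("Cust01", "Cust01", "Cust01", "Cust03", "Cust01"),
                         Cvalue = c("A", "B", "C", "D", "F"))
types <- data.table(Cust = c("A", "B", "C", "D"),
                    Type = c("1", "2", "3", "4"))

# Update join by reference: match trans_long$Cvalue against types$Cust.
# The i. prefix makes explicit that Type comes from the lookup table.
# Customers without a match (here "F") get NA.
trans_long[types, on = c(Cvalue = "Cust"), Ctype := i.Type]
```

Because no key is set, the row order of `trans_long` is left unchanged, and the new column is added in place without copying the table, which matters at the 500M-row scale mentioned in the question.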
If you want to change the order of the resulting data.table, you can use for example:
setorder(trans_data, Trans, Cust)
Or do both in one go:
trans_data <- trans_data[cust_type, Ctype:=Type][order(Trans,Cust)]
which gives:
> trans_data
Trans Cust Cvalue Ctype
1: 1 Cust01 A 1
2: 1 Cust02 S NA
3: 1 Cust03 F NA
4: 2 Cust01 B 2
5: 2 Cust02 E NA
6: 3 Cust01 C 3
7: 3 Cust03 D 4
8: 4 Cust01 D 4
9: 4 Cust02 TE NA
10: 5 Cust01 F NA
11: 5 Cust02 F NA
12: 5 Cust03 F NA
Note: I used the development version of data.table, which no longer requires loading the reshape2 package for the melt function.
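If the wide layout from the question is needed in the end, the long result can be cast back. This is only a sketch, assuming data.table 1.9.6 or later (where `dcast` accepts several `value.var` columns); `long` is a hypothetical subset of the joined result, and the resulting column names (`Cvalue_Cust01`, `Ctype_Cust01`, ...) differ from the `Cust01` / `Cust01Type` names in the question.

```r
library(data.table)

# Hypothetical subset of the long-format result after the join
long <- data.table(Trans  = c(1, 1, 3, 3),
                   Cust   = c("Cust01", "Cust02", "Cust01", "Cust03"),
                   Cvalue = c("A", "S", "C", "D"),
                   Ctype  = c("1", NA, "3", "4"))

# One row per transaction; one Cvalue_* and one Ctype_* column per customer slot.
# Slots that were empty in the original data come back as NA.
wide <- dcast(long, Trans ~ Cust, value.var = c("Cvalue", "Ctype"))
```

The columns can afterwards be renamed with `setnames()` if the exact names from the question are required.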