I have a main data frame containing product information such as ID, description, and category (along with many other variables).
main.df <- structure(list(product.ID = 1:10,
description = c("abc...", "bcd...", "def...", "efg...", "fgh...",
"ghi...", "hij...", "ijk...", "jkl...", "klm..."),
category = c("a", "b", "c", "d", "e", "a", "b", "c", "d", "e")),
.Names = c("product.ID", "description", "category"),
row.names = c(NA, -10L), class = "data.frame")
Then I have a second data frame that lists which product class each category belongs to:
classes.df <- structure(list(category = c("a", "b", "c", "d", "e"),
classe = c("aaa", "bbb", "aaa", "ccc", "bbb")),
.Names = c("category", "classe"),
row.names = c(NA, -5L),
class = "data.frame")
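The "category" variable is what links the two data frames.
I need to add a variable to main.df giving the product class each row belongs to, but I don't know how.
How should I do this, given that my actual main.df has 4.5 million rows spanning more than 90,000 categories, and my actual classes.df has 90,000+ rows mapping to 120 classes? Thanks.
The structure of main.df is: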
Classes ‘data.table’ and 'data.frame': 250000 obs. of 16 variables:
$ ID : int 4722 6988 9184 13224 13511 15938 19244 21162 23294 23793 ...
$ dataset : Factor w/ 2 levels "BA", "RB",..: 1 1 1 1 1 1 1 1 1 1 ...
$ prodID : num 429 429 429 429 429 429 429 429 429 429 ...
$ ProdName : chr "aaa" "aaa" "bbb" "ccc" "eee" ...
$ manufacID : num 1 1 1 1 1 1 1 1 1 1 ...
$ time : num 1271636264 1062977828 1218368958 1305424000 1284596323 ...
$ serial : chr "BA1" "BA1" "RB1" "RB7" ...
- attr(*, "sorted")= chr "serial"
- attr(*, ".internal.selfref")=<externalptr>
The structure of classes.df is:
Classes ‘data.table’ and 'data.frame': 20565 obs. of 5 variables:
$ ID : int 652 1204 1252 1379 2334 2335 2336 2337 3186 3187 ...
$ mName : chr "XYZ" "EHD" "DLK" "TSH" ...
$ country: chr "Argentina" "USA" "UK" "Argentina" ...
$ serial : chr "RB7" "BA1" "RB97" "RB732" ...
- attr(*, ".internal.selfref")=<externalptr>
(For confidentiality reasons, I had to anonymize the names.)
Answer 0 (score: 1)
For larger datasets, try data.table:
library(data.table)
setkey(setDT(main.df), category)
setDT(classes.df)
main.df[classes.df][order(product.ID),]
# product.ID description category classe
#1: 1 abc... a aaa
#2: 2 bcd... b bbb
#3: 3 def... c aaa
#4: 4 efg... d ccc
#5: 5 fgh... e bbb
#6: 6 ghi... a aaa
#7: 7 hij... b bbb
#8: 8 ijk... c aaa
#9: 9 jkl... d ccc
#10: 10 klm... e bbb
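Note that main.df[classes.df] returns only the rows whose category appears in classes.df, and the result is then re-sorted by product.ID. As an alternative, here is a minimal sketch of a data.table update join that adds the column by reference, which may be preferable at 4.5 million rows because the table is not copied. The names main.dt and classes.dt are hypothetical stand-ins for the two tables, and the sketch assumes data.table 1.9.6 or later (for the on= argument) and that each category occurs only once in the lookup table.

```r
library(data.table)

# Toy data mirroring the question's example (main.dt / classes.dt are
# hypothetical names used only for this sketch).
main.dt <- data.table(
  product.ID  = 1:10,
  description = c("abc...", "bcd...", "def...", "efg...", "fgh...",
                  "ghi...", "hij...", "ijk...", "jkl...", "klm..."),
  category    = rep(c("a", "b", "c", "d", "e"), 2)
)
classes.dt <- data.table(
  category = c("a", "b", "c", "d", "e"),
  classe   = c("aaa", "bbb", "aaa", "ccc", "bbb")
)

# Update join: look up each row's category in classes.dt and write the
# matching classe into a new column of main.dt, by reference.
# Row order is kept; categories with no match get NA.
main.dt[classes.dt, on = "category", classe := i.classe]
```

After this call, main.dt has a fourth column, classe, and no re-ordering step is needed.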
Or use dplyr:
library(dplyr)
left_join(main.df, classes.df, by='category')
A base R option is merge (it will be slower):
merge(main.df, classes.df, by='category', all.x=TRUE)
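If the extra sorting that merge does is a concern, another base R possibility is a plain lookup with match. This is only a sketch under the assumption that each category occurs once in classes.df; with duplicates, match silently takes the first occurrence. The toy data below is a trimmed copy of the question's example.

```r
# Toy data mirroring the question's example (description column omitted).
main.df <- data.frame(
  product.ID = 1:10,
  category   = rep(c("a", "b", "c", "d", "e"), 2),
  stringsAsFactors = FALSE
)
classes.df <- data.frame(
  category = c("a", "b", "c", "d", "e"),
  classe   = c("aaa", "bbb", "aaa", "ccc", "bbb"),
  stringsAsFactors = FALSE
)

# For every row of main.df, find the position of its category in classes.df
# (NA where there is no match), then pull the class at that position.
idx <- match(main.df$category, classes.df$category)
main.df$classe <- classes.df$classe[idx]
```

Unlike merge, this keeps main.df in its original row order.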