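I'm having trouble applying a simple data.table join example to a larger (10GB) dataset. merge() works just fine on data.frames with the larger dataset, but I'd love to take advantage of the speed of data.table. Could anyone point out what I'm misunderstanding about data.table (and the error message in particular)?
Here is a simple example (derived from this thread: Join of two data.tables fails).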
library(data.table)

# The data of interest.
(DT <- data.table(id = c(rep(1154:1155, 2), 1160),
                  price = c(1.99, 2.50, 15.63, 15.00, 0.75),
                  key = "id"))
id price
1: 1154 1.99
2: 1154 15.63
3: 1155 2.50
4: 1155 15.00
5: 1160 0.75
# Lookup table.
(lookup <- data.table(id = 1153:1160,
                      version = c(1, 1, 3, 4, 2, 1, 1, 2),
                      yr = rep(2006, 8),  # one value per row; recent data.table versions only recycle length-1 vectors
                      key = "id"))
id version yr
1: 1153 1 2006
2: 1154 1 2006
3: 1155 3 2006
4: 1156 4 2006
5: 1157 2 2006
6: 1158 1 2006
7: 1159 1 2006
8: 1160 2 2006
# The desired table. Note: lookup[DT] works as well.
DT[lookup, allow.cartesian = TRUE, nomatch = 0]
id price version yr
1: 1154 1.99 1 2006
2: 1154 15.63 1 2006
3: 1155 2.50 3 2006
4: 1155 15.00 3 2006
5: 1160 0.75 2 2006
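For comparison, here is a minimal sketch showing that, on the toy tables, the keyed join and merge() produce the same five rows. It rebuilds the tables so it can be run on its own; the only change is that `id` is kept integer in both tables (`1160L`), which is an assumption made here to keep the key types identical.

```r
library(data.table)

# Same toy tables as above; 1160L keeps id integer on both sides (assumption).
DT <- data.table(id = c(rep(1154:1155, 2), 1160L),
                 price = c(1.99, 2.50, 15.63, 15.00, 0.75),
                 key = "id")
lookup <- data.table(id = 1153:1160,
                     version = c(1, 1, 3, 4, 2, 1, 1, 2),
                     yr = rep(2006, 8),
                     key = "id")

# Inner join via the keyed [ ] syntax: rows of DT matched by rows of lookup.
joined <- DT[lookup, nomatch = 0]

# Inner join via merge(), which dispatches to the data.table method.
merged <- merge(DT, lookup, by = "id")

nrow(joined)
nrow(merged)
```

Both results should have one row per row of DT, because every id in lookup is unique.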
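The larger dataset consists of two data.frames: temp.3561 (the dataset of interest) and temp.versions (the lookup dataset). They have the same structure as DT and lookup (above), respectively. Using merge() works fine, but my application of data.table is clearly flawed: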
# Merge data.frames: works just fine
long.merged <- merge(temp.versions, temp.3561, by = "id")
# Convert the data.frames to data.tables
DTtemp.3561 <- as.data.table(temp.3561)
DTtemp.versions <- as.data.table(temp.versions)
# Merge the data.tables: doesn't work
setkey(DTtemp.3561, id)
setkey(DTtemp.versions, id)
DTlong.merged <- merge(DTtemp.versions, DTtemp.3561, by = "id")
Error in vecseq(f__, len__, if (allow.cartesian) NULL else as.integer(max(nrow(x), :
Join results in 11277332 rows; more than 7946667 = max(nrow(x),nrow(i)). Check for duplicate
key values in i, each of which join to the same group in x over and over again. If that's ok,
try including `j` and dropping `by` (by-without-by) so that j runs for each group to avoid the
large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE.
Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and datatable-
help for advice.
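As the message itself suggests, this check is aimed at key values that are duplicated on both sides of the join. A minimal sketch that should trigger the same kind of error on hypothetical toy tables (`x` and `i` are made up for illustration and are not the real data):

```r
library(data.table)

# Hypothetical toy tables: the key value 1 is repeated 10 times on BOTH sides.
x <- data.table(id = rep(1L, 10), a = 1:10)
i <- data.table(id = rep(1L, 10), b = 1:10)

# Every row of i matches all 10 rows of x, so the join would have 100 rows.
# That exceeds the size data.table allows by default, so the join errors.
res <- tryCatch(x[i, on = "id"], error = function(e) e)
inherits(res, "error")

# Opting in explicitly lets the 100-row result through.
ok <- x[i, on = "id", allow.cartesian = TRUE]
nrow(ok)
```

If a join like this errors on real data, it is a hint that some key value occurs more than once in both tables, at least as the join sees them.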
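DTtemp.versions has the same structure as lookup (in the simple example), and its key "id" consists of 779,473 unique values (no duplicates).
DTtemp.3561 has the same structure as DT (in the simple example) plus a few other variables, but its key "id" has only 829 unique values despite 7,946,667 observations (lots of duplicates).
Since I'm only trying to add the version number and year from DTtemp.versions to each observation in DTtemp.3561, the merged data.table should have the same number of observations as DTtemp.3561 (7,946,667). Specifically, I don't understand why merge() generates "extra" observations when used on data.tables but not when used on data.frames.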
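One way to test that expectation is to look for duplicated lookup keys and to compute the join size from per-key counts. The sketch below uses hypothetical stand-ins (`big` for DTtemp.3561, `look` for DTtemp.versions) with a deliberately duplicated lookup key; it illustrates the check and is not a diagnosis of the real data.

```r
library(data.table)

# Hypothetical stand-ins for DTtemp.3561 (big) and DTtemp.versions (look).
big  <- data.table(id = c(1L, 1L, 1L, 2L, 2L), price = c(1, 2, 3, 4, 5))
look <- data.table(id = c(1L, 2L, 2L, 3L), version = c(1, 3, 4, 2))

# Lookup keys that occur more than once.
dup_ids <- look[, .N, by = id][N > 1, id]

# Rows an inner join produces: per key, (count in big) * (count in look).
cnt_big  <- big[, .(n_big = .N), by = id]
cnt_look <- look[, .(n_look = .N), by = id]
counts   <- merge(cnt_big, cnt_look, by = "id")
expected_rows <- counts[, sum(n_big * n_look)]

# Compare with the actual inner join.
actual_rows <- nrow(merge(big, look, by = "id", allow.cartesian = TRUE))
```

If the lookup keys really are unique, `dup_ids` is empty and `expected_rows` equals the number of rows in `big` that have a match.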
Likewise:
# Same error message, but with 12,055,777 observations
altDTlong.merged <- DTtemp.3561[DTtemp.versions]
# Same error message, but with 11,277,332 observations
alt2DTlong.merged <- DTtemp.versions[DTtemp.3561]
Including allow.cartesian = TRUE and nomatch = 0 doesn't drop the "extra" observations.
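For reference, a small sketch of what those two arguments do on hypothetical toy tables: `allow.cartesian = TRUE` only lifts the size check, and `nomatch = 0` only drops rows of `i` that have no match in `x`. Neither removes rows that come from duplicated keys.

```r
library(data.table)

# Hypothetical toy tables: id 1 is duplicated on both sides, id 9 has no match.
x <- data.table(id = c(1L, 1L, 2L), a = 1:3)
i <- data.table(id = c(1L, 1L, 9L), b = 1:3)

# Default nomatch: the unmatched i row (id 9) is kept, with NA in column a.
with_na <- x[i, on = "id", allow.cartesian = TRUE]

# nomatch = 0: the unmatched row is dropped; the duplicated matches remain.
inner <- x[i, on = "id", allow.cartesian = TRUE, nomatch = 0]

nrow(with_na)
nrow(inner)
```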
Oddly, if I truncate the dataset of interest to 10 observations, merge() works fine on both data.frames and data.tables.
# Merge short DF: works just fine
short.3561 <- temp.3561[-(11:7946667),]
short.merged <- merge(temp.versions, short.3561, by = "id")
# Merge short DT
DTshort.3561 <- data.table(short.3561, key = "id")
DTshort.merged <- merge(DTtemp.versions, DTshort.3561, by = "id")
I've gone through the FAQ (http://datatable.r-forge.r-project.org/datatable-faq.pdf, 1.12 in particular). How would you suggest thinking about this?