
Time: 2014-05-04 19:31:05

Tags: r merge data.table

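I ran into a problem when applying a simple data.table join example to a much larger (10 GB) data set. merge() works fine on data.frames built from the larger data set, but I would like to take advantage of data.table's speed. Can anyone point out what I am misunderstanding about data.table (and about the error message in particular)?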

Here is a simple example (derived from this post: Join of two data.tables fails).

library(data.table)

# The data of interest.
(DT <- data.table(id    = c(rep(1154:1155, 2), 1160),
                  price = c(1.99, 2.50, 15.63, 15.00, 0.75), 
                  key   = "id"))

     id price
1: 1154  1.99
2: 1154 15.63
3: 1155  2.50
4: 1155 15.00
5: 1160  0.75

# Lookup table.
(lookup <- data.table(id      = 1153:1160, 
                      version = c(1,1,3,4,2,1,1,2), 
                      yr      = rep(2006, 8), 
                      key     = "id"))

     id version   yr
1: 1153       1 2006
2: 1154       1 2006
3: 1155       3 2006
4: 1156       4 2006
5: 1157       2 2006
6: 1158       1 2006
7: 1159       1 2006
8: 1160       2 2006

# The desired table.  Note: lookup[DT] works as well.
DT[lookup, allow.cartesian = TRUE, nomatch = 0]

     id price version   yr
1: 1154  1.99       1 2006
2: 1154 15.63       1 2006
3: 1155  2.50       3 2006
4: 1155 15.00       3 2006
5: 1160  0.75       2 2006


# Merge data.frames: works just fine
long.merged         <- merge(temp.versions, temp.3561, by = "id")

# Convert the data.frames to data.tables
DTtemp.3561         <- as.data.table(temp.3561)
DTtemp.versions     <- as.data.table(temp.versions)

# Merge the data.tables: doesn't work
setkey(DTtemp.3561, id)
setkey(DTtemp.versions, id)
DTlong.merged       <- merge(DTtemp.versions, DTtemp.3561, by = "id")

Error in vecseq(f__, len__, if (allow.cartesian) NULL else as.integer(max(nrow(x),  : 
  Join results in 11277332 rows; more than 7946667 = max(nrow(x),nrow(i)). Check for duplicate 
key values in i, each of which join to the same group in x over and over again. If that's ok, 
try including `j` and dropping `by` (by-without-by) so that j runs for each group to avoid the 
large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. 
Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and datatable-
help for advice.
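The error above compares the size of the join result with max(nrow(x), nrow(i)), which suggests that some id values are repeated in both tables. A minimal sketch, using hypothetical toy tables (x, i, and the columns a and b are illustrative and not part of the original data), of how one might count rows per key and estimate the join size before running the join:

```r
library(data.table)

# Hypothetical toy tables: id 1 appears twice in x and three times in i.
x <- data.table(id = c(1L, 1L, 2L), a = c("x1", "x2", "x3"), key = "id")
i <- data.table(id = c(1L, 1L, 1L, 2L), b = c("i1", "i2", "i3", "i4"), key = "id")

# Number of rows per key in each table.
nx <- x[, .(n_x = .N), by = id]
ni <- i[, .(n_i = .N), by = id]

# Each shared id contributes n_x * n_i rows to an inner join.
counts <- merge(nx, ni, by = "id")
expected_rows <- counts[, sum(as.numeric(n_x) * n_i)]

# Keys duplicated in both tables are the ones that inflate the result.
dup_both <- counts[n_x > 1 & n_i > 1]

# The actual join, with the size guard lifted, for comparison.
joined <- x[i, allow.cartesian = TRUE, nomatch = 0L]
```

If expected_rows is larger than either input, the result is a many-to-many join, and dup_both lists the ids responsible.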





# Same error message, but with 12,055,777 observations
altDTlong.merged   <- DTtemp.3561[DTtemp.versions]

# Same error message, but with 11,277,332 observations
alt2DTlong.merged  <- DTtemp.versions[DTtemp.3561]

Including allow.cartesian = T and nomatch = 0 does not drop the "extra" observations.
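One possible reading of this behaviour: allow.cartesian = TRUE only lifts the size guard, and nomatch = 0 only drops rows of i that have no match in x, so neither removes rows produced by repeated keys. A minimal sketch, assuming the lookup table is meant to hold one row per id (the toy data and the name lookup_unique are hypothetical):

```r
library(data.table)

DT <- data.table(id    = c(1154L, 1154L, 1155L),
                 price = c(1.99, 15.63, 2.50),
                 key   = "id")

# Hypothetical lookup table with an accidental duplicate row for id 1154.
lookup <- data.table(id      = c(1154L, 1154L, 1155L),
                     version = c(1, 1, 3),
                     key     = "id")

# Every matching pair is returned: 2 * 2 rows for 1154 plus 1 * 1 for 1155.
many <- DT[lookup, allow.cartesian = TRUE, nomatch = 0L]

# Assumption: one row per id is intended, so keep the first row for each id.
lookup_unique <- unique(lookup, by = "id")

# With unique keys in i, each row of DT is matched at most once.
one <- DT[lookup_unique, nomatch = 0L]
```

Whether deduplicating is appropriate depends on whether the repeated ids in the real lookup table are errors or carry distinct information.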


# Merge short DF: works just fine
short.3561         <- temp.3561[-(11:7946667),]
short.merged       <- merge(temp.versions, short.3561, by = "id")

# Merge short DT
DTshort.3561       <- data.table(short.3561, key = "id")
DTshort.merged     <- merge(DTtemp.versions, DTshort.3561, by = "id")


0 answers:
