R data.table (<= 1.9.4) join behavior

Time: 2015-06-25 14:15:40

Tags: r join data.table

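I have come back to R and data.table, and after using them for a while I still run into problems with joins. I previously asked this question and got a satisfactory explanation, but I still don't get the logic. Let's look at some examples: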

library("data.table")
X <- data.table(chiave=c("a", "a", "a", "b", "b"),valore1=1:5)
Y <- data.table(chiave=c("a", "b", "c", "d"),valore2=1:4)
X
   chiave valore1
1:      a       1
2:      a       2
3:      a       3
4:      b       4
5:      b       5
 Y
   chiave valore2
1:      a       1
2:      b       2
3:      c       3
4:      d       4

When I join, I get an error:

 setkey(X,chiave)
 X[Y]
# Error in vecseq(f__, len__, if (allow.cartesian || notjoin) NULL else as.integer(max(nrow(x),  : 
  Join results in 7 rows; more than 5 = max(nrow(x),nrow(i)). Check for duplicate key values in i, each of which join to the same group in x over and over again. If that's ok, try including `j` and dropping `by` (by-without-by) so that j runs for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and datatable-help for advice.

So I rerun it with allow.cartesian:

 X[Y,allow.cartesian=T]
   chiave valore1 valore2
1:      a       1       1
2:      a       2       1
3:      a       3       1
4:      b       4       2
5:      b       5       2
6:      c      NA       3
7:      d      NA       4

Note that X has duplicate keys while i does not. If I change Y to:

 Y <- data.table(chiave=c("b", "c", "d"),valore2=1:3)
 Y
   chiave valore2
1:      b       1
2:      c       2
3:      d       3

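the join completes with no error message and without needing allow.cartesian, yet logically the situation is the same: X has duplicate keys while i does not.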

 X[Y]
   chiave valore1 valore2
1:      b       4       1
2:      b       5       1
3:      c      NA       2
4:      d      NA       3

On the other hand:

 X <- data.table(chiave=c("a", "a", "a", "a", "a", "a", "b", "b"),valore1=1:8)
 Y <- data.table(chiave=c("b", "b", "d"),valore2=1:3)
 X
   chiave valore1
1:      a       1
2:      a       2
3:      a       3
4:      a       4
5:      a       5
6:      a       6
7:      b       7
8:      b       8
 Y
   chiave valore2
1:      b       1
2:      b       2
3:      d       3

Here I have duplicate keys in both X and i, yet the join (a Cartesian product) completes with no error message and without needing allow.cartesian:

 setkey(X,chiave)
 X[Y]
   chiave valore1 valore2
1:      b       7       1
2:      b       8       1
3:      b       7       2
4:      b       8       2
5:      d      NA       3

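From my point of view, a warning is needed if and only if there are duplicate keys in both X and i (not merely because the result has more rows than max(nrow(x),nrow(i))), and only in that case should allow.cartesian be required (so not in my first two examples).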
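
To make the contrast concrete, here is a minimal base R sketch of the two criteria applied to the three examples above. The helpers join_rows, size_check and dup_both are hypothetical names introduced only for illustration; they are not data.table functions, and they assume a single key column with the default nomatch = NA and mult = "all".

```r
# Hypothetical helper, not part of data.table: number of rows X[Y] would
# return for a single key column (default nomatch = NA, mult = "all").
join_rows <- function(x_key, i_key) {
  # How many rows of x each row of i matches
  hits <- vapply(i_key, function(k) sum(x_key == k), integer(1))
  # A row of i with no match still contributes one NA-filled row
  sum(pmax(hits, 1L))
}

# Size-based check described by the error message in the question
size_check <- function(x_key, i_key) {
  join_rows(x_key, i_key) > max(length(x_key), length(i_key))
}

# Criterion proposed in the question: duplicate keys on both sides
dup_both <- function(x_key, i_key) {
  anyDuplicated(x_key) > 0 && anyDuplicated(i_key) > 0
}

x1 <- c("a", "a", "a", "b", "b")   # X in the first two examples
x3 <- c(rep("a", 6), "b", "b")     # X in the third example
i1 <- c("a", "b", "c", "d")
i2 <- c("b", "c", "d")
i3 <- c("b", "b", "d")

join_rows(x1, i1)   # 7 rows, more than max(5, 4)
join_rows(x1, i2)   # 4 rows
join_rows(x3, i3)   # 5 rows

size_check(x1, i1); dup_both(x1, i1)   # size check trips, no duplicates in i
size_check(x3, i3); dup_both(x3, i3)   # size check passes, duplicates on both sides
```

The size-based check fires only for the first example, where i has no duplicates, and stays silent for the third, which is the only one with duplicates on both sides.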

1 answer:

Answer 0 (score: 2)

Just to keep this answered: this behaviour with allow.cartesian has been fixed in the current development version, v1.9.5, and will soon be available on CRAN as v1.9.6. Odd-numbered versions are development releases and even-numbered ones are stable. From NEWS:

  1. allow.cartesian is ignored during joins when:

    • i has no duplicates and mult="all". Closes #742. Thanks to @nigmastar for the report.
    • assigning by reference, i.e., j has :=. Closes #800. Thanks to @matthieugomez for the report.

    In both these cases (and during a not-join which was already fixed in 1.9.4), allow.cartesian can be safely ignored.
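
As a rough illustration of the first bullet, the sketch below reruns the first example from the question. It assumes data.table is installed; the version branch and the i_unique variable are additions for illustration, not something taken from the NEWS entry.

```r
library(data.table)

X <- data.table(chiave = c("a", "a", "a", "b", "b"), valore1 = 1:5)
Y <- data.table(chiave = c("a", "b", "c", "d"), valore2 = 1:4)
setkey(X, chiave)

# Condition named in the NEWS entry: i has no duplicate key values
i_unique <- anyDuplicated(Y$chiave) == 0L

if (packageVersion("data.table") >= "1.9.6") {
  # Per the NEWS entry, allow.cartesian is ignored here (mult defaults to "all")
  res <- X[Y]
} else {
  # Older versions (<= 1.9.4) need the flag, as shown in the question
  res <- X[Y, allow.cartesian = TRUE]
}

nrow(res)   # 7 rows, the same result as in the question
```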