Question

I'm running some code, the relevant essence of which is:

library(SparkR)
library(magrittr)
sqlContext %>% sql("select * from tmp") %>% 
  gapply("id", function(key, x) {
    data.frame(
      id = key,
      n = nrow(x)
    )
  }, schema = structType(
    structField("id", "integer"),
    structField("n", "integer")
  ))

Unfortunately, nrow comes out wrong for some values of id. Compare with running the following (on a subset of the data):

library(data.table)
tmp = sqlContext %>% sql('select * from tmp where id < 1000') %>% collect %>% setDT

And then running (where gapply_df is the collect-ed result of the gapply command above):

gapply_df[tmp[ , .N, keyby = id], on = 'id'][N < n]
#      n   N
# 1: 276 138
# 2: 148  74
# 3: 122  61
# 4: 303 101
# 5: 266 133

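I notice that the n produced by gapply (on the left) is sometimes an exact multiple (here 2x or 3x) of the actual correct count (N, on the right).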

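What causes this, and how can I fix it? I'm worried that nrow is actually giving the right answer (after all, it should be called on a local data.frame) and that my data has been copied/duplicated, which would mean the rest of my analysis may be wrong as well.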
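To separate the two possibilities (gapply handing the function a duplicated group versus the underlying data really being duplicated), the gapply counts can be compared against per-id counts obtained some other way. The following is only a minimal sketch in base R: compare_counts is a hypothetical helper, not part of SparkR, and the toy id values are made up. Only the counts 276/138 and 303/101 are taken from the output above.

```r
# Hypothetical helper (not part of SparkR or data.table): compare two local
# data.frames of per-id counts and report the ids where they disagree.
compare_counts <- function(gapply_counts, reference_counts) {
  # gapply_counts has columns id, n; reference_counts has columns id, N
  merged <- merge(gapply_counts, reference_counts, by = "id")
  # keep only the ids whose counts differ
  mismatched <- merged[merged$n != merged$N, , drop = FALSE]
  # an integer ratio suggests whole groups were repeated
  mismatched$ratio <- mismatched$n / mismatched$N
  mismatched
}

# Toy data: the id values are invented for illustration.
gapply_counts <- data.frame(id = c(1L, 2L, 3L), n = c(276L, 303L, 50L))
reference_counts <- data.frame(id = c(1L, 2L, 3L), N = c(138L, 101L, 50L))

result <- compare_counts(gapply_counts, reference_counts)
print(result)
```

If the reference counts are computed on the Spark side (for example by collecting a grouped count, assuming the resulting count column is renamed to N) and they agree with gapply's n, that would point at duplicated data; if they agree with the local N instead, it would point at gapply itself.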

Sorry I can't provide a reproducible example; here's my sessionInfo():

# R version 3.4.1 (2017-06-30)
# Platform: x86_64-pc-linux-gnu (64-bit)
# Running under: Ubuntu 14.04.5 LTS
# Matrix products: default
# BLAS: /usr/lib/libblas/libblas.so.3.0
# LAPACK: /usr/lib/lapack/liblapack.so.3.0
# locale:
# [1] C
# attached base packages:
# [1] stats     graphics  grDevices utils     datasets  methods   base     
# other attached packages:
# [1] data.table_1.10.4 magrittr_1.5      knitr_1.16        SparkR_2.1.1     
# loaded via a namespace (and not attached):
# [1] compiler_3.4.1     markdown_0.8       tools_3.4.1
# [4] KernSmooth_2.23-15 stringi_1.1.5      highr_0.6
# [7] stringr_1.2.0      mime_0.5           evaluate_0.10.1

This is running on Spark 2.1.1 through Zeppelin.

gapply sometimes returns duplicated groups?

0 answers: