Different output when using spark_apply on a Spark DataFrame vs. computing locally

Time: 2019-05-06 23:40:07

Tags: r sparklyr

I'm fairly new to sparklyr, and I'm trying to use spark_apply to run a function from a CRAN package (ChannelAttribution) on a Spark DataFrame. The output I get with spark_apply differs from the output of calling the same function directly on an in-memory data frame.

library(sparklyr)
library(dplyr)
library(tibble)
library(ChannelAttribution)

sc <- spark_connect(master = "local")

# Define some sample paths which lead to conversion.
my_paths <- tibble(path = c("A > B > C",
                            "A > A",
                            "C > B > C",
                            "B > A > B > B"),
                   conversion = 1)

# Calculate markov conversion values normally.
ChannelAttribution::markov_model(my_paths,
                                 var_path = "path",
                                 var_conv = "conversion",
                                 order = 3)

# Copy to a Spark DataFrame as a single partition, and use spark_apply to
# calculate the markov conversion values.
my_paths %>%
  sdf_copy_to(sc, ., "my_paths", repartition = 1) %>%
  spark_apply(function(df) {
    ChannelAttribution::markov_model(df,
                                     var_path = "path",
                                     var_conv = "conversion",
                                     order = 3)
  }) %>% 
  collect()
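
One thing I want to rule out first: markov_model estimates removal effects by simulating the Markov chain, so repeated runs can differ slightly even on the same in-memory data. A minimal sketch of that check (if your installed version of ChannelAttribution accepts a seed argument, fixing it would make the comparison cleaner):

# Run the model twice on the same in-memory data to see whether the
# results are stable from run to run.
run1 <- markov_model(my_paths, var_path = "path", var_conv = "conversion", order = 3)
run2 <- markov_model(my_paths, var_path = "path", var_conv = "conversion", order = 3)
all.equal(run1, run2)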

The first output is

channel_name   total_conversions
           A           1.5011965
           B           1.4990816
           C           0.9997219

The spark_apply output is

channel_name   total_conversions
           A                1.33
           B                1.33 
           C                1.33
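
In case the worker environment matters: spark_apply runs the closure in a separate R process on the worker, so I also want to confirm which ChannelAttribution version that process actually sees. A rough sketch (the temp table name my_paths_check is just for illustration):

# Return the package version visible to the R process spark_apply spawns.
my_paths %>%
  sdf_copy_to(sc, ., "my_paths_check", repartition = 1, overwrite = TRUE) %>%
  spark_apply(function(df) {
    data.frame(version = as.character(packageVersion("ChannelAttribution")),
               stringsAsFactors = FALSE)
  }) %>%
  collect()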

Any insight into what is happening here would be greatly appreciated.

0 Answers:

No answers yet