Efficiently computing row sums of a wide Spark DataFrame

Date: 2017-12-14 17:16:22

Tags: r apache-spark dplyr apache-spark-sql sparklyr

I have a wide Spark DataFrame with a few thousand columns and roughly a million rows, and I want to compute the row sums. My solution so far is shown below. I drew on: dplyr - sum of multiple columns using regular expressions and https://github.com/tidyverse/rlang/issues/116

library(sparklyr)
library(DBI)
library(dplyr)
library(rlang)

sc1 <- spark_connect(master = "local")
wide_df = as.data.frame(matrix(ceiling(runif(2000, 0, 20)), 10, 200))
wide_sdf = sdf_copy_to(sc1, wide_df, overwrite = TRUE, name = "wide_sdf")

col_eqn = paste0(colnames(wide_df), collapse = "+" )

# build up the SQL query and send to spark with DBI
query = paste0("SELECT (",
               col_eqn,
               ") as total FROM wide_sdf")

dbGetQuery(sc1, query)

# Equivalent approach using dplyr instead
col_eqn2 = quo(!! parse_expr(col_eqn))

wide_sdf %>% 
    transmute("total" := !!col_eqn2) %>%
        collect() %>%
            as.data.frame()

The problem appears when the number of columns grows. Spark SQL seems to build the sum one element at a time, i.e. (((V1 + V2) + V3) + V4) ...), and the resulting very deep nesting causes an error.
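For illustration, here is a chunked variant I sketched (hypothetical, not validated at the full scale; the chunk size of 50 is arbitrary). It keeps each generated expression shallow by first summing groups of columns and then summing the group totals:

# Hypothetical sketch: sum columns in chunks, then sum the chunk totals,
# so no single generated expression nests thousands of "+" operations.
chunk_size <- 50
col_chunks <- split(colnames(wide_df),
                    ceiling(seq_along(colnames(wide_df)) / chunk_size))

chunk_exprs <- lapply(col_chunks,
                      function(cols) parse_expr(paste0(cols, collapse = " + ")))
names(chunk_exprs) <- paste0("chunk_", seq_along(chunk_exprs))

wide_sdf %>%
  mutate(!!!chunk_exprs) %>%   # one shallow expression per chunk
  transmute(total = !!parse_expr(paste0(names(chunk_exprs), collapse = " + "))) %>%
  collect()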

Does anyone have a more efficient approach? Any help would be much appreciated.

1 answer:

Answer 0 (score: 1):

You're out of luck here. One way or another you are going to hit recursion limits (even if you bypass the SQL parser, a sufficiently large expression will crash the query planner). There are some slow solutions available:

  • Use spark_apply (at the cost of converting to and from R):

    wide_sdf %>% spark_apply(function(df) { data.frame(total = rowSums(df)) })

  • Convert to long format and aggregate (at the cost of explode and a shuffle):

    key_expr <- "monotonically_increasing_id() AS key"

    value_expr <- paste(
      "explode(array(", paste(colnames(wide_sdf), collapse = ","), ")) AS value"
    )

    wide_sdf %>%
      spark_dataframe() %>%
      # Add id and explode. We need a separate invoke so id is applied
      # before "lateral view"
      sparklyr::invoke("selectExpr", list(key_expr, "*")) %>%
      sparklyr::invoke("selectExpr", list("key", value_expr)) %>%
      sdf_register() %>%
      # Aggregate by id
      group_by(key) %>%
      summarize(total = sum(value)) %>%
      arrange(key)

To be more efficient, you should consider writing a Scala extension and applying the sum directly on a Row object, without exploding:

package com.example.sparklyr.rowsum

import org.apache.spark.sql.{DataFrame, Encoders}

object RowSum {
  def apply(df: DataFrame, cols: Seq[String]) = df.map {
    row => cols.map(c => row.getAs[Double](c)).sum
  }(Encoders.scalaDouble)
}
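
For completeness, a hedged sketch of how such an extension might be called from R, assuming the jar has been built and attached to the Spark session; the package/class name matches the one above, and the marshalling of the column-name argument may need adjusting for your sparklyr version:

# Call the (assumed) compiled extension and register the result as a table.
invoke_static(
  sc1, "com.example.sparklyr.rowsum.RowSum", "apply",
  wide_sdf %>% spark_dataframe(),
  as.list(colnames(wide_df))   # column names passed to RowSum.apply
) %>% sdf_register()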