Concat_ws() function in sparklyr is missing

Date: 2018-10-12 14:29:49

Tags: r sparklyr

I am following a tutorial on web (Adobe) analytics, in which I want to build a Markov chain model (http://datafeedtoolbox.com/attribution-theory-the-two-best-models-for-algorithmic-marketing-attribution-implemented-in-apache-spark-and-r/).

In the example, they use the function concat_ws (from library(sparklyr)). But the function does not seem to exist (after installing the package and calling the library, I get an error that the function does not exist…).
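For illustration, a minimal reproduction of the symptom (my sketch, not from the original post): calling the function directly in plain R fails, because attaching sparklyr never defines an R function by that name:

library(sparklyr)
concat_ws("-", "foo", "bar")
# Error in concat_ws("-", "foo", "bar") :
#   could not find function "concat_ws"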

Comment from the blog author: concat_ws is a Spark SQL function: https://spark.apache.org/docs/2.2.0/api/java/org/apache/spark/sql/functions.html. So you have to rely on sparklyr to run that function.

My question: is there a workaround to access the concat_ws() function? I tried:

What is the goal of this function? It concatenates multiple input string columns into a single string column, using the given separator.

2 answers:

Answer 0 (score: 2)

You can simply use base R here: paste() with a sep argument does the same job, and the dplyr backend translates it to Spark SQL for you (see the sketch below).

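A minimal sketch of that idea (my assumption, not the original answer's code: paste() from base R, which the dplyr/dbplyr backend translates to Spark SQL's CONCAT_WS):

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local[*]")
df <- copy_to(sc, tibble(x = "foo", y = "bar"))

# paste() never runs in R here; it is translated to CONCAT_WS('-', x, y)
df %>% mutate(xy = paste(x, y, sep = "-"))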

Answer 1 (score: 2)

You can't find the function because it does not exist in the sparklyr package. concat_ws is a Spark SQL function (org.apache.spark.sql.functions.concat_ws).

sparklyr relies on a SQL translation layer, where function calls are translated into SQL expressions by dbplyr:

> dbplyr::translate_sql(concat_ws("-", foo, bar))
<SQL> CONCAT_WS('-', "foo", "bar")

This means the function can only be used in a sparklyr context:

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local[*]")
df <- copy_to(sc, tibble(x="foo", y="bar"))

df %>% mutate(xy = concat_ws("-", x, y))
# # Source: spark<?> [?? x 3]
#   x     y     xy     
# * <chr> <chr> <chr>  
# 1 foo   bar   foo-bar
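Another workaround (my sketch, assuming the connection sc from above; the table name "df_tbl" is chosen here purely for illustration) is to skip the translation layer and call the Spark SQL function directly through sparklyr's DBI interface:

copy_to(sc, tibble(x = "foo", y = "bar"), name = "df_tbl", overwrite = TRUE)

# runs CONCAT_WS on the Spark side and returns a plain R data.frame
DBI::dbGetQuery(sc, "SELECT x, y, CONCAT_WS('-', x, y) AS xy FROM df_tbl")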