I have the following data.frame:
library(sparklyr)
library(dplyr)
testDF <- data.frame(A = c(1, 2, 3, 4, 5, 6, 7, 8),
                     B = c(10, 20, 30, 40, 50, 60, 70, 80),
                     C = c(100, 200, 300, 400, 500, 600, 700, 800),
                     D = c(1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000))
Once created, I can copy it into Spark using sparklyr:
testDFCopied <- copy_to(sc, testDF, "testDF", overwrite = TRUE)
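Here sc is a Spark connection that was opened beforehand. The exact cluster settings are not important for the problem; roughly, it was created along these lines (a sketch, with master = "yarn" simply matching the master=yarn that shows up in the printed output below):
# Sketch of the connection used above -- the real cluster configuration is
# not shown here; master = "yarn" matches the "master=yarn" in the output.
sc <- spark_connect(master = "yarn")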
Once copied, I can mutate it and use the lag function to create another column:
head(testDFCopied %>% dplyr::arrange(A) %>% dplyr::mutate(E = lag(A)), 10)
Source: query [?? x 5]
Database: spark connection master=yarn app=sparklyr local=FALSE
      A     B     C     D     E
  <dbl> <dbl> <dbl> <dbl> <dbl>
1     1    10   100  1000   NaN
2     2    20   200  2000     1
3     3    30   300  3000     2
4     4    40   400  4000     3
5     5    50   500  5000     4
6     6    60   600  6000     5
7     7    70   700  7000     6
8     8    80   800  8000     7
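For comparison, the same mutate on the local data.frame works as expected (there dplyr::lag() is evaluated in R rather than being translated into a Spark SQL window function, and E comes out as NA, 1, 2, ..., 7):
# Same operation on the local data.frame, for reference: dplyr::lag() runs
# in R here instead of being translated into a LAG(...) OVER (...) window
# expression in Spark SQL.
head(testDF %>% dplyr::arrange(A) %>% dplyr::mutate(E = lag(A)), 10)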
The problem appears when I try to use mutate to create several columns with the lag function. For example, here I want to create two new columns, E and F, which are the "lags" of columns A and B:
head(testDFCopied %>% dplyr::arrange(A) %>% dplyr::mutate(E = lag(A), F = lag(B)), 10)
Source: query [?? x 6]
Database: spark connection master=yarn app=sparklyr local=FALSE
Error: org.apache.spark.sql.AnalysisException: Window Frame RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW must match the required frame ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING;
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:40)
at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:58)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveWindowFrame$$anonfun$apply$29$$anonfun$applyOrElse$10.applyOrElse(Analyzer.scala:1785)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveWindowFrame$$anonfun$apply$29$$anonfun$applyOrElse$10.applyOrElse(Analyzer.scala:1781)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:278)
However, if I create two columns but use lag only once, the exception is not thrown, for example:
head(testDFCopied %>% dplyr::arrange(A) %>% dplyr::mutate(E = lag(A), F = C - B), 10)
Source: query [?? x 6]
Database: spark connection master=yarn app=sparklyr local=FALSE
      A     B     C     D     E     F
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1     1    10   100  1000   NaN    90
2     2    20   200  2000     1   180
3     3    30   300  3000     2   270
4     4    40   400  4000     3   360
5     5    50   500  5000     4   450
6     6    60   600  6000     5   540
7     7    70   700  7000     6   630
8     8    80   800  8000     7   720
For some reason, the exception is only raised when two lag() operations are performed within the mutate call. I have tried different combinations of lag() and lead() and different arrangements of the mutate (unsuccessfully); all of them raise the same exception, and I don't understand why. Looking at the Spark source code, I can see that the exception is thrown here:
/**
 * Check and add proper window frames for all window functions.
 */
object ResolveWindowFrame extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case logical: LogicalPlan => logical transformExpressions {
      case WindowExpression(wf: WindowFunction,
                            WindowSpecDefinition(_, _, f: SpecifiedWindowFrame))
          if wf.frame != UnspecifiedFrame && wf.frame != f =>
        failAnalysis(s"Window Frame $f must match the required frame ${wf.frame}")
      ...
I know it must be related to some check on the window frame required by the lag window function that is not being satisfied, but I don't really understand the underlying problem here.
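In case it helps to diagnose this, the SQL that dplyr generates for the failing pipeline can be printed without executing it (depending on the dplyr version, via show_query() or sql_render()); presumably the window specification it builds for LAG is where the frame mismatch reported above comes from:
# Print the SQL that dplyr translates the failing pipeline into, without
# running it against Spark (on older dplyr versions, sql_render() instead
# of show_query()). The OVER (...) clause generated for LAG() should show
# the window frame that Spark is complaining about.
testDFCopied %>%
  dplyr::arrange(A) %>%
  dplyr::mutate(E = lag(A), F = lag(B)) %>%
  dplyr::show_query()
Any help/ideas would be greatly appreciated.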