Using mutate and lag

Time: 2016-12-09 10:01:49

Tags: r dplyr sparkr sparklyr

I have the following data.frame:

library(sparklyr)
library(dplyr)
testDF <- data.frame(A = c(1, 2, 3, 4, 5, 6, 7, 8), 
B = c(10, 20, 30, 40, 50, 60, 70, 80), 
C = c(100, 200, 300, 400, 500, 600, 700, 800), 
D = c(1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000))

Once created, I can copy it into Spark using sparklyr:

testDFCopied <- copy_to(sc, testDF, "testDF", overwrite = TRUE)

Once copied, I can mutate a column, using the lag function to create another column:

head(testDFCopied %>% dplyr::arrange(A) %>% dplyr::mutate(E = lag(A)), 10)
Source:   query [?? x 5]
Database: spark connection master=yarn app=sparklyr local=FALSE

      A     B     C     D     E
  <dbl> <dbl> <dbl> <dbl> <dbl>
1     1    10   100  1000   NaN
2     2    20   200  2000     1
3     3    30   300  3000     2
4     4    40   400  4000     3
5     5    50   500  5000     4
6     6    60   600  6000     5
7     7    70   700  7000     6
8     8    80   800  8000     7

The problem appears when I try to use mutate to create several columns with the lag function. For example, here I want to create two new columns, E and F, which are the "lags" of columns A and B:

head(testDFCopied %>% dplyr::arrange(A) %>% dplyr::mutate(E = lag(A), F = lag(B)), 10)
Source:   query [?? x 6]
Database: spark connection master=yarn app=sparklyr local=FALSE

Error: org.apache.spark.sql.AnalysisException: Window Frame RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW must match the required frame ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING;
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:40)
    at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:58)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveWindowFrame$$anonfun$apply$29$$anonfun$applyOrElse$10.applyOrElse(Analyzer.scala:1785)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveWindowFrame$$anonfun$apply$29$$anonfun$applyOrElse$10.applyOrElse(Analyzer.scala:1781)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
    at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:278)

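For comparison, the same two-lag expression runs without error in plain (non-Spark) dplyr, which suggests the failure lies in the SQL translation and Spark's window-frame resolution rather than in the dplyr expression itself. A minimal local sketch:

```r
# Local dplyr on the plain data.frame (no Spark connection involved):
# both lagged columns are created, with NA in the first row of each.
library(dplyr)

testDF <- data.frame(A = c(1, 2, 3, 4, 5, 6, 7, 8),
                     B = c(10, 20, 30, 40, 50, 60, 70, 80))

testDF %>%
  arrange(A) %>%
  mutate(E = lag(A), F = lag(B))
```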
However, if I create two columns but use lag only once, no such exception is thrown, for example:

head(testDFCopied %>% dplyr::arrange(A) %>% dplyr::mutate(E = lag(A), F = C - B), 10)
Source:   query [?? x 6]
Database: spark connection master=yarn app=sparklyr local=FALSE

      A     B     C     D     E     F
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1     1    10   100  1000   NaN    90
2     2    20   200  2000     1   180
3     3    30   300  3000     2   270
4     4    40   400  4000     3   360
5     5    50   500  5000     4   450
6     6    60   600  6000     5   540
7     7    70   700  7000     6   630
8     8    80   800  8000     7   720

For some reason, the exception is thrown only when two lag() calls appear in the mutate. I have tried different combinations of lag() and lead(), and different arrangements of the mutate calls, without success. All of them throw the same exception, which I do not understand. Looking at the Spark source, I can see the exception is thrown here:

  /**
   * Check and add proper window frames for all window functions.
   */
  object ResolveWindowFrame extends Rule[LogicalPlan] {
    def apply(plan: LogicalPlan): LogicalPlan = plan transform {
      case logical: LogicalPlan => logical transformExpressions {
        case WindowExpression(wf: WindowFunction,
        WindowSpecDefinition(_, _, f: SpecifiedWindowFrame))
          if wf.frame != UnspecifiedFrame && wf.frame != f =>
          failAnalysis(s"Window Frame $f must match the required frame ${wf.frame}")
...

I know it must be related to some condition on the lag window function that fails this check, but I do not really understand the underlying problem here. Any help/ideas would be greatly appreciated.
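One way to sidestep the dplyr SQL translation entirely (a hedged sketch, assuming the live connection `sc` and the table registered above as "testDF") is to issue the window query directly in Spark SQL through sparklyr's DBI interface, letting Spark supply the ROWS frame that LAG requires:

```r
library(DBI)

# Hypothetical workaround: bypass dplyr's generated SQL and write the
# window functions by hand. With no explicit frame clause, Spark resolves
# the frame required by LAG (ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING)
# on its own. Assumes `sc` is the open sparklyr connection and "testDF"
# is the table name registered by copy_to() above.
result <- dbGetQuery(sc, "
  SELECT A, B, C, D,
         LAG(A, 1) OVER (ORDER BY A) AS E,
         LAG(B, 1) OVER (ORDER BY A) AS F
  FROM testDF
  ORDER BY A
")
head(result, 10)
```

This only illustrates the shape of the query; whether the dplyr pipeline itself can be made to work may depend on the dplyr/sparklyr versions in use.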

0 answers:

No answers yet.