Question

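I have Java code that is invoked through Spark (2.1.0). The whole computation (read from JSON as a Dataset -> filter some data -> compute a new column with a UDF) completes correctly, which I verified with ds.show().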

In the final step I want to keep only the records where the new column is greater than another column: ds.where(col("new_col").gt(col("mycol")));

This operation fails with the following error: Caused by: java.lang.UnsupportedOperationException: Cannot evaluate expression: some_col_name#138L

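I find this strange, because some_col_name is not part of the filter (although it is part of how new_col is computed), and when I write the data to a file before filtering, it is written correctly.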

Any ideas? Is this a known issue?

A rough example of my code:

private static java.util.List<Row> runReport(SparkSession spark, String[] dateColumns, String input_files)
            throws AnalysisException {
        Dataset<Row> df = spark.read().json(input_files);
        Dataset<Row> ds = df.groupBy("user").pivot("data").sum("my_col");
        ds = ds.na().fill(0f);
        ds = ds.withColumn("some_col", lit(100L));
        final Dataset<Row> ds2 = ds;
        List<Column> columns = Arrays.stream(dateColumns).map(x -> ds2.col(x)).collect(Collectors.toList());
        Seq<Column> seqColumns = scala.collection.JavaConverters.
                asScalaIteratorConverter(columns.iterator()).asScala().toSeq();
        ds = ds.withColumn("arrayData", array(seqColumns));
        UDF1 avgArray = new UDF1<Seq<Long>, Long>() {
            public Long call(final Seq<Long> arr) throws Exception {
                // calculate average
            }
        };

        UDF3 newGoal = new UDF3<Seq<Long>, Long, Long, Long>() {
            public Long call(Seq<Long> steps, Long goal, Long average) throws Exception {
                boolean all_above_goal = true;
                // Goal does not change if the average is below the goal
                // some logic - returns long
                return 100L;
            }
        };

        spark.udf().register("avgArray", avgArray, DataTypes.LongType);
        spark.udf().register("newGoal", newGoal, DataTypes.LongType);
        ds = ds.selectExpr("userId", "goal", "newGoal(arrayData, some_col, avgArray(arrayData)) as new_some_col");
        ds = ds.where(col("col1").gt(col("col2")));
        return  ds.collectAsList();
    }

Answer 1

Your DataFrame ds does not have a column named "new_goal".

Filter
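One way to confirm a missing-column diagnosis like the one above is to compare the names used in the filter against the columns that actually survive the selectExpr. The following is a minimal sketch in plain Java, not part of the original answer: the class ColumnCheck and the method missingColumns are hypothetical names, and the comparison is case-sensitive, which is an assumption (Spark resolves column names case-insensitively by default). The column list is taken from the selectExpr in the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ColumnCheck {

    // Hypothetical helper: returns every referenced name that is absent from `available`.
    // Assumption: names are compared case-sensitively.
    public static List<String> missingColumns(String[] available, String... referenced) {
        Set<String> present = new HashSet<>(Arrays.asList(available));
        List<String> missing = new ArrayList<>();
        for (String name : referenced) {
            if (!present.contains(name)) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Columns left after the selectExpr in the question.
        // With a real Dataset this array would come from ds.columns().
        String[] afterSelect = {"userId", "goal", "new_some_col"};

        // Names used by the filter in the rough example.
        System.out.println(missingColumns(afterSelect, "col1", "col2"));

        // Names that do exist after the projection.
        System.out.println(missingColumns(afterSelect, "new_some_col", "goal"));
    }
}
```

In the Spark job itself, the same check could be run as missingColumns(ds.columns(), "col1", "col2") just before the where call. If the intent was to compare the UDF result with the goal, the filter would then presumably be written against the surviving names, for example ds.where(col("new_some_col").gt(col("goal"))); that is a guess at the asker's intent, not something the question confirms.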
