Question

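testDF is a DataFrame. Can it be edited/modified 10 times inside a loop?

Spark has no option to edit a Dataset and save the result back into the same Dataset.

Java has no dynamic variable assignment either.

I need to do something like "Dataset<Row> testDF+(i+1) = testDF+(i)" (dynamic variables) or "Dataset<Row> testDF = testDF" (within the same Dataset) inside a for loop.

Is there a way to repeatedly update a Spark DataFrame in a loop?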

String[] arraytest = schemaString.split(";");

for (int i = 0; i < arraytest.length; i++) {
    String fieldName = arraytest[i];

    // Pseudocode (does not compile): Java has no dynamic variable names
    Dataset<Row> testDF+(i+1) = testDF+(i)
        .withColumn(fieldName,
            functions.when(functions.col(fieldName).equalTo(""), "-99")
                .otherwise(functions.col(fieldName)));
}

Answer 1

Use an accumulator for this.

For example, you can use something like this:

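// Collects the SQL strings generated on the executors so the driver can read them
CollectionAccumulator<String> queryAccumulator = sparkSession.sparkContext().collectionAccumulator();

Dataset<Row> table = sparkSession.sql("select * from table");

// The cast to org.apache.spark.api.java.function.ForeachPartitionFunction avoids the
// ambiguous-overload compile error that a bare lambda can trigger on Scala 2.12+ builds
table.foreachPartition((ForeachPartitionFunction<Row>) partition -> {
    while (partition.hasNext()) {
        Row row = partition.next();
        String sql = "select * from table where cond1= '" + row.getAs("cond1")
                + "' and cond2= '" + row.getAs("cond2")
                + "' order by starttime desc limit 24";
        queryAccumulator.add(sql);
    }
});
logger.info("Queries to be executed are {}", queryAccumulator.value());

// Run each collected query on the driver and union the results into one Dataset
Dataset<Row> limitedDataDf = queryAccumulator.value().stream()
        .map(query -> sparkSession.sql(query))
        .reduce(Dataset::union)
        .get();

limitedDataDf.createOrReplaceTempView("SPARKJOBINFORMATION_LIMIT");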
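
For the column-by-column replacement in the question itself, an accumulator may not be needed. A Dataset is immutable, but the variable that refers to it is not, so one possible approach is to reassign the same variable on every iteration: each withColumn call returns a new Dataset that replaces the previous reference. The following is only a sketch under that assumption; the class name ReplaceEmptyStrings and the method replaceEmptyStrings are hypothetical names chosen for illustration, while schemaString, fieldName and the "-99" placeholder come from the question.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.functions;

public class ReplaceEmptyStrings {

    // Hypothetical helper: for every column named in schemaString (separated
    // by ';'), replaces empty strings with "-99" and returns the new Dataset.
    public static Dataset<Row> replaceEmptyStrings(Dataset<Row> testDF, String schemaString) {
        // The Dataset itself is never mutated; only this local reference changes.
        Dataset<Row> result = testDF;
        for (String fieldName : schemaString.split(";")) {
            // withColumn with an existing column name replaces that column
            // and returns a new Dataset, which is assigned back to result.
            result = result.withColumn(
                    fieldName,
                    functions.when(functions.col(fieldName).equalTo(""), "-99")
                            .otherwise(functions.col(fieldName)));
        }
        return result;
    }
}
```

It could be called as testDF = ReplaceEmptyStrings.replaceEmptyStrings(testDF, schemaString). A plain for loop is used rather than a lambda because a local variable that is reassigned cannot be captured by a Java lambda. Each withColumn call adds a projection to the query plan, so for a very large number of columns a single select over all columns may be preferable.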
