I am trying to use the analytic/window function last_value in Spark Java.
select sno, name, addr1, addr2, run_dt,
last_value(addr1 ignore nulls) over (partition by sno, name, addr1, addr2, run_dt order by beg_ts , end_ts rows between unbounded preceding and unbounded following ) as last_addr1
from daily
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.execution.WindowFunctionFrame;
SparkConf conf = new SparkConf().setMaster("local").setAppName("Agg");
JavaSparkContext sc = new JavaSparkContext(conf);
SQLContext sqlContext = new SQLContext(sc);
JavaRDD<Stgdailydtl> daily = sc.textFile("C:\\Testing.txt").map(
    new Function<String, Stgdailydtl>() {
        private static final long serialVersionUID = 1L;
        public Stgdailydtl call(String line) throws Exception {
            String[] parts = line.split(",");
            Stgdailydtl daily = new Stgdailydtl();
            daily.setSno(Integer.parseInt(parts[0].trim()));
            .....
            return daily;
        }
    });
DataFrame schemaDailydtl = sqlContext.createDataFrame(daily, Stgdailydtl.class);
schemaDailydtl.registerTempTable("daily");
WindowSpec ws = Window.partitionBy("sno, name, addr1, addr2, run_dt").orderBy("beg_ts , end_ts").rowsBetween(0, 100000);
DataFrame df = sqlContext.sql("select sno, name, addr1, addr2, run_dt "
+ "row_number() over(partition by mach_id, msrmt_gbl_id, msrmt_dsc, elmt_dsc, end_cptr_dt order by beg_cptr_ts, end_cptr_ts) from daily ");
}
}
Exception in thread "main" java.lang.RuntimeException: [1.110] failure: ``union'' expected but `(' found
select stg.mach_id, stg.msrmt_gbl_id, stg.msrmt_dsc, stg.elmt_dsc, stg.elmt_dsc_grp_concat, row_number() over(partition by mach_id, msrmt_gbl_id, msrmt_dsc, elmt_dsc, end_cptr_dt order by beg_cptr_ts, end_cptr_ts) from stgdailydtl stg
^
at scala.sys.package$.error(package.scala:27)
I cannot figure out how to use the WindowSpec / Window objects. Please advise. Thanks for your help.
Answer 0 (score: 3)
You are mixing DataFrame syntax and SQL syntax. In particular, you create a WindowSpec but then never use it.
Import org.apache.spark.sql.functions to get the row_number function, then create the column you are trying to select:
Column rowNum = functions.row_number().over(ws);
Then select it using the DataFrame API:
df.select(each, column, you, want, rowNum)
My syntax may be slightly off, since I am used to Scala or Python, but that is the gist.
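To make the gist concrete, here is a minimal Java sketch of the DataFrame-API route. It rests on several assumptions that are not in the question: it targets the Spark 2.1+ API (so `Dataset<Row>` stands in for the `DataFrame` type used above), the class `WindowSketch` and the helper `addWindowColumns` are hypothetical names, and `last(col, true)` is used as the DataFrame-API counterpart of `last_value(addr1 ignore nulls)`. Note that each partition and order column is passed as its own argument; a single comma-joined string such as "sno, name, addr1, addr2, run_dt" would be treated as one column name.

```java
import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.last;
import static org.apache.spark.sql.functions.row_number;

public class WindowSketch {

    // Hypothetical helper: expects a frame with the columns named in the
    // question's query (sno, name, addr1, addr2, run_dt, beg_ts, end_ts).
    public static Dataset<Row> addWindowColumns(Dataset<Row> daily) {
        // One argument per column, mirroring the PARTITION BY / ORDER BY clauses.
        WindowSpec ordered = Window
                .partitionBy("sno", "name", "addr1", "addr2", "run_dt")
                .orderBy("beg_ts", "end_ts");

        // Same partitioning and ordering, with the frame widened to
        // ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING.
        WindowSpec wholePartition = ordered.rowsBetween(
                Window.unboundedPreceding(), Window.unboundedFollowing());

        // The window specs are actually used here, via Column.over(...).
        Column rowNum = row_number().over(ordered).as("row_num");
        // Assumption: last(..., true) skips nulls, like "last_value(... ignore nulls)".
        Column lastAddr1 = last(col("addr1"), true).over(wholePartition).as("last_addr1");

        return daily.select(col("sno"), col("name"), col("addr1"), col("addr2"),
                col("run_dt"), rowNum, lastAddr1);
    }
}
```

With this shape, the frame from the question would be passed in as `WindowSketch.addWindowColumns(schemaDailydtl)`. On Spark 1.x, as far as I recall, window functions were only available through a HiveContext rather than a plain SQLContext, which would be consistent with the parser error shown above.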