Spark 2 / Java 8 / Cassandra 2. I am trying to read some data from Cassandra and then run a group-by query on it in Spark. My DF has only 2 columns: transdate (Date) and origin (String).
Dataset<Row> maxOrigindate = sparks.sql("SELECT origin, transdate, COUNT(*) AS cnt FROM origins GROUP BY (origin,transdate) ORDER BY cnt DESC LIMIT 1");
I get the error:
`Exception in thread "main" org.apache.spark.sql.AnalysisException: expression 'origins.`origin`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value)`
The GROUP BY problem was resolved by removing the parentheses around the columns in the GROUP BY clause, as shown below.
Full code (trying to get the max trans date for an origin/location):
JavaRDD<TransByDate> originDateRDD = javaFunctions(sc).cassandraTable("trans", "trans_by_date", CassandraJavaUtil.mapRowTo(TransByDate.class))
        .select(CassandraJavaUtil.column("origin"), CassandraJavaUtil.column("trans_date").as("transdate")).limit((long) 100);
Dataset<Row> originDF = sparks.createDataFrame(originDateRDD, TransByDate.class);
String[] columns = originDF.columns();
System.out.println("originDF columns: " + columns[0] + " " + columns[1]); // prints: transdate origin
originDF.createOrReplaceTempView("origins");
Dataset<Row> maxOrigindate = sparks.sql("SELECT origin, transdate, COUNT(*) AS cnt FROM origins GROUP BY origin,transdate ORDER BY cnt DESC LIMIT 1");
List list = maxOrigindate.collectAsList(); // Exception thrown here
int j = list.size();
originDF columns: transdate origin
public static class TransByDate implements Serializable {
    private String origin;
    private Date transdate;

    public TransByDate() { }

    public TransByDate(String origin, Date transdate) {
        this.origin = origin;
        this.transdate = transdate;
    }

    public String getOrigin() { return origin; }
    public void setOrigin(String origin) { this.origin = origin; }
    public Date getTransdate() { return transdate; }
    public void setTransdate(Date transdate) { this.transdate = transdate; }
}
Schema:
root
|-- transdate: struct (nullable = true)
| |-- date: integer (nullable = false)
| |-- day: integer (nullable = false)
| |-- hours: integer (nullable = false)
| |-- minutes: integer (nullable = false)
| |-- month: integer (nullable = false)
| |-- seconds: integer (nullable = false)
| |-- time: long (nullable = false)
| |-- timezoneOffset: integer (nullable = false)
| |-- year: integer (nullable = false)
|-- origin: string (nullable = true)
Exception:
ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 12)
scala.MatchError: Sun Jan 01 00:00:00 PST 2012 (of class java.util.Date)
    at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:256)
    at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:251)
    at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103)
    ....
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 (TID 12, localhost): scala.MatchError: Sun Jan 01 00:00:00 PST 2012 (of class java.util.Date)
    at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:256)
    ...
Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1454)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1442)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1441)
    ...
    at org.apache.spark.sql.Dataset$$anonfun$collectAsList$1.apply(Dataset.scala:2184)
    at org.apache.spark.sql.Dataset.withCallback(Dataset.scala:2559)
    at org.apache.spark.sql.Dataset.collectAsList(Dataset.scala:2184)
    at spark.SparkTest.sqlMaxCount(SparkTest.java:244)  -> List list = maxOrigindate.collectAsList();
Caused by: scala.MatchError: Sun Jan 01 00:00:00 PST 2012 (of class java.util.Date)
    at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:256)
    at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:251)
Answer 0 (score: 1)
Change the query to:
Dataset<Row> maxOrigindate = sparks.sql("SELECT origin,
transdate,
COUNT(*) AS cnt FROM origins GROUP BY origin,transdate
ORDER BY cnt DESC LIMIT 1");
This will work.
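The same aggregation can also be expressed with the DataFrame API instead of SQL, which sidesteps the GROUP BY parenthesis pitfall entirely. A minimal sketch, assuming the originDF DataFrame already built in the question (groupBy/count/orderBy are standard Spark 2 DataFrame operations):

import static org.apache.spark.sql.functions.col;

// Equivalent of: SELECT origin, transdate, COUNT(*) AS cnt ... GROUP BY origin, transdate ORDER BY cnt DESC LIMIT 1
Dataset<Row> maxOrigindate = originDF
        .groupBy("origin", "transdate")   // group by both columns
        .count()                          // adds a "count" column, i.e. COUNT(*)
        .orderBy(col("count").desc())     // most frequent combination first
        .limit(1);                        // keep only the top row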
Answer 1 (score: 1)
You are getting the following error:
Caused by: scala.MatchError: Sun Jan 01 00:00:00 PST 2012 (of class java.util.Date) at ...
This error occurs because Spark SQL supports the java.sql.Date type, not java.util.Date. Please check the Spark documentation here. You can also refer to SPARK-2562.
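A minimal sketch of the bean with the date field switched to java.sql.Date, so Spark SQL maps it to a DateType column instead of reflecting java.util.Date as a struct. Whether the Cassandra row mapper can populate a java.sql.Date field directly may depend on the connector version; if not, convert with new java.sql.Date(utilDate.getTime()) when constructing the objects:

import java.io.Serializable;
import java.sql.Date;  // java.sql.Date, not java.util.Date

public static class TransByDate implements Serializable {
    private String origin;
    private Date transdate;  // java.sql.Date, which Spark SQL maps to DateType

    public TransByDate() { }

    public TransByDate(String origin, Date transdate) {
        this.origin = origin;
        this.transdate = transdate;
    }

    public String getOrigin() { return origin; }
    public void setOrigin(String origin) { this.origin = origin; }
    public Date getTransdate() { return transdate; }
    public void setTransdate(Date transdate) { this.transdate = transdate; }
}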