I am trying out some sample operations on a DataFrame.
Here is my sample table data:
address    district
hyderabad  001
delhi      002
mumbai     003
Now I need to select the address with max(district) from the DataFrame.
The expected result is:
mumbai 003
Here is the code I have tried so far:
import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

SparkConf conf = new SparkConf();
conf.set("spark.app.name", "max");
conf.set("spark.master", "local");
conf.set("spark.ui.port", "7077");
SparkContext ctx = new SparkContext(conf);
SQLContext sqlContext = new SQLContext(ctx);
// Load the CSV with spark-csv, inferring the schema from the header row
DataFrame df = sqlContext.read()
    .format("com.databricks.spark.csv")
    .option("inferSchema", "true")
    .option("header", "true")
    .load("/Users/hadoop/Downloads/SacramentocrimeJanuary2006.csv");
//df.registerTempTable("consumer");
//Row[] result = df.orderBy("cdatetime").select("cdatetime", "address").collect();
//DataFrame a = df.select("address", "district").agg(functions.count("district"), functions.col("address")).orderBy("address");
// This is the line that throws the exception below
DataFrame b = df.select("address", "district").where("district=max(district)");
b.show();
Here is the exception I get:
Cannot evaluate expression: (max(input[1, IntegerType]),mode=Complete,isDistinct=false)
at org.apache.spark.sql.catalyst.expressions.Unevaluable$class.genCode(Expression.scala:233)
at org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression.genCode(interfaces.scala:73)
at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$gen$2.apply(Expression.scala:106)
at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$gen$2.apply(Expression.scala:102)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.catalyst.expressions.Expression.gen(Expression.scala:102)
at org.apache.spark.sql.catalyst.expressions.BinaryExpression.nullSafeCodeGen(Expression.scala:419)
at org.apache.spark.sql.catalyst.expressions.BinaryExpression.defineCodeGen(Expression.scala:401)
at org.apache.spark.sql.catalyst.expressions.EqualTo.genCode(predicates.scala:379)
at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$gen$2.apply(Expression.scala:106)
at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$gen$2.apply(Expression.scala:102)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.catalyst.expressions.Expression.gen(Expression.scala:102)
at org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.create(GeneratePredicate.scala:42)
at org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.create(GeneratePredicate.scala:33)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:635)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:632)
at org.apache.spark.sql.execution.SparkPlan.newPredicate(SparkPlan.scala:242)
at org.apache.spark.sql.execution.Filter$$anonfun$2.apply(basicOperators.scala:71)
at org.apache.spark.sql.execution.Filter$$anonfun$2.apply(basicOperators.scala:70)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:728)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:728)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
16/12/09 10:50:57 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2)
Answer 0 (score: 0)
You should use agg and join to solve your problem: an aggregate such as max cannot be evaluated inside a where filter, which is why you get the Unevaluable exception. Like this:
data.agg(max($"district").as("maxd")).as("d1")
    .join(data.as("d2"), $"d1.maxd" === $"d2.district")
    .select($"address", $"district")
    .show()
Here data is your DataFrame. Hope it helps you.
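Since the question's code is in Java, here is a minimal Java sketch of the same agg-and-join approach, assuming the Spark 1.x DataFrame API and the df loaded in the question; the alias names maxd, d1, and d2 are arbitrary:

import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.functions;

// Aggregate down to a single row holding the maximum district,
// then join back to the original data to recover the matching row.
DataFrame maxDf = df.agg(functions.max(functions.col("district")).as("maxd")).as("d1");
DataFrame result = maxDf
    .join(df.as("d2"), functions.col("d1.maxd").equalTo(functions.col("d2.district")))
    .select(functions.col("d2.address"), functions.col("d2.district"));
result.show();

Note that if several rows tie for the maximum district, the join returns all of them.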
Answer 1 (score: 0)