Exception when using a DataFrame where clause

Posted: 2016-12-09 05:39:39

Tags: sql apache-spark dataframe max spark-dataframe

I am trying out a simple operation using a DataFrame where clause.

Below is my sample table data:

address       district
hyderabad      001
delhi          002
mumbai         003

Now I need to find, using the DataFrame, the address with the max(district).

The expected result is:

mumbai 003


Here is the code I have tried so far:

SparkConf conf = new SparkConf();
conf.set("spark.app.name", "max");
conf.set("spark.master", "local");
conf.set("spark.ui.port", "7077");

SparkContext ctx = new SparkContext(conf);
SQLContext sqlContext = new SQLContext(ctx);
DataFrame df = sqlContext.read()
    .format("com.databricks.spark.csv")
    .option("inferSchema", "true")
    .option("header", "true")
    .load("/Users/hadoop/Downloads/SacramentocrimeJanuary2006.csv");
//df.registerTempTable("consumer");
//Row[] result = df.orderBy("cdatetime").select("cdatetime","address").collect();
//DataFrame a = df.select("address","district").agg(functions.count("district"),functions.col("address")).orderBy("address");
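// The following where() call is what triggers the exception shown below.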
DataFrame b = df.select("address", "district").where("district=max(district)");
b.show();

Here is the exception I get:

Cannot evaluate expression: (max(input[1, IntegerType]),mode=Complete,isDistinct=false)
    at org.apache.spark.sql.catalyst.expressions.Unevaluable$class.genCode(Expression.scala:233)
    at org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression.genCode(interfaces.scala:73)
    at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$gen$2.apply(Expression.scala:106)
    at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$gen$2.apply(Expression.scala:102)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.sql.catalyst.expressions.Expression.gen(Expression.scala:102)
    at org.apache.spark.sql.catalyst.expressions.BinaryExpression.nullSafeCodeGen(Expression.scala:419)
    at org.apache.spark.sql.catalyst.expressions.BinaryExpression.defineCodeGen(Expression.scala:401)
    at org.apache.spark.sql.catalyst.expressions.EqualTo.genCode(predicates.scala:379)
    at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$gen$2.apply(Expression.scala:106)
    at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$gen$2.apply(Expression.scala:102)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.sql.catalyst.expressions.Expression.gen(Expression.scala:102)
    at org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.create(GeneratePredicate.scala:42)
    at org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.create(GeneratePredicate.scala:33)
    at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:635)
    at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:632)
    at org.apache.spark.sql.execution.SparkPlan.newPredicate(SparkPlan.scala:242)
    at org.apache.spark.sql.execution.Filter$$anonfun$2.apply(basicOperators.scala:71)
    at org.apache.spark.sql.execution.Filter$$anonfun$2.apply(basicOperators.scala:70)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:728)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:728)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
16/12/09 10:50:57 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2)

2 answers:

Answer 0 (score: 0):

You should use an aggregate and a join to solve your problem: an aggregate such as max() cannot be evaluated inside a where clause, so compute the maximum first and then join it back. Like this:

data.agg(max($"district").as("maxd")).as("d1")
  .join(data.as("d2"), $"d1.maxd" === $"d2.district")
  .select($"address", $"district")
  .show()

Here, data is your DataFrame. Hope it helps.
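Since the question's code is Java, a rough Java sketch of the same aggregate-and-join idea might look like this (a sketch only, assuming the Spark 1.6 DataFrame API used in the question and the column names from the sample data; maxd is just an illustrative alias, and df is the DataFrame loaded above):

import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.functions;

// Compute max(district) once, then join it back so that no aggregate
// expression ever appears inside a filter predicate.
DataFrame maxDf = df.agg(functions.max(df.col("district")).as("maxd"));
DataFrame result = df.join(maxDf, df.col("district").equalTo(maxDf.col("maxd")))
                     .select(df.col("address"), df.col("district"));
result.show();

Unlike taking a single sorted head row, this returns every row that ties for the maximum district.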

Answer 1 (score: 0):

You can use the sort function on the DataFrame and order it in descending order, then simply use the head function to get the desired output.

Here is a code sample:

import org.apache.spark.sql.functions._
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("inferSchema", "true")
  .option("header", "true")
  .load("/user/userapp/sample.csv")
val a = df.sort(desc("district")).head

Here is the output:

[output screenshot]
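The same idea expressed in Java, as a rough sketch assuming the Spark 1.6 DataFrame API from the question (note that head() returns a single Row, so if several rows tie for the maximum district only one of them comes back):

import org.apache.spark.sql.Row;
import org.apache.spark.sql.functions;

// Sort by district in descending order and take the first row.
// "df" is the DataFrame loaded in the question's code.
Row top = df.sort(functions.desc("district")).head();
System.out.println(top);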