Joining aggregated DataFrames

Date: 2019-08-16 14:27:50

Tags: scala apache-spark apache-spark-sql apache-spark-dataset

I have a self-joined DataFrame that I group by id and aggregate to produce a count column in a new DataFrame. From the same self-joined DataFrame I also filter on a condition, then group and aggregate to produce a second count column. I then want to join the two resulting DataFrames on the id column, so that I end up with one DataFrame holding the id column, the count column, and the post-filter count column. Instead I get the error listed below.

I also tried joining the aggregated DataFrame with a different DataFrame and got the same error.

Here is the code I am using:

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, count}
import spark.implicits._ // for the $ column syntax

// Total pair count per id.
val aggregatedDataframe = selfJoinedDataframe
  .groupBy("id")
  .count()

// Pair count per id after filtering on point-set distance.
val truePositiveDataframe = selfJoinedDataframe
  .filter { row =>
    val s1 = row.getAs[Seq[Any]]("points").map { case s: Row =>
      Point(s.getAs[Double]("latitude"), s.getAs[Double]("longitude")).toCartesian
    }
    val s2 = row.getAs[Seq[Any]]("points1").map { case s: Row =>
      Point(s.getAs[Double]("latitude"), s.getAs[Double]("longitude")).toCartesian
    }
    Distance.calculate(s1, s2) <= distanceThreshold
  }
  .groupBy("id")
  .agg(count($"id").as("filtered_count"))

// Join the two aggregates on id; this is where the exception is thrown.
truePositiveDataframe
  .join(aggregatedDataframe, "id")
  .withColumn("accuracy", col("filtered_count") / col("count"))
  .show()
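The snippet above references Point, Distance.calculate, and distanceThreshold, which the question does not define. For context, here is a hypothetical sketch of what they might look like; everything in it is an assumption, since the real implementations live in the asker's com.swidan.lsh package:

// Hypothetical stand-ins for the helpers the question references but
// does not define.
case class CartesianPoint(x: Double, y: Double, z: Double)

case class Point(latitude: Double, longitude: Double) {
  // Project latitude/longitude (degrees) onto the unit sphere.
  def toCartesian: CartesianPoint = {
    val lat = math.toRadians(latitude)
    val lon = math.toRadians(longitude)
    CartesianPoint(
      math.cos(lat) * math.cos(lon),
      math.cos(lat) * math.sin(lon),
      math.sin(lat))
  }
}

object Distance {
  // Assumed semantics: mean Euclidean distance between paired points.
  def calculate(s1: Seq[CartesianPoint], s2: Seq[CartesianPoint]): Double = {
    val pairs = s1.zip(s2)
    if (pairs.isEmpty) 0.0
    else pairs.map { case (a, b) =>
      math.sqrt(math.pow(a.x - b.x, 2) + math.pow(a.y - b.y, 2) + math.pow(a.z - b.z, 2))
    }.sum / pairs.length
  }
}

val distanceThreshold: Double = 0.01 // assumed tuning constant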

Here is the error message:

Exception in thread "main" org.apache.spark.sql.AnalysisException: Resolved attribute(s) id#192 missing from partition_number#143,id#116,_3#111 in operator !Project [partition_number#143, id#192, _3#111 AS signature#120]. Attribute(s) with the same name appear in the operation: id. Please check if the right attribute(s) are used.;;
Join Inner, (id#192 = id#116)
:- SubqueryAlias `original`
:  +- Aggregate [id#116], [id#116, count(1) AS count#180L]
:     +- Filter NOT (id#116 = id1#139)
:        +- Project [partition_number#112, trajectory_points#44, id#116, trajectory_points1#135, id1#139]
:           +- Join Inner, (partition_number#112 = partition_number#143)
:              :- Project [trajectory_points#44, partition_number#112, id#116]
:              :  +- Project [id#116, partition_number#112, signature#120, trajectory_points#44]
:              :     +- Join Inner, (id#116 = id#41)
:              :        :- Project [partition_number#112, id#116, _3#111 AS signature#120]
:              :        :  +- Project [partition_number#112, _2#110 AS id#116, _3#111]
:              :        :     +- Project [_1#109 AS partition_number#112, _2#110, _3#111]
:              :        :        +- SerializeFromObject [assertnotnull(assertnotnull(input[0, scala.Tuple3, true]))._1 AS _1#109, assertnotnull(assertnotnull(input[0, scala.Tuple3, true]))._2 AS _2#110, assertnotnull(assertnotnull(input[0, scala.Tuple3, true]))._3 AS _3#111]
:              :        :           +- MapPartitions com.swidan.lsh.Metrics$$Lambda$2497/0x0000000840f66040@89017e5, obj#108: scala.Tuple3
:              :        :              +- DeserializeToObject newInstance(class com.swidan.lsh.package$HashedTrajectorySignature), obj#107: com.swidan.lsh.package$HashedTrajectorySignature
:              :        :                 +- SerializeFromObject [assertnotnull(assertnotnull(input[0, com.swidan.lsh.package$HashedTrajectorySignature, true])).id AS id#91, assertnotnull(assertnotnull(input[0, com.swidan.lsh.package$HashedTrajectorySignature, true])).signature AS signature#92]

Edit

Solved

It somehow works when I join copies of the two aggregated DataFrames created with toDF under new column names, and reference those new names in the join. This is a bit odd, since a join returns a new DataFrame anyway; the likely explanation is that toDF assigns fresh attribute IDs to the columns, so the analyzer no longer sees the same attributes on both branches derived from the same self-join. The working join snippet is below:

aggregatedDataframe
  .toDF("id", "base_count")   // copy with fresh column names
  .alias("base")
  .join(
    truePositiveDataframe
      .toDF("id", "true_count")
      .alias("true"),
    $"base.id" === $"true.id")
  .withColumn("recall", col("true.true_count") / col("base.base_count"))
  .show()
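For completeness, renaming only the conflicting columns is another common way to avoid name ambiguity in joins. A minimal sketch assuming the same two aggregated DataFrames; whether it also resolves this particular clash would need testing, since the underlying problem here is shared attribute IDs rather than column names:

import org.apache.spark.sql.functions.col

// Rename the count columns up front so neither side of the join
// exposes an ambiguous name; Seq("id") keeps a single id column.
val base      = aggregatedDataframe.withColumnRenamed("count", "base_count")
val positives = truePositiveDataframe.withColumnRenamed("filtered_count", "true_count")

base.join(positives, Seq("id"))
  .withColumn("recall", col("true_count") / col("base_count"))
  .show()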

0 Answers:

No answers yet.