I have a self-joined DataFrame that I group by id and then aggregate to produce a count column in a new DataFrame. From the same self-joined DataFrame I also filter on a condition, then group and aggregate to produce a second count column. Afterwards I want to join the two resulting DataFrames on the id column to get a DataFrame with an id column, the count column, and the count-after-filter column. I get the error listed below.
I also tried joining the merged DataFrame with another DataFrame and got the same error.
Here is the code I am using:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, count}
import spark.implicits._

// Count of candidate pairs per id over the full self-join.
val aggregatedDataframe = selfJoinedDataframe
  .groupBy("id")
  .count()

// Count of pairs per id whose trajectories fall within the distance threshold.
val truePositiveDataframe = selfJoinedDataframe
  .filter { row =>
    val s1 = row.getAs[Seq[Any]]("points").map { case s: Row =>
      Point(s.getAs[Double]("latitude"), s.getAs[Double]("longitude")).toCartesian
    }
    val s2 = row.getAs[Seq[Any]]("points1").map { case s: Row =>
      Point(s.getAs[Double]("latitude"), s.getAs[Double]("longitude")).toCartesian
    }
    Distance.calculate(s1, s2) <= distanceThreshold
  }
  .groupBy("id")
  .agg(count($"id").as("filtered_count"))

truePositiveDataframe
  .join(aggregatedDataframe, "id")
  .withColumn("accuracy", col("filtered_count") / col("count"))
  .show()
Here is the error message:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Resolved attribute(s) id#192 missing from partition_number#143,id#116,_3#111 in operator !Project [partition_number#143, id#192, _3#111 AS signature#120]. Attribute(s) with the same name appear in the operation: id. Please check if the right attribute(s) are used.;;
Join Inner, (id#192 = id#116)
:- SubqueryAlias `original`
: +- Aggregate [id#116], [id#116, count(1) AS count#180L]
: +- Filter NOT (id#116 = id1#139)
: +- Project [partition_number#112, trajectory_points#44, id#116, trajectory_points1#135, id1#139]
: +- Join Inner, (partition_number#112 = partition_number#143)
: :- Project [trajectory_points#44, partition_number#112, id#116]
: : +- Project [id#116, partition_number#112, signature#120, trajectory_points#44]
: : +- Join Inner, (id#116 = id#41)
: : :- Project [partition_number#112, id#116, _3#111 AS signature#120]
: : : +- Project [partition_number#112, _2#110 AS id#116, _3#111]
: : : +- Project [_1#109 AS partition_number#112, _2#110, _3#111]
: : : +- SerializeFromObject [assertnotnull(assertnotnull(input[0, scala.Tuple3, true]))._1 AS _1#109, assertnotnull(assertnotnull(input[0, scala.Tuple3, true]))._2 AS _2#110, assertnotnull(assertnotnull(input[0, scala.Tuple3, true]))._3 AS _3#111]
: : : +- MapPartitions com.swidan.lsh.Metrics$$Lambda$2497/0x0000000840f66040@89017e5, obj#108: scala.Tuple3
: : : +- DeserializeToObject newInstance(class com.swidan.lsh.package$HashedTrajectorySignature), obj#107: com.swidan.lsh.package$HashedTrajectorySignature
: : : +- SerializeFromObject [assertnotnull(assertnotnull(input[0, com.swidan.lsh.package$HashedTrajectorySignature, true])).id AS id#91, assertnotnull(assertnotnull(input[0, com.swidan.lsh.package$HashedTrajectorySignature, true])).signature AS signature#92]
It somehow works when I make a copy of the aggregated DataFrame with new column names (and the column names have to be new) before joining. That is a bit strange, since a join returns a new DataFrame anyway, but I don't see why this works. The working join snippet is below:
aggregatedDataframe
  .toDF("id", "base_count")
  .alias("base")
  .join(
    truePositiveDataframe
      .toDF("id", "true_count")
      .alias("true"),
    $"base.id" === $"true.id")
  .withColumn("recall", col("true.true_count") / col("base.base_count"))
  .show()
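For reference, another workaround that is sometimes suggested for this class of "Resolved attribute(s) ... missing" errors is to break the shared lineage of one side before the join, so the analyzer assigns fresh attribute IDs. A minimal sketch, assuming spark is the active SparkSession and the same two DataFrames as above:

// Rebuilding one side from its RDD and schema forces fresh attribute IDs,
// removing the ambiguity between the two branches derived from the same
// self-joined DataFrame. (Sketch only; not verified against this exact plan.)
val freshAggregated =
  spark.createDataFrame(aggregatedDataframe.rdd, aggregatedDataframe.schema)

truePositiveDataframe
  .join(freshAggregated, "id")
  .withColumn("accuracy", col("filtered_count") / col("count"))
  .show()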