I wrote a SQL query that finds rows with duplicate elevations in a table, along with their other unique columns. Here is my query. I want to convert it to PySpark.
dup_df = spark.sql('''
    SELECT g.pbkey,
           g.lon,
           g.lat,
           g.elevation
    FROM DATA AS g
    INNER JOIN
        (SELECT elevation,
                COUNT(elevation) AS NumOccurrences
         FROM DATA
         GROUP BY elevation
         HAVING COUNT(elevation) > 1) AS a
        ON a.elevation = g.elevation
''')
Answer (score: 0)
In Scala this can be done with a Window function, and it translates readily to Python:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.count
import spark.implicits._

val data = Seq(1, 2, 3, 4, 5, 7, 3).toDF("elevation")
// Partition by elevation so count() gives the number of occurrences of each value
val elevationWindow = Window.partitionBy("elevation")
data
  .withColumn("elevationCount", count("elevation").over(elevationWindow))
  .where($"elevationCount" > 1)
  .drop("elevationCount")
The output is:
+---------+
|elevation|
+---------+
|3        |
|3        |
+---------+
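
A direct PySpark equivalent of the Scala snippet above might look like the sketch below. The sample DataFrame and its column values are assumptions standing in for your DATA table; keeping pbkey, lon, and lat in the frame means the duplicated rows come back with all their columns, just like the self-join in the original query:

    from pyspark.sql import SparkSession, Window
    from pyspark.sql.functions import col, count

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical stand-in for the DATA table from the question
    data = spark.createDataFrame(
        [(1, 10.0, 20.0, 3), (2, 11.0, 21.0, 5), (3, 12.0, 22.0, 3)],
        ["pbkey", "lon", "lat", "elevation"],
    )

    # Count rows per elevation; any count > 1 marks a duplicated elevation
    elevation_window = Window.partitionBy("elevation")

    dup_df = (
        data.withColumn("elevationCount", count("elevation").over(elevation_window))
            .where(col("elevationCount") > 1)
            .drop("elevationCount")
    )

    dup_df.show()

Using a window avoids the self-join entirely: the count is computed per elevation partition in a single pass, and the original columns stay attached to each row.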