Find duplicates using row numbers in Pyspark

Date: 2019-07-16 12:13:27

Tags: apache-spark pyspark pyspark-sql

I wrote a SQL query that finds rows with duplicate elevations in a table, along with their other, unique columns. Here is my query; I would like to convert it to pyspark.

dup_df = spark.sql('''
SELECT g.pbkey,
       g.lon,
       g.lat,
       g.elevation
FROM DATA AS g
INNER JOIN
  (SELECT elevation,
          COUNT(elevation) AS NumOccurrences
   FROM DATA
   GROUP BY elevation
   HAVING (COUNT(elevation) > 1)) AS a ON (a.elevation = g.elevation)
''')
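For reference, a rough sketch of the same logic using the DataFrame API instead of a SQL string (assuming the SparkSession is named spark, as in the query above, and that DATA is available as a table):

from pyspark.sql import functions as F

# Assumed: the DATA table is registered and can be loaded as a DataFrame.
data = spark.table("DATA")

# Elevations that occur more than once.
dup_elevations = (
    data.groupBy("elevation")
        .agg(F.count("elevation").alias("NumOccurrences"))
        .where(F.col("NumOccurrences") > 1)
        .select("elevation")
)

# Keep only the rows whose elevation is duplicated.
dup_df = (
    data.join(dup_elevations, on="elevation", how="inner")
        .select("pbkey", "lon", "lat", "elevation")
)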

1 Answer:

Answer 0 (score: 0)

In Scala this can be done with a Window, and it can be translated to Python:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.count
import spark.implicits._

val data = Seq(1, 2, 3, 4, 5, 7, 3).toDF("elevation")
val elevationWindow = Window.partitionBy("elevation")

// Count occurrences per elevation and keep only rows whose elevation appears more than once
data
  .withColumn("elevationCount", count("elevation").over(elevationWindow))
  .where($"elevationCount" > 1)
  .drop("elevationCount")

The output is:

+---------+
|elevation|
+---------+
|3        |
|3        |
+---------+
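A minimal PySpark sketch of the same Window approach (assuming an active SparkSession named spark; the sample data mirrors the Scala example):

from pyspark.sql import Window
from pyspark.sql import functions as F

data = spark.createDataFrame([(1,), (2,), (3,), (4,), (5,), (7,), (3,)], ["elevation"])

elevation_window = Window.partitionBy("elevation")

# Count occurrences per elevation and keep only duplicated rows
dup_df = (
    data.withColumn("elevationCount", F.count("elevation").over(elevation_window))
        .where(F.col("elevationCount") > 1)
        .drop("elevationCount")
)
dup_df.show()

Compared with the self-join in the question, the window count avoids joining the table to itself and keeps all original columns (pbkey, lon, lat, elevation) without listing them in a join.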