Remove duplicate records based on another column in PySpark

Date: 2018-05-03 20:15:08

Tags: apache-spark pyspark

I have a PySpark data frame as shown below.

df.show()
+---+----+
| id|test|
+---+----+
|  1|   Y|
|  1|   N|
|  2|   Y|
|  3|   N|
+---+----+

I want to remove a record when its id is duplicated and its test value is N.

The resulting new_df, when queried, should look like this:

new_df.show()
+---+----+
| id|test|
+---+----+
|  1|   Y|
|  2|   Y|
|  3|   N|
+---+----+

I am not able to figure out the logic for this.

I have done a groupBy with a count on id, but it only gives me the id column and the count.

This is what I have done:

grouped_df = new_df.groupBy("id").count()

How can I achieve my desired result?


EDIT:

I have a data frame as shown below.

+-------------+--------------------+--------------------+
|           sn|              device|           attribute|
+-------------+--------------------+--------------------+
|4MY16A5602E0A|       Android Phone|                   N|
|4MY16A5W02DE8|       Android Phone|                   N|
|4MY16A5W02DE8|       Android Phone|                   Y|
|4VT1735J00337|                  TV|                   N|
|4VT1735J00337|                  TV|                   Y|
|4VT47B52003EE|              Router|                   N|
|4VT47C5N00A10|               Other|                   N|
+-------------+--------------------+--------------------+

When I do

new_df = df.groupBy("sn").agg(max("attribute").alias("attribute"))

I get a "str has no attribute alias" error.
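
This error usually means the Python built-in max is being called here rather than pyspark.sql.functions.max; the built-in, applied to the string "attribute", returns a plain str, which has no alias method. A minimal sketch of a fix, assuming the functions module is imported under an alias (so it does not shadow built-ins) and that device is constant for each sn:

from pyspark.sql import functions as F

# Referencing the PySpark max explicitly avoids the shadowing problem.
# Grouping by device as well keeps that column, assuming device is constant per sn.
new_df = df.groupBy("sn", "device").agg(F.max("attribute").alias("attribute"))
new_df.show()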

The expected result should be as below.

+-------------+--------------------+--------------------+
|           sn|              device|           attribute|
+-------------+--------------------+--------------------+
|4MY16A5602E0A|       Android Phone|                   N|
|4MY16A5W02DE8|       Android Phone|                   Y|
|4VT1735J00337|                  TV|                   Y|
|4VT47B52003EE|              Router|                   N|
|4VT47C5N00A10|               Other|                   N|
+-------------+--------------------+--------------------+

4 Answers:

Answer 0 (score: 4)

Not the most universal solution, but it should work well here:

from pyspark.sql.functions import max

df = spark.createDataFrame(
  [(1, "Y"), (1, "N"), (2, "Y"), (3, "N")], ("id", "test")
)

df.groupBy("id").agg(max("test").alias("test")).show()
# +---+----+         
# | id|test|
# +---+----+
# |  1|   Y|
# |  3|   N|
# |  2|   Y|
# +---+----+

A more general one:

from pyspark.sql.functions import col, count, when

test = when(count(when(col("test") == "Y", "Y")) > 0, "Y").otherwise("N")

df.groupBy("id").agg(test.alias("test")).show()
# +---+----+
# | id|test|
# +---+----+
# |  1|   Y|
# |  3|   N|
# |  2|   Y|
# +---+----+

This can be generalized to accommodate more classes and a non-trivial ordering; for example, if you had three classes Y, ?, N evaluated in that order, you could use:

(when(count(when(col("test") == "Y", True)) > 0, "Y")
     .when(count(when(col("test") == "?", True)) > 0, "?")
     .otherwise("N"))
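
As a quick illustration (a sketch with made-up sample data; the df3 name and the ? values are hypothetical), the chain above is used inside agg exactly as before:

from pyspark.sql.functions import col, count, when

df3 = spark.createDataFrame(
    [(1, "N"), (1, "?"), (2, "N"), (3, "Y")], ("id", "test")
)

# Pick Y if the group contains any Y, otherwise ? if it contains any ?, otherwise N.
test = (when(count(when(col("test") == "Y", True)) > 0, "Y")
        .when(count(when(col("test") == "?", True)) > 0, "?")
        .otherwise("N"))

df3.groupBy("id").agg(test.alias("test")).show()
# id 1 -> ?, id 2 -> N, id 3 -> Y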

If there are other columns you need to preserve, these methods won't be enough on their own; you will need something like what is shown in Find maximum row per group in Spark DataFrame.
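
For the edited example with the device column, that typically means a window function. A minimal sketch, assuming df is the edited data frame with sn, device, and attribute columns, and relying on Y sorting after N so a descending sort puts the preferred row first:

from pyspark.sql import Window
from pyspark.sql.functions import col, row_number

# Number the rows within each sn, preferred attribute value first.
w = Window.partitionBy("sn").orderBy(col("attribute").desc())

(df.withColumn("rn", row_number().over(w))
   .filter(col("rn") == 1)   # keep one row per sn, with all its columns
   .drop("rn")
   .show())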

Answer 1 (score: 3)

Another option, using row_number:

df.selectExpr(
    '*', 
    'row_number() over (partition by id order by test desc) as rn'
).filter('rn=1 or test="Y"').drop('rn').show()

+---+----+
| id|test|
+---+----+
|  1|   Y|
|  3|   N|
|  2|   Y|
+---+----+

This approach does not aggregate; it only drops the duplicate ids whose test value is N.

Answer 2 (score: 0)

Using a Spark SQL temp table (I used a Databricks Notebook):

case class T(id:Int,test:String)
val df=spark.createDataset(Seq(T(1, "Y"), T(1, "N"), T(2, "Y"), T(3, "N")))
df.createOrReplaceTempView("df")
%sql select id, max(test) from df group by id
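
For reference, an equivalent in PySpark (a sketch, assuming a SparkSession named spark) would be:

df = spark.createDataFrame([(1, "Y"), (1, "N"), (2, "Y"), (3, "N")], ("id", "test"))
df.createOrReplaceTempView("df")
# Same query as the %sql cell above, run through the SparkSession.
spark.sql("select id, max(test) as test from df group by id").show()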


Answer 3 (score: 0)

You can use the following code:

#register as temp table
df.registerTempTable("df")

#create single rows
#create single rows (the SQL string must be quoted)
newDF = sqlc.sql("""
WITH dfCte AS
(
    select *, row_number() over (partition by id order by test desc) as RowNumber
    from df
)
select * from dfCte where RowNumber = 1
""")

#drop row numbers and show the newdf
newDF.drop('RowNumber').show()