Remove duplicate rows, regardless of new information - PySpark

Asked: 2018-06-04 17:04:18

Tags: pyspark apache-spark-sql distinct pyspark-sql

Say I have a dataframe like so:

ID         Media
1         imgix.com/20830dk
2         imgix.com/202398pwe
3         imgix.com/lvw0923dk
4         imgix.com/082kldcm
4         imgix.com/lks032m
4         imgix.com/903248

I'd like to end up with:

ID         Media
1         imgix.com/20830dk
2         imgix.com/202398pwe
3         imgix.com/lvw0923dk
4         imgix.com/082kldcm

Even though that causes me to lose two links for ID = 4, I don't care. Is there a simple way to do this in Python/PySpark?
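For reference, a minimal sketch that reproduces the example DataFrame (assuming a local SparkSession named `spark`; the data comes straight from the table above):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Sample data matching the table in the question
    data = [
        (1, 'imgix.com/20830dk'),
        (2, 'imgix.com/202398pwe'),
        (3, 'imgix.com/lvw0923dk'),
        (4, 'imgix.com/082kldcm'),
        (4, 'imgix.com/lks032m'),
        (4, 'imgix.com/903248'),
    ]
    df = spark.createDataFrame(data, ['ID', 'Media'])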

2 Answers:

Answer 0 (score: 0)

  1. Group by col('ID')
  2. Use collect_list with agg to aggregate the Media values into a list
  3. Call getItem(0) to extract the first element from the aggregated list

    from pyspark.sql.functions import collect_list

    # getItem(0) keeps only the first element of the collected list per ID
    df.groupBy('ID').agg(collect_list('Media').getItem(0).alias('Media')).show()
    
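A one-step alternative (not from the original answer, and assuming the same `df` as sketched above) uses Spark's first aggregate instead of indexing into a collected list:

    from pyspark.sql.functions import first

    # first() keeps one (arbitrary) Media value per ID in a single aggregation
    df.groupBy('ID').agg(first('Media').alias('Media')).show()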

Answer 1 (score: 0)

Anton and pault are correct:

    df.drop_duplicates(subset=['ID'])

does indeed work.
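Applied to the sample `df` sketched above, this keeps exactly one row per ID (which of the three rows survives for ID = 4 is not guaranteed):

    # drop_duplicates (alias of dropDuplicates) keeps one row per distinct ID
    df.drop_duplicates(subset=['ID']).show()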