Remove duplicate rows, regardless of new information - PySpark

Asked: 2018-06-04 17:04:18

Tags: pyspark apache-spark-sql distinct pyspark-sql

Say I have a dataframe like so:

ID         Media
1         imgix.com/20830dk
2         imgix.com/202398pwe
3         imgix.com/lvw0923dk
4         imgix.com/082kldcm
4         imgix.com/lks032m
4         imgix.com/903248

I'd like to end up with:

ID         Media
1         imgix.com/20830dk
2         imgix.com/202398pwe
3         imgix.com/lvw0923dk
4         imgix.com/082kldcm

Even though that causes me to lose two links for ID = 4, I don't care. Is there a simple way to do this in Python/PySpark?
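For reference, a minimal sketch that reproduces the example DataFrame (assuming a local SparkSession named `spark`; the data comes straight from the table above):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Sample data matching the table in the question
    data = [
        (1, 'imgix.com/20830dk'),
        (2, 'imgix.com/202398pwe'),
        (3, 'imgix.com/lvw0923dk'),
        (4, 'imgix.com/082kldcm'),
        (4, 'imgix.com/lks032m'),
        (4, 'imgix.com/903248'),
    ]
    df = spark.createDataFrame(data, ['ID', 'Media'])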

2 Answers:

Answer 0 (score: 0)

  1. Group by col('ID')
  2. Use collect_list with agg to aggregate the Media values into a list
  3. Call getItem(0) to extract the first element from the aggregated list

    from pyspark.sql.functions import collect_list

    # getItem(0) keeps only the first element of the collected list per ID
    df.groupBy('ID').agg(collect_list('Media').getItem(0).alias('Media')).show()
    
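A one-step alternative (not from the original answer, and assuming the same `df` as sketched above) uses Spark's first aggregate instead of indexing into a collected list:

    from pyspark.sql.functions import first

    # first() keeps one (arbitrary) Media value per ID in a single aggregation
    df.groupBy('ID').agg(first('Media').alias('Media')).show()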

Answer 1 (score: 0)

Anton and pault are correct:

    df.drop_duplicates(subset=['ID'])

does indeed work.
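Applied to the sample `df` sketched above, this keeps exactly one row per ID (which of the three rows survives for ID = 4 is not guaranteed):

    # drop_duplicates (alias of dropDuplicates) keeps one row per distinct ID
    df.drop_duplicates(subset=['ID']).show()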