Say I have a dataframe like so:
ID Media
1 imgix.com/20830dk
2 imgix.com/202398pwe
3 imgix.com/lvw0923dk
4 imgix.com/082kldcm
4 imgix.com/lks032m
4 imgix.com/903248
I'd like to end up with:
ID Media
1 imgix.com/20830dk
2 imgix.com/202398pwe
3 imgix.com/lvw0923dk
4 imgix.com/082kldcm
I don't mind losing the other two links for ID = 4. Is there a simple way to do this in Python/PySpark?
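Answer 0 (score: 0)
Call getItem(0) to extract the first element from the aggregated list. Which link comes first is not guaranteed, because collect_list does not preserve row order after a shuffle.
from pyspark.sql.functions import collect_list
df.groupBy('ID').agg(collect_list('Media').getItem(0).alias('Media')).show()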
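For anyone working in plain pandas rather than PySpark, the same group-and-take-first idea could look roughly like the sketch below. It assumes the question's data has been loaded into a pandas DataFrame named df; the name first_per_id is only illustrative.

```python
import pandas as pd

# Sample data copied from the question (assumed to be a pandas DataFrame here).
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 4, 4],
    "Media": [
        "imgix.com/20830dk",
        "imgix.com/202398pwe",
        "imgix.com/lvw0923dk",
        "imgix.com/082kldcm",
        "imgix.com/lks032m",
        "imgix.com/903248",
    ],
})

# Group by ID and keep the first non-null Media value in each group.
first_per_id = df.groupby("ID", as_index=False)["Media"].first()
print(first_per_id)
```

Unlike the Spark version, pandas keeps rows in their original order within each group, so the first link listed for each ID is the one retained.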
Answer 1 (score: 0)
Anton and pault are correct:
df.drop_duplicates(subset=['ID'])
does indeed work.
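Here is a small self-contained check of that call, as a sketch that assumes a pandas DataFrame built from the question's data (the variable name deduped is made up for illustration). In pandas the default is keep='first', so the first row per ID survives. PySpark DataFrames accept the same drop_duplicates call, but there the surviving row is not guaranteed to be the first one.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 4, 4],
    "Media": [
        "imgix.com/20830dk",
        "imgix.com/202398pwe",
        "imgix.com/lvw0923dk",
        "imgix.com/082kldcm",
        "imgix.com/lks032m",
        "imgix.com/903248",
    ],
})

# Keep one row per ID; pandas keeps the first occurrence by default.
deduped = df.drop_duplicates(subset=["ID"]).reset_index(drop=True)
print(deduped)
```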