So I need to get the n (by default 3) largest elements from this dataset. What is a reasonable way to do this in PySpark? I know how to do it in Pandas, but I'd like to know how to do it efficiently in PySpark, or whether that is even possible. My first idea was to use pyspark.sql.functions, something like this:
import pyspark.sql.functions as F

def n_largest(df_tmp):
    ls = []
    cols = df_tmp.columns[:-1]  # all value columns; "Variable" holds the row label
    for j in cols:
        # row-wise maximum across the value columns for row j
        max_v = df_tmp.where(df_tmp["Variable"] == j).select(F.greatest(*[F.col(col) for col in cols]))
        ls.append(max_v.collect()[0][0])
    return max(ls)
But this seems like a really bad approach, because it only returns the maximum value (0.984), not the combination it came from (Charlie, Foxtrot). Also, I don't see how to get the second-largest value without overwriting the value in the (Charlie, Foxtrot) cell, which I believe is something you shouldn't be doing in PySpark anyway.
Thanks for reading this, and especially to anyone who can answer the question :)
Answer 0 (score: 0)
You can union all the columns from Alpha to Foxtrot into a single dataframe with three columns: the numeric value, the row label from the variable column, and the name of the column the value came from. See the example below:
import random

# creating a dataframe similar to yours
columns = ['A', 'B', 'C', 'D', 'E', 'F']
# random values, with None on the diagonal
l = [[random.random() if c != r else None for c in range(6)] for r in range(6)]
# append the row label to each row; Spark auto-names this extra column _7
l = [x + [columns[i]] for i, x in enumerate(l)]
df = spark.createDataFrame(l, columns)
df.show()
Output:
+-------------------+--------------------+--------------------+--------------------+-------------------+-------------------+---+
| A| B| C| D| E| F| _7|
+-------------------+--------------------+--------------------+--------------------+-------------------+-------------------+---+
| null| 0.37958341713258026| 0.31880755415785833| 0.8908555547489883|0.41632799280431776| 0.0729721304772899| A|
|0.21814744903713268| null|0.024393462170815394| 0.9940573571339111| 0.7841527980918188| 0.194722179975509| B|
| 0.786507586894131| 0.9155528558183477| null| 0.5782381547037391| 0.9714912596343181| 0.5446460767903856| C|
| 0.9108497603580163| 0.5088891113970719| 0.35594300627798736| null| 0.514258802933162|0.19317616393798986| D|
| 0.193214269992123| 0.6259176088252493| 0.4425532657461867|0.050484163355697276| null| 0.6594661109179668| E|
| 0.5567272189587709|0.020606558131312402| 0.21905184240270814| 0.2817064382900445| 0.5409619970394691| null| F|
+-------------------+--------------------+--------------------+--------------------+-------------------+-------------------+---+
import pyspark.sql.functions as F

# start from column A, then union in the remaining columns one by one;
# union matches by position, so each select must be (value, row label, column name)
newdf = df.select(F.col('A').alias('value'), F.col('_7').alias('row'), F.lit('A').alias('column'))
for col in columns[1:]:
    newdf = newdf.union(df.select(col, '_7', F.lit(col)))
newdf.orderBy(newdf.value.desc()).show(3)
Output:
+------------------+---+------+
| value|row|column|
+------------------+---+------+
|0.9940573571339111| B| D|
|0.9714912596343181| C| E|
|0.9155528558183477| C| B|
+------------------+---+------+
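For what it's worth, the same long-format dataframe can also be built without the union loop by unpivoting with Spark SQL's stack() function. The following is only a sketch under the same assumptions as above (value columns A through F plus the auto-named row-label column _7), not part of the original answer:

import pyspark.sql.functions as F

# stack(6, 'A', A, 'B', B, ...) emits one (column, value) pair per cell
stack_expr = "stack({}, {}) as (column, value)".format(
    len(columns),
    ", ".join("'{0}', {0}".format(c) for c in columns),
)
long_df = df.select(F.col('_7').alias('row'), F.expr(stack_expr))

# desc_nulls_last pushes the null diagonal to the end, so the top 3 are real values
long_df.orderBy(F.col('value').desc_nulls_last()).show(3)

To get the top n rows back as Python objects instead of just printing them, replace .show(3) with .limit(3).collect().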