I have a dataframe like the following -
+----------+-------+-------------------------------------------------+
| WindowID | State | Details |
+----------+-------+-------------------------------------------------+
| 6 | SD | [[29916,3], [156570,4], [245934,1], [329748,8]] |
| 3 | CO | [[524586,2], [1548,3], [527220,1]] |
+----------+-------+-------------------------------------------------+
Now I want to sort each row of the Details column in descending order by the second element of each inner list. The result should be -
+----------+-------+-------------------------------------------------+
| WindowID | State | Details |
+----------+-------+-------------------------------------------------+
| 6 | SD | [[329748,8], [156570,4], [29916,3], [245934,1]] |
| 3 | CO | [[1548,3], [524586,2], [527220,1]] |
+----------+-------+-------------------------------------------------+
How can I do this in PySpark? Thanks in advance.
Answer 0 (score: 0)
I don't know what you have tried, but check out the solution below; it should work for you.
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, ArrayType
from pyspark.sql.functions import udf

dfSchema = StructType([StructField('WindowID', IntegerType(), True),
                       StructField('State', StringType(), True),
                       StructField('Details', ArrayType(ArrayType(IntegerType())), True)])

# ["WindowID", "State", "Details"]
mydf = sqlContext.createDataFrame([[6, 'SD', [[29916, 3], [156570, 4], [245934, 1], [329748, 8]]],
                                   [3, 'CO', [[524586, 2], [1548, 3], [527220, 1]]]], dfSchema)
mydf.show(truncate=False)
+--------+-----+---------------------------------------------------------------------------------------------------+
|WindowID|State|Details |
+--------+-----+---------------------------------------------------------------------------------------------------+
|6 |SD |[WrappedArray(29916, 3), WrappedArray(156570, 4), WrappedArray(245934, 1), WrappedArray(329748, 8)]|
|3 |CO |[WrappedArray(524586, 2), WrappedArray(1548, 3), WrappedArray(527220, 1)] |
+--------+-----+---------------------------------------------------------------------------------------------------+
def def_sort(x):
    # Sort the nested lists by their second element, in descending order
    return sorted(x, key=lambda a: a[1], reverse=True)

udf_sort = udf(def_sort, ArrayType(ArrayType(IntegerType())))

mydf.select("windowID", "State", udf_sort("Details")).show(truncate=False)
+--------+-----+---------------------------------------------------------------------------------------------------+
|windowID|State|PythonUDF#def_sort(Details) |
+--------+-----+---------------------------------------------------------------------------------------------------+
|6 |SD |[WrappedArray(329748, 8), WrappedArray(156570, 4), WrappedArray(29916, 3), WrappedArray(245934, 1)]|
|3 |CO |[WrappedArray(1548, 3), WrappedArray(524586, 2), WrappedArray(527220, 1)] |
+--------+-----+---------------------------------------------------------------------------------------------------+
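Note that the select above leaves the result column named PythonUDF#def_sort(Details). If you want to keep the original column name, you can alias the UDF output; a minimal sketch reusing the udf_sort defined above (the sorted_df variable name is just illustrative):

# Keep the original column name by aliasing the UDF output;
# mydf.withColumn("Details", udf_sort("Details")) works the same way.
sorted_df = mydf.select("WindowID", "State", udf_sort("Details").alias("Details"))
sorted_df.show(truncate=False)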
Answer 1 (score: 0)
I found a simple trick to solve this -
import operator

mydf = sqlContext.createDataFrame([[6, 'SD', [[29916, 3], [156570, 4], [245934, 1], [329748, 8]]],
                                   [3, 'CO', [[524586, 2], [1548, 3], [527220, 1]]]],
                                  ['WindowID', 'State', 'Details'])
mydf.show(truncate=False)
+----------+-------+-------------------------------------------------+
| WindowID | State | Details |
+----------+-------+-------------------------------------------------+
| 6 | SD | [[29916,3], [156570,4], [245934,1], [329748,8]] |
| 3 | CO | [[524586,2], [1548,3], [527220,1]] |
+----------+-------+-------------------------------------------------+
# Sort the nested lists per row via the RDD API, then convert back to a DataFrame
sorted_df = mydf.rdd.map(lambda x: [x[0], x[1],
                                    sorted(x[2], key=operator.itemgetter(1), reverse=True)]) \
                    .toDF(['WindowID', 'State', 'Details'])
sorted_df.show(truncate=False)
+----------+-------+-------------------------------------------------+
| WindowID | State | Details |
+----------+-------+-------------------------------------------------+
| 6 | SD | [[329748,8], [156570,4], [29916,3], [245934,1]] |
| 3 | CO | [[1548,3], [524586,2], [527220,1]] |
+----------+-------+-------------------------------------------------+
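If you are on Spark 3.0 or later, the RDD round trip can be avoided with the built-in array_sort higher-order function, which accepts a comparator lambda. A minimal sketch under that assumption (the comparator must return an integer, hence the cast; mydf is the DataFrame created above):

from pyspark.sql import functions as F

# Sort each nested array by its second element (index 1), descending,
# using the SQL array_sort comparator form available in Spark 3.0+.
sorted_df = mydf.withColumn(
    "Details",
    F.expr("array_sort(Details, (a, b) -> cast(b[1] - a[1] as int))"))
sorted_df.show(truncate=False)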