Pyspark - Sorting a DataFrame column that contains a list of lists

Time: 2017-04-12 09:48:11

Tags: python-3.x apache-spark dataframe

I have a DataFrame like the following:

+----------+-------+-------------------------------------------------+
| WindowID | State |                                         Details |
+----------+-------+-------------------------------------------------+
|        6 |    SD | [[29916,3], [156570,4], [245934,1], [329748,8]] |
|        3 |    CO |              [[524586,2], [1548,3], [527220,1]] |
+----------+-------+-------------------------------------------------+

Now I want to sort the Details column within each row, in descending order, by the second element of each inner list. The result should be:

+----------+-------+-------------------------------------------------+
| WindowID | State |                                         Details |
+----------+-------+-------------------------------------------------+
|        6 |    SD | [[329748,8], [156570,4], [29916,3], [245934,1]] |
|        3 |    CO |              [[1548,3], [524586,2], [527220,1]] |
+----------+-------+-------------------------------------------------+

How can I do this in pyspark? Thanks in advance.

2 answers:

Answer 0: (score: 0)

I don't know what you have tried, but check the solution below; it should work for you.

from pyspark.sql.types import StructType, StructField, IntegerType, StringType, ArrayType
from pyspark.sql.functions import udf

# Explicit schema: Details is an array of integer arrays.
dfSchema = StructType([StructField('WindowID', IntegerType(), True),
                       StructField('State', StringType(), True),
                       StructField('Details', ArrayType(ArrayType(IntegerType())), True)])
mydf = sqlContext.createDataFrame([[6, 'SD', [[29916,3], [156570,4], [245934,1], [329748,8]]],
                                   [3, 'CO', [[524586,2], [1548,3], [527220,1]]]], dfSchema)
mydf.show(truncate = False)

+--------+-----+---------------------------------------------------------------------------------------------------+
|WindowID|State|Details                                                                                            |
+--------+-----+---------------------------------------------------------------------------------------------------+
|6       |SD   |[WrappedArray(29916, 3), WrappedArray(156570, 4), WrappedArray(245934, 1), WrappedArray(329748, 8)]|
|3       |CO   |[WrappedArray(524586, 2), WrappedArray(1548, 3), WrappedArray(527220, 1)]                          |
+--------+-----+---------------------------------------------------------------------------------------------------+

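def def_sort(x):
    # Sort the inner lists by their second element, largest first.
    return sorted(x, key=lambda pair: pair[1], reverse=True)

# Wrap the function as a UDF that returns an array of integer arrays.
udf_sort = udf(def_sort, ArrayType(ArrayType(IntegerType())))
mydf.select("windowID", "State", udf_sort("Details")).show(truncate = False)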


+--------+-----+---------------------------------------------------------------------------------------------------+
|windowID|State|PythonUDF#def_sort(Details)                                                                        |
+--------+-----+---------------------------------------------------------------------------------------------------+
|6       |SD   |[WrappedArray(329748, 8), WrappedArray(156570, 4), WrappedArray(29916, 3), WrappedArray(245934, 1)]|
|3       |CO   |[WrappedArray(1548, 3), WrappedArray(524586, 2), WrappedArray(527220, 1)]                          |
+--------+-----+---------------------------------------------------------------------------------------------------+
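The sorting logic inside the UDF can be checked in plain Python before it is handed to Spark. The sketch below is only an illustration: the name sort_details is hypothetical, and the None check is an added assumption (a Python UDF receives None for a null column value, and sorted(None) would raise a TypeError).

```python
def sort_details(details):
    """Sort inner [id, count] lists by their second element, largest first.

    Returns None unchanged, since a Python UDF is passed None for null cells
    (this guard is an assumption added here, not part of the original answer).
    """
    if details is None:
        return None
    return sorted(details, key=lambda pair: pair[1], reverse=True)


# Sample values copied from the question's Details column.
sd = [[29916, 3], [156570, 4], [245934, 1], [329748, 8]]
co = [[524586, 2], [1548, 3], [527220, 1]]

print(sort_details(sd))
print(sort_details(co))
print(sort_details(None))
```

If this behaviour is what you want, the function could be wrapped the same way as def_sort, with udf(sort_details, ArrayType(ArrayType(IntegerType()))); the generated column name would then differ from the one shown above.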

Answer 1: (score: 0)

I found a simple trick to solve this problem:

import operator

# Keep the DataFrame in mydf; show() returns None, so call it separately.
mydf = sqlContext.createDataFrame([[6, 'SD', [[29916,3], [156570,4], [245934,1], [329748,8]]],
           [3, 'CO', [[524586,2], [1548,3], [527220,1]]]],
           ['WindowID', 'State', 'Details'])
mydf.show(truncate=False)

+----------+-------+-------------------------------------------------+
| WindowID | State |                                         Details |
+----------+-------+-------------------------------------------------+
|        6 |    SD | [[29916,3], [156570,4], [245934,1], [329748,8]] |
|        3 |    CO |              [[524586,2], [1548,3], [527220,1]] |
+----------+-------+-------------------------------------------------+

# Sort each row's Details by the second element of every inner list, descending.
sorted_df = mydf.rdd.map(lambda x: [x[0], x[1], sorted(x[2],
          key=operator.itemgetter(1), reverse=True)]) \
          .toDF(['WindowID', 'State', 'Details'])
sorted_df.show(truncate=False)

+----------+-------+-------------------------------------------------+
| WindowID | State |                                         Details |
+----------+-------+-------------------------------------------------+
|        6 |    SD | [[329748,8], [156570,4], [29916,3], [245934,1]] |
|        3 |    CO |              [[1548,3], [524586,2], [527220,1]] |
+----------+-------+-------------------------------------------------+
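To see what the mapping step does to a single row, the lambda can be written as a named function and run on ordinary tuples, with no Spark session. This is a sketch under assumptions: sort_row is a hypothetical name, and the rows are plain tuples standing in for the Row objects that rdd.map would pass.

```python
import operator


def sort_row(row):
    """Return [WindowID, State, Details] with Details sorted by the second
    element of each inner list, in descending order."""
    window_id, state, details = row
    return [window_id, state,
            sorted(details, key=operator.itemgetter(1), reverse=True)]


# Plain tuples standing in for the DataFrame rows from the question.
rows = [
    (6, 'SD', [[29916, 3], [156570, 4], [245934, 1], [329748, 8]]),
    (3, 'CO', [[524586, 2], [1548, 3], [527220, 1]]),
]

sorted_rows = [sort_row(r) for r in rows]
for r in sorted_rows:
    print(r)
```

A named function like this could be passed to rdd.map in place of the lambda, which makes the per-row logic easier to test on its own.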