PySpark 1.6.2 | collect() after orderBy / sort

Date: 2017-05-02 13:16:47

Tags: apache-spark pyspark spark-dataframe

I don't understand the behavior of this simple PySpark code snippet:

# Create simple test dataframe
l = [('Alice', 1),('Pierre', 3),('Jack', 5), ('Paul', 2)]
df_test = sqlcontext.createDataFrame(l, ['name', 'age'])

# Sort by age descending, keep ages under 4, then take the 2 oldest
df_test = df_test.sort('age', ascending=False) \
                 .filter('age < 4') \
                 .limit(2)


df_test.show(2)
# This outputs, as expected:
# +------+---+
# |  name|age|
# +------+---+
# |Pierre|  3|
# |  Paul|  2|
# +------+---+

df_test.collect()
# This unexpectedly outputs:
# [Row(name=u'Pierre', age=3), Row(name=u'Alice', age=1)]

Is this the expected behavior of the collect() function? How can I retrieve the column as a list that preserves the correct order?
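(For reference, a minimal sketch of one possible workaround, which is an assumption on my part rather than part of the original post: rebuild the dataframe, filter first, and make the sort the last transformation before limit(), so that limit() and collect() operate on a globally ordered result.)

# Hypothetical fix sketch: filter first, sort last, then limit
df_raw = sqlcontext.createDataFrame(l, ['name', 'age'])
df_top2 = df_raw.filter('age < 4') \
                .sort('age', ascending=False) \
                .limit(2)
df_top2.collect()
# Expected: [Row(name=u'Pierre', age=3), Row(name=u'Paul', age=2)]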

Thanks

1 Answer:

Answer 0: (score: 0)

I had to work around this issue using a sorter UDF:

from pyspark.sql.functions import udf

def sorter(l):
    import operator
    # Sort the (key, value) pairs by key
    res = sorted(l, key=operator.itemgetter(0))
    # Keep only the values, in key order
    L1 = [item[1] for item in res]
    # return " ".join(str(x) for x in L1)
    return "".join(L1)

sort_udf = udf(sorter)
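As a hypothetical illustration (the pairs input below is my own example data, not from the original answer), sorter expects a list of (key, value) pairs and returns the values concatenated in key order:

# Plain-Python check of sorter on made-up (key, value) pairs:
pairs = [(2, 'b'), (3, 'c'), (1, 'a')]
print(sorter(pairs))  # -> 'abc'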